Closed · davidkleiven closed 4 years ago
Thanks for this pull request, I've been hoping that we get around to improving blob support in dataset for a while now. One question: can this be done in a way that doesn't incur a hard dependency on numpy? That's a fairly large binary dependency and it's not required for dataset to function -- perhaps we can fake it somehow by having some checks for import errors when numpy is not installed?
Good point, it is better not to introduce the additional dependency on Numpy. I am thinking about using this package for some scientific applications, and we use Numpy arrays a lot, so support for Numpy arrays would be a very nice thing. I will look into this and probably introduce some placeholders such that the code works when numpy is not installed.
To be clear: I'm very happy to merge this, but it would just be cool to hide numpy dependencies in helpers like this:
def is_numpy_array(obj):
    try:
        from numpy import ndarray
        return isinstance(obj, ndarray)
    except ImportError:
        return False
Yes, this is indeed a good solution. I also created placeholders for numpy.load and numpy.save in case numpy is not installed, such that the only place that tries to import numpy is the file numpy_util.py. I was not completely sure how to deal with Travis such that it tests that the code works both with and without numpy installed, so I created an environment variable DEACTIVATE_NUMPY that can be used to explicitly deactivate numpy, but I am not sure if that is the right way to do it.
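The placeholder pattern might look roughly like this (an illustrative sketch of a numpy_util.py module, not the PR's exact code; the HAS_NUMPY flag is a hypothetical name):

```python
# numpy_util.py -- the only module that imports numpy directly.
# Elsewhere the code imports load/save/ndarray from here, so a missing
# numpy only fails at the point where array support is actually used.
try:
    from numpy import load, save, ndarray
    HAS_NUMPY = True
except ImportError:
    HAS_NUMPY = False
    ndarray = None  # placeholder so isinstance checks can be guarded

    def load(*args, **kwargs):
        raise RuntimeError("numpy is required for this operation")

    def save(*args, **kwargs):
        raise RuntimeError("numpy is required for this operation")
```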
Nice work! @pudo / @davidkleiven is this good to merge after merge conflicts have been resolved?
@pudo happy to merge?
I generally like the idea of supporting some more convenient datatypes. But a couple of questions arise for me:

- DEACTIVATE_NUMPY: Can't Travis just conditionally install numpy based on the value of the var?
- Is convert_blobs run on all row values or only when numpy is actually installed? Could there be some explicit config activation or hook install to enable numpy support?
- What about generic serialization via pickle or marshal?

@stefanw your point with DEACTIVATE_NUMPY is a much more elegant solution. The thing is that I didn't manage to do this: numpy is not explicitly installed in the .travis.yml file, so I believe that it is installed by default (I might be wrong). If I could get some suggestions on how to install it conditionally, I will fix this.
convert_blobs: Yes, it could try to import numpy. If it has numpy it converts blobs; if not, it just returns the original object.
One question concerning using pickle to get a string representation of the object: does pickle still work if you transfer the database to a computer that uses Python 3 instead of Python 2? Or if one updates the numpy version?
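The convert_blobs fallback described above could be sketched like this (convert_blob is a hypothetical name, not necessarily the PR's actual API):

```python
import io

def convert_blob(value):
    """Decode numpy blobs when numpy is available; pass everything else through."""
    try:
        import numpy as np
    except ImportError:
        return value  # no numpy installed: hand back the raw blob unchanged
    # b"\x93NUMPY" is the magic prefix of numpy's native .npy format
    if isinstance(value, bytes) and value.startswith(b"\x93NUMPY"):
        return np.load(io.BytesIO(value))
    return value
```

Either way the caller gets a usable value back, so rows without arrays are unaffected.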
@davidkleiven You could do something similar to the before_script: section in our .travis.yml, but with a pip install numpy instead of the psql command. It's not pretty, but it works.
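A conditional install along those lines might look like this in before_script (a sketch assuming the DEACTIVATE_NUMPY variable discussed above, not the project's actual config):

```shell
# Install numpy only when the build has not explicitly deactivated it,
# so one Travis job exercises the numpy path and another the fallback.
if [ -z "$DEACTIVATE_NUMPY" ]; then
    pip install numpy
fi
```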
I would call the numpy check earlier, so it doesn't need to check the blobs every time.
I guess we shouldn't go down the generic serialization route here. Too many portability and security concerns.
@stefanw I liked the idea of having an explicit config variable to enable numpy support, and thereby invoking convert_blob. In that case convert_row would need to have access to that variable, do you have any thoughts on how we could achieve that?
It is useful to be able to store Numpy arrays as large binary objects in a database. Here is an implementation that uses Numpy's native binary format to convert an array to a binary string and put it into the database. When reading entries from the database, a function checks whether the binary string starts with \x93NUMPY (the first 6 bytes of any serialized numpy array) and, if so, converts the string back to a numpy array.
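A minimal sketch of that round-trip, assuming hypothetical helper names encode_array and decode_blob (the PR may name and structure these differently):

```python
import io

NUMPY_MAGIC = b"\x93NUMPY"  # first 6 bytes of numpy's native .npy format

def encode_array(arr):
    """Serialize a numpy array to a binary string via numpy's .npy format."""
    import numpy as np
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()

def decode_blob(blob):
    """Turn a stored blob back into an array when it carries the magic prefix.

    Returns the blob unchanged when it is not a serialized array or when
    numpy is not installed, so plain binary columns keep working.
    """
    if not isinstance(blob, bytes) or not blob.startswith(NUMPY_MAGIC):
        return blob
    try:
        import numpy as np
    except ImportError:
        return blob
    return np.load(io.BytesIO(blob))
```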