pudo / dataset

Easy-to-use data handling for SQL data stores with support for implicit table creation, bulk loading, and transactions.
https://dataset.readthedocs.org/
MIT License
4.76k stars 297 forks

Support for storing Numpy arrays as LargeBinary objects #254

Closed. davidkleiven closed this 4 years ago.

davidkleiven commented 6 years ago

It is useful to be able to store Numpy arrays as large binary objects in a database. Here is an implementation that uses Numpy's native binary format (.npy) to convert an array to a binary string and put it into the database. When reading entries from the database, a function checks whether the binary string starts with \x93NUMPY (the first six bytes of the .npy format) and, if so, converts the string back to a Numpy array.
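
A minimal sketch of that approach (illustrative only; the helper names here are hypothetical, not the PR's actual code):

import io

import numpy as np

NUMPY_MAGIC = b"\x93NUMPY"  # every .npy file begins with these six bytes

def array_to_blob(arr):
    # Serialize the array using Numpy's native .npy format.
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()

def blob_to_value(blob):
    # Restore an array if the blob carries the .npy magic bytes;
    # otherwise return the raw bytes unchanged.
    if blob.startswith(NUMPY_MAGIC):
        return np.load(io.BytesIO(blob))
    return blob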

pudo commented 6 years ago

Thanks for this pull request; I've been hoping we would get around to improving blob support in dataset for a while now. One question: can this be done in a way that doesn't incur a hard dependency on numpy? It's a fairly large binary dependency and it's not required for dataset to function -- perhaps we can fake it by checking for import errors when numpy is not installed?

davidkleiven commented 6 years ago

Good point; it is better not to introduce an additional dependency on Numpy. I am thinking about using this package for some scientific applications where we use Numpy arrays a lot, so support for them would be very nice. I will look into this and probably introduce some placeholders so that the code works when numpy is not installed.

pudo commented 6 years ago

To be clear: I'm very happy to merge this, but it would just be cool to hide numpy dependencies in helpers like this:

def is_numpy_array(obj):
    try:
        from numpy import ndarray
        return isinstance(obj, ndarray)
    except ImportError:
        return False

davidkleiven commented 6 years ago

Yes, this is indeed a good solution. I also created placeholders for numpy.load and numpy.save for the case when numpy is not installed, so the only place that tries to import numpy is the file numpy_util.py. I was not completely sure how to set up Travis so that it tests both with and without numpy installed, so I created an environment variable, DEACTIVATE_NUMPY, that can be used to explicitly deactivate numpy; I am not sure if that is the right way to do it.
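
Roughly, the idea looks like this (a sketch only; the DEACTIVATE_NUMPY name comes from this thread, everything else is illustrative):

# numpy_util.py -- illustrative sketch, not the PR's actual module.
# All numpy imports are confined to this file; DEACTIVATE_NUMPY simulates
# a missing numpy install for testing.
import os

try:
    if os.environ.get("DEACTIVATE_NUMPY"):
        raise ImportError("numpy support explicitly deactivated")
    import numpy as np
    HAS_NUMPY = True
except ImportError:
    np = None
    HAS_NUMPY = False

def save(fileobj, arr):
    # Placeholder that fails loudly when numpy is unavailable.
    if not HAS_NUMPY:
        raise RuntimeError("numpy is not installed")
    np.save(fileobj, arr)

def load(fileobj):
    if not HAS_NUMPY:
        raise RuntimeError("numpy is not installed")
    return np.load(fileobj)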

ojongerius commented 6 years ago

Nice work! @pudo / @davidkleiven, is this good to merge once the merge conflicts have been resolved?

ojongerius commented 6 years ago

@pudo happy to merge?

stefanw commented 6 years ago

I generally like the idea of supporting some more convenient datatypes, but a couple of questions arise for me:

davidkleiven commented 6 years ago

@stefanw your point about DEACTIVATE_NUMPY is a much more elegant solution. The thing is that I didn't manage to do this: numpy is not explicitly installed in the .travis.yml file, so I believe it gets installed by default (I might be wrong). If I could get some suggestions on how to install it conditionally, I will fix this.

stefanw commented 6 years ago

@davidkleiven You could do something similar to the before_script: section in our .travis.yml, but with a pip install numpy instead of the psql command. It's not pretty, but it works.
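
For example (a hypothetical excerpt, not the project's actual .travis.yml), a build matrix keyed on DEACTIVATE_NUMPY could run the suite once with and once without numpy:

env:
  - DEACTIVATE_NUMPY=
  - DEACTIVATE_NUMPY=1

before_script:
  # Install numpy only for the matrix entry that exercises it.
  - if [ -z "$DEACTIVATE_NUMPY" ]; then pip install numpy; fi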

davidkleiven commented 6 years ago

@stefanw I liked the idea of having an explicit config variable to enable numpy support, which would then invoke convert_blob. In that case convert_row would need access to that variable; do you have any thoughts on how we could achieve that?
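
One possibility (a sketch only: convert_row and convert_blob are named in this thread, but the wiring below is hypothetical) is a module-level flag that convert_row consults:

import io

USE_NUMPY_BLOBS = False  # hypothetical switch, e.g. set via a connect() option

NUMPY_MAGIC = b"\x93NUMPY"

def convert_blob(value):
    # Decode .npy bytes back into an array; pass other values through.
    if isinstance(value, bytes) and value.startswith(NUMPY_MAGIC):
        import numpy as np
        return np.load(io.BytesIO(value))
    return value

def convert_row(row):
    if row is None:
        return None
    row = dict(row)
    if USE_NUMPY_BLOBS:
        # Only attempt blob conversion when the feature is switched on.
        row = {key: convert_blob(value) for key, value in row.items()}
    return row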