pudo / dataset

Easy-to-use data handling for SQL data stores with support for implicit table creation, bulk loading, and transactions.
https://dataset.readthedocs.org/
MIT License
4.76k stars 297 forks

Support for storing Numpy arrays as LargeBinary objects #254

Closed. davidkleiven closed this 4 years ago.

davidkleiven commented 6 years ago

It is useful to be able to store Numpy arrays as large binary objects in a database. Here is an implementation that uses Numpy's native binary format (.npy) to convert an array to a binary string and put it into the database. When reading entries from the database, a function checks whether the binary string starts with \x93NUMPY (the first six bytes of the .npy format) and, if so, converts the string back to a Numpy array.
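
A minimal sketch of that approach (illustrative only; the helper names here are hypothetical, not the PR's actual code):

import io

import numpy as np

NUMPY_MAGIC = b"\x93NUMPY"  # every .npy file begins with these six bytes

def array_to_blob(arr):
    # Serialize the array using Numpy's native .npy format.
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()

def blob_to_value(blob):
    # Restore an array if the blob carries the .npy magic bytes;
    # otherwise return the raw bytes unchanged.
    if blob.startswith(NUMPY_MAGIC):
        return np.load(io.BytesIO(blob))
    return blob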

pudo commented 6 years ago

Thanks for this pull request; I've been hoping we would get around to improving blob support in dataset for a while now. One question: can this be done in a way that doesn't incur a hard dependency on numpy? It's a fairly large binary dependency and it's not required for dataset to function -- perhaps we can fake it by checking for import errors when numpy is not installed?

davidkleiven commented 6 years ago

Good point; it is better not to introduce an additional dependency on Numpy. I am thinking about using this package for some scientific applications where we use Numpy arrays a lot, so support for them would be very nice. I will look into this and probably introduce some placeholders so that the code works when numpy is not installed.

pudo commented 6 years ago

To be clear: I'm very happy to merge this, but it would just be cool to hide numpy dependencies in helpers like this:

def is_numpy_array(obj):
    try:
        from numpy import ndarray
        return isinstance(obj, ndarray)
    except ImportError:
        return False

davidkleiven commented 6 years ago

Yes, this is indeed a good solution. I also created placeholders for numpy.load and numpy.save for the case when numpy is not installed, so the only place that tries to import numpy is the file numpy_util.py. I was not completely sure how to set up Travis so that it tests both with and without numpy installed, so I created an environment variable, DEACTIVATE_NUMPY, that can be used to explicitly deactivate numpy; I am not sure if that is the right way to do it.
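
Roughly, the idea looks like this (a sketch only; the DEACTIVATE_NUMPY name comes from this thread, everything else is illustrative):

# numpy_util.py -- illustrative sketch, not the PR's actual module.
# All numpy imports are confined to this file; DEACTIVATE_NUMPY simulates
# a missing numpy install for testing.
import os

try:
    if os.environ.get("DEACTIVATE_NUMPY"):
        raise ImportError("numpy support explicitly deactivated")
    import numpy as np
    HAS_NUMPY = True
except ImportError:
    np = None
    HAS_NUMPY = False

def save(fileobj, arr):
    # Placeholder that fails loudly when numpy is unavailable.
    if not HAS_NUMPY:
        raise RuntimeError("numpy is not installed")
    np.save(fileobj, arr)

def load(fileobj):
    if not HAS_NUMPY:
        raise RuntimeError("numpy is not installed")
    return np.load(fileobj)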

ojongerius commented 6 years ago

Nice work! @pudo / @davidkleiven, is this good to merge once the merge conflicts have been resolved?

ojongerius commented 6 years ago

@pudo happy to merge?

stefanw commented 6 years ago

I generally like the idea of supporting some more convenient datatypes, but a couple of questions arise for me:

davidkleiven commented 6 years ago

@stefanw your point about DEACTIVATE_NUMPY is a much more elegant solution. The thing is that I didn't manage to do this: numpy is not explicitly installed in the .travis.yml file, so I believe it gets installed by default (I might be wrong). If I could get some suggestions on how to install it conditionally, I will fix this.

stefanw commented 6 years ago

@davidkleiven You could do something similar to the before_script: section in our .travis.yml, but with a pip install numpy instead of the psql command. It's not pretty, but it works.
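
For example (a hypothetical excerpt, not the project's actual .travis.yml), a build matrix keyed on DEACTIVATE_NUMPY could run the suite once with and once without numpy:

env:
  - DEACTIVATE_NUMPY=
  - DEACTIVATE_NUMPY=1

before_script:
  # Install numpy only for the matrix entry that exercises it.
  - if [ -z "$DEACTIVATE_NUMPY" ]; then pip install numpy; fi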

davidkleiven commented 6 years ago

@stefanw I liked the idea of having an explicit config variable to enable numpy support, which would then invoke convert_blob. In that case convert_row would need access to that variable; do you have any thoughts on how we could achieve that?
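
One possibility (a sketch only: convert_row and convert_blob are named in this thread, but the wiring below is hypothetical) is a module-level flag that convert_row consults:

import io

USE_NUMPY_BLOBS = False  # hypothetical switch, e.g. set via a connect() option

NUMPY_MAGIC = b"\x93NUMPY"

def convert_blob(value):
    # Decode .npy bytes back into an array; pass other values through.
    if isinstance(value, bytes) and value.startswith(NUMPY_MAGIC):
        import numpy as np
        return np.load(io.BytesIO(value))
    return value

def convert_row(row):
    if row is None:
        return None
    row = dict(row)
    if USE_NUMPY_BLOBS:
        # Only attempt blob conversion when the feature is switched on.
        row = {key: convert_blob(value) for key, value in row.items()}
    return row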