TylerSpears opened this issue 2 years ago (status: Open)
Thanks for sharing this @TylerSpears.
We originally picked pickle protocol 3 for backwards compatibility, but as you point out it has this limitation on data size.
Changing the pickle protocol is also a breaking change, since it changes many hashes for values.
We have gathered several design improvements that break hashes and are considering if some future redun version should adopt them all in one go, but haven't settled on a plan just yet.
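The cache-breakage concern can be seen with plain pickle: the serialized bytes differ between protocols (the protocol number is encoded in the pickle header, and protocol 4 adds framing), so any hash derived from those bytes changes too. A minimal stdlib sketch of the effect (not redun's actual hashing code):

```python
import hashlib
import pickle

def value_hash(value, protocol):
    # Hash the pickled bytes, mimicking a cache key derived from serialization.
    return hashlib.sha256(pickle.dumps(value, protocol=protocol)).hexdigest()

value = {"weights": list(range(10))}
h3 = value_hash(value, protocol=3)
h4 = value_hash(value, protocol=4)
print(h3 == h4)  # False: bumping the protocol invalidates every cached hash
```

This is why a protocol bump effectively invalidates an existing cache wholesale.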
Thanks for sharing some suggestions.
> Switching to a more robust serialization tool like dill or cloudpickle (same problems as before with cache breakage)
My understanding of cloudpickle is that it is appropriate for over-the-wire serialization, but is problematic for persistence (to a db) because it requires the same python interpreter version for serialization and deserialization.
> Perhaps redun can wrap large objects such that only their hashes are included in the upstream expressions (which would require Value-level serialization beyond pickle 3).
This is a good idea that we have considered for other benefits as well. Basically, we have thought through a ValueRef-like object that references the large value. The large value is fetched lazily as needed, so expressions containing the ValueRef can remain small.
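A ValueRef along these lines might look like the following sketch. All names here are hypothetical (this is not redun's API): the expression carries only a content hash and a storage path, and the payload is deserialized lazily on first access.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

class ValueRef:
    """Hypothetical lazy reference: expressions carry only hash + location."""

    def __init__(self, path: str, digest: str):
        self.path = path
        self.digest = digest
        self._cached = None

    @classmethod
    def store(cls, value, directory: str) -> "ValueRef":
        # Serialize once, name the file by content hash, return a small ref.
        data = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
        digest = hashlib.sha256(data).hexdigest()
        path = Path(directory) / digest
        path.write_bytes(data)
        return cls(str(path), digest)

    def get(self):
        # Fetch the large value only when actually needed.
        if self._cached is None:
            self._cached = pickle.loads(Path(self.path).read_bytes())
        return self._cached

# Usage: only the small ref (path + hash) flows through expressions.
with tempfile.TemporaryDirectory() as tmp:
    ref = ValueRef.store(list(range(1000)), tmp)
    print(ref.get()[:3])  # [0, 1, 2]
```

Hashing the serialized bytes (rather than the object) also keeps the expression hash stable regardless of how the payload is later fetched.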
> I've also tried making a custom `ProxyValue`

Subclassing `ProxyValue` is the official way to customize serialization, hashing, etc. We have one builtin class, `FileCache`, which stores the serialization in a File. That may work for you or inspire a solution. Looking at that code again, I see it needs a `get_hash()` method added that avoids calling `pickle_dumps`, but that is an easy addition. Let me know if this class gives you a sufficient workaround until the pickle version is upgraded.
Just chiming in to say my team would also benefit from a ValueRef type of object, both for large objects, and also for objects with class dependencies not installed in the environment where the main redun scheduler is running.
I am building a data pipeline with redun where large `numpy.ndarray`s are passed through different task functions. However, I can very easily hit this overflow exception: `OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher`.

The cause of this issue seems straightforward: redun limits the pickle protocol to version 3 (https://github.com/insitro/redun/blob/0cd06c8147700f67b777b5a43a6d3e3925274bff/redun/utils.py#L40), which cannot serialize objects beyond 4 GiB in size. Whenever redun tries to serialize a decent-sized numpy array, neither the `pickle_dumps` nor the `pickle_dump` function can handle the object. It seems like this serialization is used in value hashing, so any use of a large ndarray breaks redun even without caching the arrays. I can provide code to re-create this, but it's pretty easy to do.

Are there any plans to fix this constraint? Passing large arrays/tensors beyond 4 GiB in size is very common, and this seems like a blindspot in redun that would be hit frequently. It's certainly the largest blocker to us being able to use redun, which we would love to do given how elegant and straightforward the library is so far (and the lovely `script` tasks!).

Some ideas for solutions:
I'm not really an expert in this, so perhaps these ideas are not workable in redun.
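For context on where the 4 GiB ceiling comes from: protocol 3's `BINBYTES` opcode stores the payload length in a 4-byte unsigned field, while protocol 4 added `BINBYTES8` with an 8-byte length field. The opcode choice can be inspected with `pickletools` without allocating 4 GiB (illustrative sketch):

```python
import pickle
import pickletools

# 300 bytes is enough to avoid SHORT_BINBYTES (used for lengths < 256).
payload = b"x" * 300
ops = [op.name for op, arg, pos in pickletools.genops(pickle.dumps(payload, protocol=3))]
print("BINBYTES" in ops)  # True: 4-byte length field, hence the 4 GiB cap
# Protocol 4+ would switch to BINBYTES8 (8-byte length) only for payloads
# larger than 4 GiB; small payloads still use BINBYTES.
```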
Another easy(er) workaround would be to use file-based objects, but this would lead to tons of file I/O, and all arrays would have to be stored on the hard drive, even ones that don't need to be cached. Not to mention that the constant "read file -> perform operation -> save file" loop is very clunky and makes code unreadable and inflexible. I've also tried making a custom `ProxyValue` class that did custom serialization, but I just kept running in circles between the type registry, the argument preprocessing, and the differences between `serialization`, `deserialization`, `__setstate__`, `__getstate__`, etc.

I don't think my software version matters, but just to be thorough: