There are a few problems with the current way that the client hashes Python objects:

Classes can be repopulated even if their code has changed. We should add classes as code dependencies.
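As a rough sketch of what treating a class as a code dependency could look like, we could fold a digest of the class's source into the picture. `class_code_digest`, `inspect.getsource` and the third-party `blake3` package are illustrative choices here, not necessarily what the client does:

```python
import inspect

from blake3 import blake3  # third-party BLAKE3 bindings


def class_code_digest(cls: type) -> str:
    """Digest a class's qualified name together with its source text, so that
    editing the class definition invalidates anything that depends on it."""
    h = blake3()
    h.update(f"{cls.__module__}.{cls.__qualname__}".encode())
    try:
        h.update(inspect.getsource(cls).encode())
    except (OSError, TypeError):
        # Builtins and extension types have no retrievable source;
        # fall back to the qualified name alone.
        pass
    return h.hexdigest()
```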
The hashing code is doing something very similar to pickling: take a Python object and serialise it to a bytestream. The bytestream is then digested using BLAKE3 instead of being written to a file. We should define digest-equality as just meaning that two objects' pickle-digests are the same.
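A minimal sketch of that pipeline, assuming the third-party `blake3` package (`hashlib.blake2b` would work the same way): the pickler is given a file-like object whose `write` feeds the hasher instead of accumulating bytes.

```python
import pickle

from blake3 import blake3  # third-party BLAKE3 bindings


class HashWriter:
    """File-like object that feeds every chunk the pickler writes into a hasher."""

    def __init__(self):
        self.hasher = blake3()

    def write(self, data):
        self.hasher.update(data)
        return len(data)


def object_digest(obj) -> str:
    """Pickle `obj` straight into the hasher and return the stream's hex digest."""
    writer = HashWriter()
    pickle.Pickler(writer, protocol=5).dump(obj)
    return writer.hasher.hexdigest()
```

Digest-equality is then just string equality of `object_digest` results. Note that this naive version also hashes the frame opcodes, which the caveats below touch on.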
Some caveats:
- Pickling doesn't necessarily map an object to a unique bytestream. E.g. `{x, y}` and `{y, x}` might pickle their elements in a different order. Solution: custom reductor dispatch for sets and dicts (see the sketch after this list). Note that we might have `x == y` but `hash(x) ≠ hash(y)`; that's OK.
- Pickling uses framed streams; the hasher should ignore the frames.
- Out-of-band data might bypass the main stream. Solution: make the hasher aware of out-of-band buffers.
- Pickling doesn't support lambdas, closures and other weird things. Solution: use external objects to represent versioned code (a qualname × digest pair); see the sketch after this list. Need to look into how pickle does closures.
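To make the first and last caveats concrete, here is one way the reductions could be hooked in. This is only a sketch: it subclasses the pure-Python pickler (`pickle._Pickler`, a private name) because the C pickler handles exact builtin containers on a fast path that can bypass `reducer_override`, and `CodeRef`, `code_digest` and the sort-by-`repr` keys are placeholders rather than HitSave's actual scheme.

```python
import pickle
import types
from dataclasses import dataclass

from blake3 import blake3  # third-party BLAKE3 bindings


@dataclass(frozen=True)
class CodeRef:
    """External stand-in for a function: a qualname × digest pair."""
    qualname: str
    digest: str


def code_digest(fn) -> str:
    # Rough digest of a function's code. A real version would also fold in
    # closure cells, default arguments and referenced globals.
    h = blake3()
    h.update(fn.__code__.co_code)
    h.update(repr(fn.__code__.co_consts).encode())
    return h.hexdigest()


class DeterministicPickler(pickle._Pickler):
    def reducer_override(self, obj):
        # Sets and dicts: emit their contents in a deterministic order.
        # Sorting by repr is a simplification; sorting by each element's own
        # digest would be more robust.
        if type(obj) in (set, frozenset):
            return (type(obj), (sorted(obj, key=repr),))
        if type(obj) is dict:
            return (dict, (sorted(obj.items(), key=lambda kv: repr(kv[0])),))
        # Lambdas, closures and ordinary functions: reduce to a versioned-code
        # reference instead of trying to pickle the code object itself.
        if isinstance(obj, types.FunctionType):
            return (CodeRef, (obj.__qualname__, code_digest(obj)))
        return NotImplemented  # everything else pickles as normal
```

A digest function would then construct `DeterministicPickler(writer)` in place of `pickle.Pickler(writer)` in the sketch above.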
Extra feature: we can represent objects that are already saved as external references, saving space. E.g. we could have a memo'd function return `DataLoader(my_dataset)`, and the pickle could re-use the dataset's pickle blob.
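Pickle's documented hook for exactly this kind of external reference is `persistent_id` on the pickler (paired with `persistent_load` on the unpickler). A sketch, where `saved_digests` and the `"hitsave-blob"` tag are made-up names for illustration:

```python
import pickle


class BlobAwarePickler(pickle.Pickler):
    """Replace objects that already have a stored blob with a short reference
    to that blob's digest, instead of re-pickling them."""

    def __init__(self, file, saved_digests):
        super().__init__(file, protocol=5)
        # Hypothetical map from id(obj) -> digest of the blob already stored.
        self.saved_digests = saved_digests

    def persistent_id(self, obj):
        digest = self.saved_digests.get(id(obj))
        if digest is not None:
            # Written to the stream as a persistent reference, not as the
            # object's full pickled form.
            return ("hitsave-blob", digest)
        return None  # not stored yet: pickle inline as usual
```

On the loading side, an `Unpickler` subclass would implement `persistent_load` to fetch the referenced blob, so `DataLoader(my_dataset)` could pickle down to the loader's own state plus one short reference to the dataset's existing blob.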
With all of these adjustments I think we will have a nice, performant object digester that works well with any picklable object.
We still won't have the object's digest be the same as the digest of its pickle blob, but I can imagine later modifying pickle to be HitSave-aware: