richfitz opened 6 years ago
I think it would be extremely useful to have RDS-like drivers that overcome the serialization bottleneck. Would it be feasible to store unserialized binary blobs instead of RDS files? Can we leverage `fst` or Arrow for large data frames and still accommodate arbitrary data objects elsewhere in the same `storr`? Does `thor` make some of these points moot?
Unserialized binary blobs don't really exist: there is no linear memory map for all but the simplest structures. `fst` is only going to work if every object serialised is a data.frame. Arrow is a possibility (it's the technology behind `feather`, I think), but that would require waiting for a significant amount of work to be done. `thor` still requires serialising R objects. It's probably a little faster than rds with compression turned off, but it still has to pay the cost of serialising.
It's ultimately a performance/generality tradeoff. If there is a storr backend that serialises only simple types (atomic types, lists, and therefore data.frames), it will choke as soon as something adds an exotic attribute.
What could possibly be done by someone sufficiently motivated would be to write replacements for `readRDS` and `saveRDS` that dispatch on type to efficiently serialise the most common structures (though I suspect that only `data.frame` objects will see a big saving here). The reader would need to check the magic number of the files before reading them in. With that in place, a driver that directly extended the rds one would be trivial to write.
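As a rough illustration of that dispatch-and-check idea, a minimal sketch might look like the following (all helper names are hypothetical, not part of storr; it only distinguishes gzip-compressed and uncompressed XDR rds from everything else):

```r
# Hypothetical writer: dispatch on type, falling back to ordinary RDS.
write_blob <- function(value, path) {
  if (is.data.frame(value) && requireNamespace("fst", quietly = TRUE)) {
    fst::write_fst(value, path)  # fst has its own on-disk format
  } else {
    saveRDS(value, path)         # default gzip-compressed RDS
  }
}

# Hypothetical reader: peek at the magic number before choosing a loader.
read_blob <- function(path) {
  magic <- readBin(path, "raw", n = 2)
  is_rds <- identical(magic, as.raw(c(0x1f, 0x8b))) ||  # gzip-compressed RDS
    identical(magic, charToRaw("X\n"))                  # uncompressed XDR RDS
  if (is_rds) readRDS(path) else fst::read_fst(path)
}
```

(bzip2- and xz-compressed rds would need extra cases, and the fst branch assumes fst files never begin with those two rds signatures.)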
What about an RDS-like driver with the ability to choose how individual objects are loaded and saved? We might be able to store the optional custom saving/loading methods in the key files. This could be especially useful in `drake`-powered deep learning workflows because Keras models require their own serialization. `saveRDS(keras_model)` unfortunately does not preserve the data correctly.
```r
my_storr$set(
  key = "model",
  value = keras_model,
  save = function(value, file) keras::save_model_hdf5(value, file),
  load = function(file) keras::load_model_hdf5(file)
)
my_storr$get(key = "model")
```
With `storr` as it is now, we could theoretically just call `my_storr$set("model", keras::serialize_model(keras_model))` and then `keras::unserialize_model(my_storr$get("model"))`, but that would serialize a big object twice. We could try to skip `base::serialize()`, but then we would just end up calling `base::unserialize()` on an object that really needs `keras::unserialize_model()`.
This would be possible to implement. We would need to know, for each special case:

- a serialisation function (`keras::serialize_model` here would be fine),
- a deserialisation function, and
- the magic number identifying the format on disk (for HDF5 that is `89 48 44 46 0d 0a 1a 0a`, and I've dug out the numbers for rds before).

This requires a bit of fiddling around with the current hash functions, but it could be possible.
The limitation would be that you'd pay a little extra I/O cost on each deserialisation, because you'd need to check the first few bytes and then read the whole thing. And if you had two things that serialised down to formats with the same magic number but different layouts, you'd be stuffed (so, for example, if keras saves models in an hdf5 format of one flavour and another thing uses a slightly different hdf5 format with a different load function, it just would not work).
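For illustration, such a magic-number lookup could be sketched as a small registry (hypothetical names throughout; the HDF5 bytes are the ones quoted above, and the `"X\n"` entry covers uncompressed XDR rds):

```r
# Hypothetical registry mapping a magic number to a deserialisation function.
loaders <- list(
  list(magic = as.raw(c(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a)),  # HDF5
       load  = function(path) keras::load_model_hdf5(path)),
  list(magic = charToRaw("X\n"),  # uncompressed XDR rds
       load  = readRDS)
)

# Peek at the header, then delegate to the first loader whose magic matches.
load_by_magic <- function(path) {
  header <- readBin(path, "raw", n = 8)
  for (entry in loaders) {
    n <- length(entry$magic)
    if (identical(header[seq_len(n)], entry$magic)) return(entry$load(path))
  }
  stop("unknown format: ", path)
}
```

This is exactly where the flavour problem bites: two entries with the same magic bytes but different load functions cannot be told apart.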
It might be worth thinking about whether you just want to special-case these beasts, though; it's going to put extra complexity somewhere, and it's probably worth thinking about whether you want to put that into a very fiddly configuration of the storr driver or just go "oh, you're doing keras stuff, let me save a copy of that into a special `keras` directory and return a reference to it".
Interesting. I was assuming we would need to store a deserialization reference somewhere else, like a key file, but it sounds like those first few bytes could save us some bookkeeping. Any reading material you would recommend on serialization internals?
> It might be worth thinking about whether you just want to special-case these beasts, though; it's going to put extra complexity somewhere, and it's probably worth thinking about whether you want to put that into a very fiddly configuration of the storr driver or just go "oh, you're doing keras stuff, let me save a copy of that into a special keras directory and return a reference to it"
I have not decided whether to have `drake` automatically do this with Keras targets in the backend, but if it leads to a nice framework, `drake` might accommodate Arrow in the same way. For now, I am proposing this workaround on the user side.
Hmm... my comment just now is quite long and very specific. I will relocate it to a new issue.
@richfitz, I am coming back to your suggestion from the bottom of https://github.com/richfitz/storr/issues/77#issuecomment-476297237. I am proposing a decorated `storr` for `drake`: https://github.com/ropensci/drake/issues/971#issuecomment-517971052. Is this something you think would be helpful for a user base beyond `drake`?
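Roughly, the decorated-storr idea wraps an existing storr and special-cases certain classes before delegating; a minimal sketch under assumed names (the real proposal is in the linked drake issue, and the keras class string here is hypothetical):

```r
# Hypothetical decorator: intercept set/get on an inner storr-like object.
decorate_storr <- function(inner) {
  list(
    set = function(key, value) {
      if (inherits(value, "keras.engine.training.Model")) {
        # Store the model's own serialisation as a raw vector.
        inner$set(key, keras::serialize_model(value))
      } else {
        inner$set(key, value)
      }
    },
    get = function(key) {
      value <- inner$get(key)
      # Crude heuristic for the sketch: raw vectors are assumed to be models.
      if (is.raw(value)) keras::unserialize_model(value) else value
    }
  )
}
```

Ordinary values pass straight through to the inner storr; only the special-cased class pays for its custom serialisation.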