richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

Other potential backends #77

Open richfitz opened 6 years ago

richfitz commented 6 years ago
wlandau commented 5 years ago

I think it would be extremely useful to have RDS-like drivers that overcome the serialization bottleneck. Would it be feasible to store unserialized binary blobs instead of RDS files? Can we leverage fst or Arrow for large data frames and still accommodate arbitrary data objects elsewhere in the same storr? Does thor make some of these points moot?

richfitz commented 5 years ago

Unserialized binary blobs don't really exist - there is no linear memory map for all but the simplest structures. fst is only going to work if every object serialised is a data.frame. Arrow is a possibility (it's the technology behind feather, I think) but that would require waiting for a significant amount of work to be done.

thor still requires serialising R objects. It's probably a little faster than rds with compression turned off, but still has to pay the cost of serialising.

It's ultimately a performance/generality tradeoff. If there is a storr backend that serialises only simple types (atomic types, lists, and therefore data.frames) it will choke as soon as something adds an exotic attribute.
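
To make that tradeoff concrete, here is a small check (not from the thread; it assumes, as the comment above implies, that fst does not round-trip arbitrary attributes the way RDS does):

df <- data.frame(x = 1:3)
attr(df, "exotic") <- list(note = "not tabular")  # an attribute outside fst's tabular model
p_fst <- tempfile(); p_rds <- tempfile()
fst::write_fst(df, p_fst)
saveRDS(df, p_rds)
attr(fst::read_fst(p_fst), "exotic")  # fst stores only the tabular data, so this should be NULL
attr(readRDS(p_rds), "exotic")        # RDS preserves arbitrary attributes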

What possibly could be done by someone sufficiently motivated would be to write replacements for readRDS and saveRDS that behave differently based on type, to efficiently serialise out the most common structures (though I suspect that only data.frame objects will see a big saving here). The reader would need to check the magic number of the files before reading them in. With that in place, a driver that directly extended the rds one would be trivial to write.
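
A minimal sketch of that dispatching pair, assuming (for illustration only) that data.frames go through fst and everything else through gzip-compressed RDS, whose files begin with the gzip magic bytes:

save_dispatch <- function(value, path) {
  if (is.data.frame(value)) {
    fst::write_fst(value, path)  # fast columnar format for data.frames
  } else {
    saveRDS(value, path)         # generic fallback; default output is gzip-compressed
  }
}

read_dispatch <- function(path) {
  magic <- readBin(path, what = "raw", n = 2)
  if (identical(magic, as.raw(c(0x1f, 0x8b)))) {
    readRDS(path)                # gzip magic number: treat as RDS
  } else {
    fst::read_fst(path)          # otherwise assume fst wrote it
  }
}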

wlandau commented 5 years ago

What about an RDS-like driver with the ability to choose how individual objects are loaded and saved? We might be able to store the optional custom saving/loading methods in the key files. This could be especially useful in drake-powered deep learning workflows because Keras models require their own serialization. saveRDS(keras_model) unfortunately does not preserve the data correctly.

my_storr$set(
  key = "model",
  value = keras_model,
  save = function(value, file) keras::save_model_hdf5(value, file),
  load = function(file) keras::load_model_hdf5(file)
)

my_storr$get(key = "model")

With storr as it is now, we could theoretically just call my_storr$set("model", keras::serialize_model(keras_model)) and then keras::unserialize_model(my_storr$get("model")), but that would serialize a big object twice. We could try to skip base::serialize(), but then we would just end up calling base::unserialize() on an object that really needs keras::unserialize_model().
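
Spelled out, that existing workaround looks like this; it works with storr today but pays the double serialisation described above:

raw_model <- keras::serialize_model(keras_model)  # pass 1: keras -> raw vector
my_storr$set("model", raw_model)                  # pass 2: storr serialises the raw vector again
model <- keras::unserialize_model(my_storr$get("model"))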

richfitz commented 5 years ago

This would be possible to implement. For each special case we would need to know how to recognise its serialised format and which save/load functions to use.

This requires a bit of fiddling around with the current hash functions, but it should be feasible.

The limitation would be that you'd pay a little extra I/O cost on each deserialisation, because you'd need to check the first few bytes and then read the whole thing. And if two things serialised down to formats sharing the same magic number, you'd be stuffed: for example, if keras saved models in one flavour of hdf5 and something else used a slightly different hdf5 format with a different load function, it just would not work.

It might be worth thinking about whether you just want to special-case these beasts, though; it's going to put extra complexity somewhere, and it's worth deciding whether that belongs in a very fiddly configuration of the storr driver or whether you just go "oh, you're doing keras stuff, let me save a copy of that into a special keras directory and return a reference to it".
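
One way to read that second option as code (a sketch only; save_keras_ref, load_keras_ref, the keras_ref class, and the directory layout are all invented for illustration):

save_keras_ref <- function(st, key, model, dir = "keras_store") {
  dir.create(dir, showWarnings = FALSE)
  path <- file.path(dir, paste0(key, ".h5"))
  keras::save_model_hdf5(model, path)  # the real model lives outside storr
  st$set(key, structure(list(path = path), class = "keras_ref"))  # storr holds only a reference
}

load_keras_ref <- function(st, key) {
  ref <- st$get(key)
  keras::load_model_hdf5(ref$path)
}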

wlandau commented 5 years ago

Interesting. I was assuming we would need to store a deserialization reference somewhere else, like a key file, but it sounds like those first few bytes could save us some bookkeeping. Any reading material you would recommend on serialization internals?

> It might be worth thinking about whether you just want to special-case these beasts, though; it's going to put extra complexity somewhere, and it's worth deciding whether that belongs in a very fiddly configuration of the storr driver or whether you just go "oh, you're doing keras stuff, let me save a copy of that into a special keras directory and return a reference to it".

I have not decided whether to have drake automatically do this with Keras targets in the backend, but if it leads to a nice framework, drake might accommodate Arrow in the same way. For now, I am proposing this workaround on the user side.

wlandau commented 5 years ago

Hmm... my comment just now is quite long and very specific. I will relocate it to a new issue.

wlandau commented 5 years ago

@richfitz, I am coming back to your suggestion from the bottom of https://github.com/richfitz/storr/issues/77#issuecomment-476297237. I am proposing a decorated storr for drake: https://github.com/ropensci/drake/issues/971#issuecomment-517971052. Is this something you think would be helpful for a user base beyond drake?