Yes, the current dataset representation isn't as memory efficient as we'd like it to be, particularly when deserializing from protobufs. The protobuf deserialization path doesn't deduplicate the feature name strings on the way through (unlike `Dataset.add` or Java serialization), which will increase the memory consumed unless you're using a new enough JDK which automatically does string deduplication in the garbage collector. If your data is dense we're already planning to build a dense example which should have half the memory usage of the current sparse example (as it will only store the feature values, not the names).
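In the meantime there are a couple of stopgaps: on a JDK running the G1 collector you can turn on `-XX:+UseStringDeduplication` so the GC collapses duplicate name strings for you, or you can canonicalise the names yourself before building the examples. A minimal sketch of the latter (nothing Tribuo-specific, just a shared pool of name strings):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: keep one canonical String instance per feature name so that
// every example built from deserialized data shares it.
public final class FeatureNamePool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the pooled instance of this name, adding it on first sight. */
    public String canonical(String name) {
        return pool.computeIfAbsent(name, n -> n);
    }
}
```

Running each feature name through `canonical` before constructing the feature means duplicates collapse to a single instance regardless of which deserialization path produced them (`String.intern()` does the same job if you're happy using the JVM-wide intern table).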
Could you provide a little detail on the size and shape of the datasets you want to work with? Are they sparse or dense? How many features & examples are there?
Tribuo isn't particularly designed for very large datasets, as most of the training methods create a copy of the data to get it into a more compute friendly representation, and the original dataset will still have a reference on the stack during a training call, meaning it can't be garbage collected. We're investigating online learning support, which will allow some models to scale much further with respect to data size as you'll only need a portion of the data in memory while it is used for training, but that won't be possible for all model types.
My current use case has about 80k examples, with about 1000 dense features each. (Short phrases that have been run through BERT or another embedding, plus a few fistfuls of contextual features.)
I anticipate that my number of examples is going to grow exponentially.
Online learning could be made to work (and would be a fantastic addition for other use cases), but I'd much prefer being able to work with the existing `Dataset` abstraction and provide my own backing store.
OK, sounds like the dense example will help you quite a bit. Moving to memory mapped IO as a supported Tribuo `Dataset` class will be hard, as neither the protobuf format we're moving to nor the current `java.io.Serializable` representation is suitable for that, so we'd need to design a new disk representation and then we'd have to live with it for a long time.
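For scale, some rough numbers (assuming 8-byte doubles and ignoring object headers and provenance): 80,000 examples × 1,000 features × 8 bytes is about 640 MB of feature values alone. The current sparse example also carries a per-feature name reference on top of that, plus the name strings themselves if they aren't deduplicated, whereas a dense example only needs the value array, which is where the roughly 2x saving comes from.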
If you're happy with writing your own dataset, then the protobuf serialization mechanisms will accept other classes that implement `ProtoSerializable`, so you could override all the serialization stuff and have your own `Example` and `Dataset` that each implement `ProtoSerializable` for the wrapper types, but contain your own implementation of the protobuf `Message`. The serialization infrastructure will then route the deserialization of that message to your class automatically via reflection, because the class name is stored in the message. You could then sidestep the examples and just load the dataset with `Dataset.deserialize`, which will call through to `YourMMIODataset.deserializeFromProto(int version, String className, Any message)`.
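In case it's useful, here's a very rough skeleton of what that hook could look like. Everything prefixed `MMIO` below is hypothetical (the proto message is one you'd define yourself, holding a pointer to the memory mapped store rather than the examples), and I've left out the `Dataset` constructor and bookkeeping; the fixed points are the `deserializeFromProto(int, String, Any)` signature and the `Any.unpack` call:

```java
import com.google.protobuf.Any;
import com.google.protobuf.InvalidProtocolBufferException;

// Hypothetical sketch: MMIODataset would extend org.tribuo.Dataset and keep its
// examples in a memory mapped store rather than an on-heap list. MMIODatasetProto
// is a protobuf message you'd define yourself (e.g. holding the store's path).
public final class MMIODataset /* extends Dataset<Label> */ {

    private static final int CURRENT_VERSION = 0;

    // Invoked reflectively by the deserialization infrastructure, which finds this
    // class via the class name recorded in the wrapper message.
    public static MMIODataset deserializeFromProto(int version, String className, Any message)
            throws InvalidProtocolBufferException {
        if (version > CURRENT_VERSION) {
            throw new IllegalArgumentException("Unknown proto version " + version);
        }
        MMIODatasetProto proto = message.unpack(MMIODatasetProto.class);
        // Re-open the memory mapped store the proto points at rather than
        // materialising every Example on the heap.
        return new MMIODataset(proto.getStorePath());
    }

    private MMIODataset(String storePath) {
        // ... wire up the memory mapped backing store ...
    }
}
```

Saving via the wrapper types and then calling `Dataset.deserialize` on the resulting proto would route through to this method as described above.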
**Is your feature request related to a problem? Please describe.**
I'm frequently frustrated by OOMEs when building (or deserializing) Datasets that don't fit in heap memory (or even physical memory).
**Describe the solution you'd like**
I'd like to see an option in (or around) `org.tribuo.Dataset` to use mapped memory for storing Examples rather than keeping them on-heap.
**Describe alternatives you've considered**
I've considered subclassing `Dataset` and reimplementing everything that makes use of the `data` member, replacing it with an instance of Jan Kotek's MapDB, and using the existing protobuf implementations to marshal the Examples to/from storage. I also considered rolling my own MMIO-backed ISAM instead of MapDB, given how simple the use case is. The reason I've not yet done these is that my Datasets are computationally expensive to prepare; I need to serialize and deserialize them when spinning processes up and down, and the existing protobuf-based implementations all instantiate Datasets with on-heap storage.
I've also considered buying a ton of physical memory. ;)
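To make the MapDB alternative above concrete, the shape I have in mind is roughly the following (assuming MapDB 3.x; the class and store names are just illustrative, and the Tribuo-facing wiring that inflates Examples from the stored bytes is omitted):

```java
import java.util.List;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

// Illustrative only: store each Example as its serialized protobuf bytes in a
// memory mapped MapDB list, inflating them lazily on access instead of keeping
// them all on the heap.
public final class MMapExampleStore implements AutoCloseable {
    private final DB db;
    private final List<byte[]> examples;

    public MMapExampleStore(String path) {
        this.db = DBMaker.fileDB(path)
                .fileMmapEnable() // use memory mapped files for the backing store
                .make();
        this.examples = db.indexTreeList("examples", Serializer.BYTE_ARRAY)
                .createOrOpen();
    }

    public void add(byte[] serializedExample) {
        examples.add(serializedExample);
    }

    public byte[] get(int index) {
        return examples.get(index);
    }

    public int size() {
        return examples.size();
    }

    @Override
    public void close() {
        db.close();
    }
}
```

A Dataset subclass would then hold one of these in place of the on-heap data list and deserialize each Example from its bytes on demand.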