tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

Can tfdf work with a streaming tf dataset? #10

Closed sibyjackgrove closed 3 years ago

sibyjackgrove commented 3 years ago

My training data is in a multi-GB CSV file. I have built a data pipeline using tf.data to stream this data and do some pre-processing. Can I use these dataset objects in tfdf model.fit (similar to how it is done in Keras), or does tfdf need the dataset to have all the data stored in memory?

achoum commented 3 years ago

Currently, the entire dataset needs to fit in memory.

You can feed the dataset as a stream using a tf.data.Dataset, and it is a good idea in this case. See the dataset section of the migration guide for more details. However, the memory consumption will still be ~4 bytes per value, plus the index.
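To make the ~4 bytes per value figure concrete, here is a rough sketch (not an official recipe). The helper names `estimate_dataset_bytes` and `train_from_csv`, the row and column counts, and the batch size are all hypothetical; the estimate ignores the index overhead mentioned above, so treat it as a lower bound. The training function assumes the CSV has a header row and that `tf.data.experimental.make_csv_dataset` is suitable for your file.

```python
def estimate_dataset_bytes(num_rows: int, num_columns: int,
                           bytes_per_value: int = 4) -> int:
    """Rough lower bound on RAM used by the dataset once loaded.

    Assumes ~4 bytes per value, as stated above. The index overhead is
    not included, so the real footprint will be somewhat larger.
    """
    return num_rows * num_columns * bytes_per_value


def train_from_csv(csv_path: str, label: str, batch_size: int = 1024):
    """Stream a CSV through tf.data and train a model on it.

    The data is read as a stream, but the model still accumulates the
    whole dataset in memory before training.
    """
    # Imported lazily so the estimate above works without TensorFlow.
    import tensorflow as tf
    import tensorflow_decision_forests as tfdf

    dataset = tf.data.experimental.make_csv_dataset(
        csv_path,
        batch_size=batch_size,
        label_name=label,
        num_epochs=1,    # the dataset must be finite: exactly one pass
        shuffle=False,   # shuffling is not needed for this kind of training
    )
    model = tfdf.keras.GradientBoostedTreesModel()
    model.fit(dataset)
    return model


if __name__ == "__main__":
    # Hypothetical dataset: 50 million rows, 20 columns.
    needed = estimate_dataset_bytes(50_000_000, 20)
    print(f"~{needed / 1e9:.1f} GB before index overhead")
```

If the estimate is larger than the RAM you have, streaming alone will not help; you would need to reduce the number of rows or columns before calling `fit`.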

See my comments on this issue for some details on how to optimize the RAM consumption.