The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0
1.78k stars · 285 forks
Refactor inmem cache out of BatchedDataLoader, create an inmem dataloader instead? #664
I think BatchedDataLoader is designed for the case where the files are larger than memory, so it streams rows from disk into memory and shuffles the data along the way.
However, if the in-memory cache option is enabled, we assume memory is large enough to hold all rows. But the current implementation uses two shuffling queues, which effectively assumes memory is 2X the size of all rows.
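For context, the streaming-shuffle approach keeps only a bounded buffer of rows resident, which is what makes it suitable for larger-than-memory files. A minimal generic sketch of a bounded shuffling buffer (illustrative only, not Petastorm's actual ShufflingBuffer implementation):

```python
import random

def buffered_shuffle(row_stream, buffer_size, seed=0):
    """Stream rows through a bounded shuffle buffer.

    At most `buffer_size` rows are resident at a time, so the input
    stream can be far larger than memory. Generic sketch, not
    Petastorm's implementation.
    """
    rng = random.Random(seed)
    buffer = []
    for row in row_stream:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            # Emit a randomly chosen resident row, freeing a slot.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # End of stream: drain the remaining rows in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(iter(range(100)), buffer_size=16))
```

Note the trade-off this issue is pointing at: when the cache already holds every row in memory, a second resident copy inside such buffers is pure overhead.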
For this case, I think we can create a pure in-memory dataloader that loads all rows once, then shuffles indices in each epoch, much like the pytorch distributed sampler does. This makes data loading and shuffling much easier to understand for the in-memory case, and is more memory efficient. I have some draft code: https://github.com/chongxiaoc/petastorm/commit/5c9998d1fe8895f8c51362f87f7c080c8d5ee5a3
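To illustrate the idea: load the rows once, then reshuffle only an index list each epoch, so no second copy of the data is ever made. A minimal sketch; the class and parameter names here are hypothetical, not Petastorm's actual API:

```python
import random

class InMemoryDataLoader:
    """Sketch of an epoch-shuffling in-memory loader, in the spirit of
    torch.utils.data.DistributedSampler: the rows are stored once, and
    only a list of indices is reshuffled per epoch."""

    def __init__(self, rows, batch_size, seed=0):
        self.rows = list(rows)        # all rows resident exactly once
        self.batch_size = batch_size
        self.seed = seed
        self.epoch = 0

    def __iter__(self):
        # Shuffle indices, not rows: O(n) ints instead of a 2X data copy.
        indices = list(range(len(self.rows)))
        rng = random.Random(self.seed + self.epoch)  # per-epoch seed
        rng.shuffle(indices)
        self.epoch += 1
        for start in range(0, len(indices), self.batch_size):
            yield [self.rows[i] for i in indices[start:start + self.batch_size]]

loader = InMemoryDataLoader(range(10), batch_size=4)
epoch1 = [row for batch in loader for row in batch]  # permutation for epoch 0
epoch2 = [row for batch in loader for row in batch]  # typically a different permutation
```

Each epoch visits every row exactly once, and reseeding by epoch number keeps the shuffle deterministic and reproducible across workers.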
@selitvin What do you think?