uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Refactor inmem cache out of BatchedDataLoader, create an inmem dataloader instead? #664

Open chongxiaoc opened 3 years ago

chongxiaoc commented 3 years ago

I think BatchedDataLoader is designed for the case where the files are larger than memory: it streams rows from disk into memory and shuffles the data along the way.

However, when the in-memory cache option is enabled, we assume memory is large enough to hold all rows. The current implementation still uses two shuffling queues, which effectively assumes memory is 2x the size of all rows.

For this case, I think we can create a pure in-memory dataloader that loads all rows once and then shuffles only the indices in each epoch, much like PyTorch's DistributedSampler does. This makes data loading and shuffling much easier to understand for the in-memory case, and it is memory efficient.
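To make the idea concrete, here is a minimal sketch of what such a loader could look like. This is illustrative only, not the draft-commit code: the `InMemoryDataLoader` name and the dict-of-tensors `rows` input are assumptions.

```python
import numpy as np
import torch


class InMemoryDataLoader:
    """Sketch: materialize all rows once, then reshuffle only a
    lightweight index array at each epoch (cf. PyTorch's
    DistributedSampler). Not petastorm API; names are illustrative."""

    def __init__(self, rows, batch_size, shuffle=True, seed=0):
        # `rows` is assumed to be a dict of column name -> torch.Tensor,
        # already loaded into memory (e.g. from a Parquet reader).
        self.rows = rows
        self.num_rows = next(iter(rows.values())).shape[0]
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.epoch = 0

    def __iter__(self):
        # Only the index permutation is regenerated per epoch; the row
        # data itself is stored exactly once, so no 2x memory is needed.
        if self.shuffle:
            rng = np.random.default_rng(self.seed + self.epoch)
            order = rng.permutation(self.num_rows)
        else:
            order = np.arange(self.num_rows)
        self.epoch += 1
        for start in range(0, self.num_rows, self.batch_size):
            idx = torch.from_numpy(order[start:start + self.batch_size])
            yield {name: col[idx] for name, col in self.rows.items()}


# Hypothetical usage, with `rows` standing in for data read from Parquet:
loader = InMemoryDataLoader(
    rows={"x": torch.randn(10, 4), "y": torch.arange(10)}, batch_size=4)
for batch in loader:
    pass  # train on batch["x"], batch["y"]
```

Since only the index permutation is rebuilt each epoch, the memory footprint stays close to a single copy of the dataset, instead of the roughly 2x required by the two shuffling queues.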

I have some draft code: https://github.com/chongxiaoc/petastorm/commit/5c9998d1fe8895f8c51362f87f7c080c8d5ee5a3

@selitvin What do you think?

selitvin commented 3 years ago

It sounds like a good idea. The new class is indeed easier to understand and maintain.

chongxiaoc commented 3 years ago

Great, I will go ahead and draft a PR.