uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Refactor inmem cache out of BatchedDataLoader, create an inmem dataloader instead? #664

Open chongxiaoc opened 3 years ago

chongxiaoc commented 3 years ago

I think BatchedDataLoader is designed for the case where the files are larger than memory: it streams rows from disk into memory and shuffles the data along the way.

However, when the in-memory cache option is enabled, we assume memory is large enough to hold all rows. The current implementation still uses two shuffling queues, which effectively assumes memory is 2x the size of all rows.

For this case, I think we can create a pure in-memory dataloader that loads all rows once and then shuffles only the indices in each epoch, much like PyTorch's DistributedSampler does. This makes data loading and shuffling much easier to understand for the in-memory case, and it is memory efficient.
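To make the idea concrete, here is a minimal sketch of what such a loader could look like. This is illustrative only, not the draft-commit code: the `InMemoryDataLoader` name and the dict-of-tensors `rows` input are assumptions.

```python
import numpy as np
import torch


class InMemoryDataLoader:
    """Sketch: materialize all rows once, then reshuffle only a
    lightweight index array at each epoch (cf. PyTorch's
    DistributedSampler). Not petastorm API; names are illustrative."""

    def __init__(self, rows, batch_size, shuffle=True, seed=0):
        # `rows` is assumed to be a dict of column name -> torch.Tensor,
        # already loaded into memory (e.g. from a Parquet reader).
        self.rows = rows
        self.num_rows = next(iter(rows.values())).shape[0]
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.epoch = 0

    def __iter__(self):
        # Only the index permutation is regenerated per epoch; the row
        # data itself is stored exactly once, so no 2x memory is needed.
        if self.shuffle:
            rng = np.random.default_rng(self.seed + self.epoch)
            order = rng.permutation(self.num_rows)
        else:
            order = np.arange(self.num_rows)
        self.epoch += 1
        for start in range(0, self.num_rows, self.batch_size):
            idx = torch.from_numpy(order[start:start + self.batch_size])
            yield {name: col[idx] for name, col in self.rows.items()}


# Hypothetical usage, with `rows` standing in for data read from Parquet:
loader = InMemoryDataLoader(
    rows={"x": torch.randn(10, 4), "y": torch.arange(10)}, batch_size=4)
for batch in loader:
    pass  # train on batch["x"], batch["y"]
```

Since only the index permutation is rebuilt each epoch, the memory footprint stays close to a single copy of the dataset, instead of the roughly 2x required by the two shuffling queues.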

I have some draft code: https://github.com/chongxiaoc/petastorm/commit/5c9998d1fe8895f8c51362f87f7c080c8d5ee5a3

@selitvin What do you think?

selitvin commented 3 years ago

It sounds like a good idea. The new class is indeed easier to understand and maintain.

chongxiaoc commented 3 years ago

Great, I will go ahead and draft a PR.