lorenzoh / DataLoaders.jl

A parallel iterator for large machine learning datasets that don't fit into memory, inspired by PyTorch's `DataLoader` class.
https://lorenzoh.github.io/DataLoaders.jl/docs/dev/interactive
MIT License

Rename/PR to MLDataPattern.jl #9

darsnack opened this issue 4 years ago · Open

darsnack commented 4 years ago

Great work!

I've been working on a similar idea, and I was wondering if you would consider submitting this work as a PR to MLDataPattern.jl. The key feature here is the async iterator, and I was planning to add such an iterator to MLDataPattern.jl myself. Features like collation and batching can be implemented as modifications to the existing `BatchView` in MLDataPattern.jl (sketched below).
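For context, here is a minimal sketch of how `batchview` partitions a data container in MLDataPattern.jl today; the keyword form follows the MLDataPattern.jl docs, and how exactly collation would be layered on top is left open:

```julia
using MLDataPattern

X = rand(Float32, 4, 150)    # a data container with 150 observations

# BatchView lazily partitions the container into batches of 30 observations.
# Each element is itself a lazy data subset, not yet collated into one array.
bv = batchview(X, size = 30)

@assert length(bv) == 5
first(bv)                    # a 4×30 view into X
```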

lorenzoh commented 4 years ago

Reposting an adapted version of our Zulip conversation here:

Would definitely consider it. I've tried to write DataLoaders.jl out of composable pieces, similar to MLDataPattern.jl. The `DataLoader` interface is a thin wrapper around the following pieces (see the sketch after the list):

  • `batchviewcollated`: like `MLDataPattern.batchview`, but collates the batches while still supporting `getobs!`
  • `GetObsAsync`: makes a data iterator from your data container, but loads samples off the main thread with multiple workers
  • `BufferGetObsAsync`: like `MLDataPattern.eachobs`, but loads data in parallel like the above; supports in-place `getobs!` with a ring buffer
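To make the composition concrete, a rough sketch using the names from this comment; the exact signatures and exports are assumptions, not the package's confirmed API:

```julia
using DataLoaders

data = rand(Float32, 128, 10_000)    # a container with 10_000 observations

# Wrap the container in a collated batch view: like `batchview`, but each
# batch is pre-collated into a single array and reusable via `getobs!`.
batches = batchviewcollated(data, 16)

# Iterate the batches in parallel off the main thread, reusing buffers
# from a ring buffer. (`BufferGetObsAsync` is the name used in this
# comment; its signature here is assumed.)
for batch in BufferGetObsAsync(batches)
    # `batch` is a collated 128×16 Matrix{Float32}
end
```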

I would like to make sure the functionality is stable before merging it into MLDataPattern.jl, since I'm not great at parallel programming and there might still be some subtle bugs lurking in the code.