pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

[WIP] Examples for demonstrating the usage and incremental value of TorchData Nodes #1352

Open divyanshk opened 2 weeks ago

divyanshk commented 2 weeks ago

🚀 The feature

Starting this issue to track the minimal examples we can create to demonstrate effective usage and the incremental value of TorchData nodes. I can open a separate issue for each of these as needed.

Motivation, pitch

  1. Vanilla torch.utils.data DataLoader usage ported over to TorchData nodes
  2. GPU-accelerated transforms
  3. Flexible parallelism (mixing multiprocessing with multithreading)
  4. Examples of porting over popular OSS datasets
    • including connecting to popular cloud storage
  5. Example of creating new nodes (might get covered by the examples above)
  6. Basic multimodal model trained end-to-end using TorchData nodes
  7. Chaining multiple transforms (might get covered by the examples above)
  8. Dataset mixing (with different sampling strategies)
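
To make item 8 concrete, here is a rough, library-independent sketch of weighted dataset mixing. It uses only the Python standard library and is not the TorchData nodes API; the function name `mix_datasets`, the source names, and the weights are all made up for illustration. The stopping rule (keep going until every source is exhausted) is just one possible sampling strategy.

```python
import random
from typing import Dict, Iterable, Iterator, TypeVar

T = TypeVar("T")


def mix_datasets(
    sources: Dict[str, Iterable[T]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[T]:
    """Yield items by picking a source at random, proportionally to its weight.

    Hypothetical helper, not part of torchdata. An exhausted source is dropped
    and sampling continues over the remaining ones until all are exhausted.
    """
    rng = random.Random(seed)  # seeded so the mixing order is reproducible
    iterators = {name: iter(src) for name, src in sources.items()}
    while iterators:
        names = list(iterators)
        # random.choices treats the weights as relative, so no renormalising
        # is needed after a source has been dropped.
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield next(iterators[name])
        except StopIteration:
            del iterators[name]


# Toy stand-ins for two datasets of different sizes.
text_items = [f"text-{i}" for i in range(6)]
image_items = [f"image-{i}" for i in range(3)]

mixed = list(
    mix_datasets(
        {"text": text_items, "image": image_items},
        {"text": 0.7, "image": 0.3},
        seed=0,
    )
)
```

Every item from both sources appears exactly once in `mixed`, and the relative order within each source is preserved. Other strategies (for example stopping at the first exhausted source, or restarting exhausted sources) would change only the `except StopIteration` branch.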

Alternatives

No response

Additional context

No response

andrewkho commented 2 weeks ago

From discussions: let's do items 1, 2, 4 (but with Hugging Face), 6 (through torchtune), and 8.

andrewkho commented 2 weeks ago

Also add tasks for documentation, READMEs, docstrings, parameter lists, and a design doc.

andrewkho commented 2 weeks ago

Also a contribution guide (lower priority).