pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

Linter for DataPipe/DataLoader2 #364

Open NivekT opened 2 years ago

NivekT commented 2 years ago

🚀 The feature

This issue proposes adding a linter for DataPipes and DataLoader2. The linter can analyze the graph of DataPipes and the input arguments to DataLoader2, and inform users ahead of time of errors that may occur. An incomplete list of the issues the linter may try to detect and raise is below. Please feel free to edit the list directly to add more, or comment below.

Essential:

Nice-to-have:

Motivation, pitch

Having a linter will encourage best practices in DataPipe usage and reduce the number of unexpected bugs/behaviors in the data loading process at runtime.

Alternatives

Only raise exceptions during runtime.

Additional context

This linter is expected to work with DataPipes and DataLoader2. We should consider whether it should work with the original DataLoader as well (and how).

cc: @VitalyFedyunin @ejguan

pmeier commented 2 years ago

IIRC, we should enforce shuffling before sharding. cc @NicolasHug

VitalyFedyunin commented 2 years ago

If the data source is already shuffled to some extent, sharding before shuffling might be a valid operation, so a warning IMO suffices.

VitalyFedyunin commented 2 years ago

We should add one more linter check that measures DataPipe object size and warns if it is too big (e.g., premature initialization of large structures).
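A minimal sketch of what such a size check could look like, not an actual torchdata implementation. It assumes that the pickled size of an object's attribute dictionary is a reasonable proxy for its size. The names `SIZE_LIMIT_BYTES`, `state_size`, `check_size`, `EagerPipe`, and `LazyPipe` are hypothetical, and the 1 MiB threshold is arbitrary, since this thread does not say what "too big" means.

```python
import pickle
import warnings

# Hypothetical threshold; the issue does not define "too big".
SIZE_LIMIT_BYTES = 1 << 20  # 1 MiB


def state_size(obj) -> int:
    """Return the pickled size in bytes of the object's attribute dictionary."""
    return len(pickle.dumps(vars(obj)))


def check_size(obj, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Warn and return False if the object's state pickles to more than ``limit`` bytes."""
    size = state_size(obj)
    if size > limit:
        warnings.warn(
            f"{type(obj).__name__} holds {size} bytes of state (limit {limit}); "
            "consider deferring construction of large members until iteration."
        )
        return False
    return True


class EagerPipe:
    """Stand-in for a DataPipe that builds a large table in __init__."""

    def __init__(self, n: int):
        self.table = list(range(n))


class LazyPipe:
    """Stand-in that stores only the parameter and produces values on iteration."""

    def __init__(self, n: int):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))
```

With these stand-ins, `check_size(EagerPipe(500_000))` emits a warning, while `check_size(LazyPipe(500_000))` passes because only the integer `n` is stored.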

ejguan commented 2 years ago

#429: This PR introduces a linter that checks whether there is a shuffle before each sharding operation. It should mainly be used by domain libraries to verify their implementations. Note: users can still disable shuffling using DL2.

For a user-facing linter, we might provide a variant of this linter function: raise a warning when there is no shuffle before sharding, and raise an error/warning if there is a shuffle after sharding.
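The user-facing rule described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code from the PR. The pipeline is modeled as a linear chain of hypothetical `Node` objects rather than a real DataPipe graph, and `chain`, `kinds_in_order`, and `lint_shuffle_sharding` are made-up helper names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for one DataPipe in a linear chain."""

    kind: str  # e.g. "source", "shuffle", "sharding", "map"
    source: Optional["Node"] = None  # upstream node, None for the root


def chain(*kinds: str) -> Optional[Node]:
    """Build a linear pipeline from source to sink and return the sink."""
    node = None
    for kind in kinds:
        node = Node(kind, node)
    return node


def kinds_in_order(sink: Optional[Node]) -> List[str]:
    """Walk upstream from the sink and return node kinds ordered source to sink."""
    kinds = []
    node = sink
    while node is not None:
        kinds.append(node.kind)
        node = node.source
    return kinds[::-1]


def lint_shuffle_sharding(sink: Optional[Node]) -> List[str]:
    """Return messages for each sharding node that violates the rule."""
    messages = []
    kinds = kinds_in_order(sink)
    for i, kind in enumerate(kinds):
        if kind != "sharding":
            continue
        if "shuffle" not in kinds[:i]:
            messages.append("warning: no shuffle before sharding")
        if "shuffle" in kinds[i + 1:]:
            messages.append("error: shuffle after sharding")
    return messages
```

For example, `lint_shuffle_sharding(chain("source", "shuffle", "sharding", "map"))` returns an empty list, while a chain of `"source", "sharding", "shuffle"` yields both messages. A real linter would need to handle branching graphs, which this sketch ignores.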

VitalyFedyunin commented 2 years ago

[ ] Warn if filter appears between on_disk_cache and end_caching sections.
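One way this check might be sketched, assuming the pipeline can be flattened into an ordered list of operation names. That is a simplification: the list representation and the function name `find_filters_in_cache_sections` are hypothetical, and a real linter would inspect the DataPipe graph instead.

```python
from typing import List


def find_filters_in_cache_sections(ops: List[str]) -> List[int]:
    """Return indices of "filter" ops located between "on_disk_cache" and "end_caching"."""
    offending = []
    inside_cache = False
    for index, op in enumerate(ops):
        if op == "on_disk_cache":
            inside_cache = True
        elif op == "end_caching":
            inside_cache = False
        elif op == "filter" and inside_cache:
            offending.append(index)
    return offending


pipeline = ["source", "on_disk_cache", "map", "filter", "end_caching", "filter"]
for index in find_filters_in_cache_sections(pipeline):
    print(f"warning: filter at position {index} is inside a caching section")
```

Only the filter at position 3 is reported; the one after `end_caching` is outside the caching section.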