uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can also be used from pure Python code.
Apache License 2.0

New PyTorch BatchedDataLoader implementation using batched operations #540

Closed fps7806 closed 4 years ago

fps7806 commented 4 years ago

The current DataLoader implementation relies on Python iteration over individual rows of a Parquet dataset. This is reasonable for small datasets, but it runs into throughput limitations:

Previous timing:

Samples per second for batch 10: 7.556e+04
Samples per second for batch 100: 1.001e+05
Samples per second for batch 1000: 1.083e+05
Samples per second for batch 100000: 4.364e+04

The new approach converts data into torch format BEFORE appending it to the shuffling buffer, and then relies on batched operations to perform all appending and sampling.
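To make the idea concrete, here is a minimal, illustrative sketch of a batched shuffling buffer; it is not the PR's actual BatchedShufflingBuffer, it assumes rows are already torch tensors, and the eviction policy is simplified:

```python
import torch


class TinyBatchedShuffleBuffer:
    """Illustrative only: append and sample whole batches with tensor ops."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = None  # one tensor holding every buffered row

    def add_many(self, batch):
        # Append an entire batch at once instead of iterating row by row in Python.
        batch = torch.as_tensor(batch)
        self._store = batch if self._store is None else torch.cat([self._store, batch])
        if self._store.shape[0] > self.capacity:
            # Simplified eviction: keep only the newest `capacity` rows.
            self._store = self._store[-self.capacity:]

    def retrieve_many(self, n):
        # Draw a random batch with a single vectorized gather.
        idx = torch.randperm(self._store.shape[0])[:n]
        return self._store[idx]
```

For example, `buf.add_many(torch.randn(1024, 16))` followed by `buf.retrieve_many(256)` performs one append and one sample without any per-row Python loop.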

New timing:

Samples per second for batch 10: 4.363e+05
Samples per second for batch 100: 1.468e+06
Samples per second for batch 1000: 1.607e+06
Samples per second for batch 100000: 1.141e+06

New timing with GPU-buffering:

Samples per second for batch 10: 3.87e+05
Samples per second for batch 100: 1.95e+06
Samples per second for batch 1000: 2.307e+06
Samples per second for batch 100000: 5.053e+06

In the best case, throughput increases by roughly two orders of magnitude (from 4.364e+04 to 5.053e+06 samples/s at batch size 100000 with GPU buffering). Image-based problems would likely not reach this throughput, but these numbers are more representative of what is achievable on tabular-data problems.

The proposed changes can be used out-of-the-box in most scenarios.
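As a rough usage sketch (the dataset URL, batch size, and shuffling_queue_capacity value are placeholders, and the exact constructor arguments may differ from the merged code):

```python
from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

# Placeholder dataset URL; any Parquet dataset readable by make_batch_reader works.
with make_batch_reader('file:///tmp/my_parquet_dataset') as reader:
    loader = BatchedDataLoader(reader, batch_size=1024,
                               shuffling_queue_capacity=100000)
    for batch in loader:
        # Assumption, mirroring the existing DataLoader: each batch maps
        # column names to torch tensors.
        pass
```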

There are still many improvements that can be made to the BatchedShufflingBuffer logic, but this implementation is an initial step in that direction. From my own experiments, I don't think any approach that relies on collate_fn will be able to match batched operations in a single thread.
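A single-thread micro-comparison of the two styles (entirely illustrative; the data size and batch size are made up):

```python
import time

import torch

data = torch.randn(1_000_000, 32)        # synthetic buffered rows
batch_size = 1024
idx = torch.randint(len(data), (batch_size,))

# collate_fn style: gather rows one at a time in Python, then stack.
start = time.time()
stacked = torch.stack([data[i] for i in idx])
per_row = time.time() - start

# Batched style: one vectorized gather on the same indices.
start = time.time()
gathered = data[idx]
batched = time.time() - start

print('per-row collate: {:.6f}s  batched gather: {:.6f}s'.format(per_row, batched))
```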

As a bonus, I set metadata_nthreads=10 to make the Parquet metadata faster to load.
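For reference, metadata_nthreads is, as far as I can tell, forwarded to pyarrow's legacy ParquetDataset, where it parallelizes reading the per-file footers; a minimal sketch against an older pyarrow API:

```python
import pyarrow.parquet as pq

# Older pyarrow versions accept metadata_nthreads on the legacy ParquetDataset;
# 10 mirrors the value used in this PR. The path is a placeholder.
dataset = pq.ParquetDataset('/tmp/my_parquet_dataset', metadata_nthreads=10)
```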

EDIT: I added a dummy_reader.py benchmark: `python -m petastorm.benchmark.dummy_reader cuda 10000 32`. The arguments are the device (cpu/cuda), the reader batch size, and the data shape.
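The benchmark script itself is in the PR diff; for readers without it handy, here is a hand-rolled sketch of the kind of synthetic throughput measurement it performs. The `measure_throughput` helper and the synthetic generator are illustrative, not part of petastorm:

```python
import time

import torch


def measure_throughput(loader, warmup_batches=10, timed_batches=100):
    """Samples/second for an iterable yielding tensors with a leading batch dimension."""
    it = iter(loader)
    for _ in range(warmup_batches):
        next(it)                      # let any buffers fill before timing
    start = time.time()
    samples = 0
    for _ in range(timed_batches):
        samples += next(it).shape[0]
    elapsed = time.time() - start
    return samples / elapsed


if __name__ == '__main__':
    # Synthetic stand-in for a reader: random (batch_size, 32) float tensors,
    # mirroring the batch-size / data-shape arguments of the command above.
    synthetic = (torch.randn(10000, 32) for _ in range(200))
    print('Samples per second: {:.3e}'.format(measure_throughput(synthetic)))
```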

codecov[bot] commented 4 years ago

Codecov Report

Merging #540 into master will increase coverage by 3.03%. The diff coverage is 72.77%.


@@            Coverage Diff             @@
##           master     #540      +/-   ##
==========================================
+ Coverage   82.88%   85.92%   +3.03%     
==========================================
  Files          85       87       +2     
  Lines        4721     4922     +201     
  Branches      744      780      +36     
==========================================
+ Hits         3913     4229     +316     
+ Misses        678      569     -109     
+ Partials      130      124       -6     
Impacted Files                                      Coverage Δ
petastorm/benchmark/dummy_reader.py                  0.00% <0.00%> (ø)
petastorm/reader.py                                 90.73% <ø> (ø)
petastorm/reader_impl/pytorch_shuffling_buffer.py   96.42% <96.42%> (ø)
petastorm/pytorch.py                                94.21% <97.43%> (+1.53%) ↑
petastorm/etl/dataset_metadata.py                   88.00% <100.00%> (ø)
petastorm/py_dict_reader_worker.py                  95.23% <0.00%> (+0.79%) ↑
petastorm/spark/spark_dataset_converter.py          91.76% <0.00%> (+1.49%) ↑
petastorm/arrow_reader_worker.py                    92.00% <0.00%> (+2.00%) ↑
... and 5 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 83a02df...66de6c7.

abditag2 commented 4 years ago

@fps7806 This is great! Thanks a lot. I wonder if you could also add the script you use for measuring loader performance with synthetic data. You could add it to a scripts folder; that would be useful for future testing.

abditag2 commented 4 years ago

@selitvin Can you also take a look at this?