Closed. fps7806 closed this pull request 4 years ago.
Merging #540 into master will increase coverage by 3.03%. The diff coverage is 72.77%.
```diff
@@            Coverage Diff             @@
##           master     #540      +/-   ##
==========================================
+ Coverage   82.88%   85.92%    +3.03%
==========================================
  Files          85       87        +2
  Lines        4721     4922      +201
  Branches      744      780       +36
==========================================
+ Hits         3913     4229      +316
+ Misses        678      569      -109
+ Partials      130      124        -6
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| petastorm/benchmark/dummy_reader.py | 0.00% <0.00%> (ø) | |
| petastorm/reader.py | 90.73% <ø> (ø) | |
| petastorm/reader_impl/pytorch_shuffling_buffer.py | 96.42% <96.42%> (ø) | |
| petastorm/pytorch.py | 94.21% <97.43%> (+1.53%) | :arrow_up: |
| petastorm/etl/dataset_metadata.py | 88.00% <100.00%> (ø) | |
| petastorm/py_dict_reader_worker.py | 95.23% <0.00%> (+0.79%) | :arrow_up: |
| petastorm/spark/spark_dataset_converter.py | 91.76% <0.00%> (+1.49%) | :arrow_up: |
| petastorm/arrow_reader_worker.py | 92.00% <0.00%> (+2.00%) | :arrow_up: |
| ... and 5 more | | |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 83a02df...66de6c7. Read the comment docs.
@fps7806 This is great! Thanks a lot. I wonder if you could also add the scripts you use for measuring loader performance with synthetic data. You could add them to a scripts folder. That'd be useful for future testing.
@selitvin Can you also take a look at this?
The current DataLoader implementation relies on Python-level iteration over individual rows of a Parquet dataset. This is reasonable for small datasets but runs into throughput limitations:
Previous timing:
The new approach converts the data into torch tensors BEFORE appending it to the shuffling buffer, and then relies on batched operations for all appending/sampling from the buffer.
New timing:
New timing with GPU-buffering:
In the best-case scenario, the samples/s increases by two orders of magnitude. Image-based problems would likely not achieve this throughput, but these numbers are representative of the throughput on tabular-data problems.
The proposed changes can be used out-of-the-box in most scenarios.
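For reference, a minimal usage sketch with the existing loader API (the dataset URL and the parameter values below are placeholders, not taken from this PR):

```python
# Minimal usage sketch, assuming a petastorm dataset already exists at
# file:///tmp/my_dataset (hypothetical path). The loader is driven the
# same way as before; the batched shuffling happens internally.
from petastorm import make_reader
from petastorm.pytorch import DataLoader

with DataLoader(make_reader('file:///tmp/my_dataset'),
                batch_size=32,
                shuffling_queue_capacity=4096) as loader:
    for batch in loader:
        # batch maps field names to torch tensors
        pass
```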
There are still many improvements that can be made to the `BatchedShufflingBuffer` logic, but this implementation provides an initial step in that direction. From my own experiments, I think any approach that relies on `collate_fn` won't be able to match batched operations in a single thread; the sketch below illustrates the difference.
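To make the idea concrete, here is a heavily simplified sketch of a batched shuffling buffer (a toy stand-in, not the actual `BatchedShufflingBuffer` added by this PR): rows enter and leave as whole tensors, so appending and sampling are single vectorized operations instead of per-row Python calls.

```python
import torch

class TinyBatchedShufflingBuffer(object):
    """Toy batched shuffling buffer: rows live in one tensor and are
    appended/sampled with vectorized ops rather than per-row Python calls."""

    def __init__(self, capacity, batch_size):
        self._capacity = capacity
        self._batch_size = batch_size
        self._store = None  # (n, ...) tensor of buffered rows

    def can_add(self):
        return self._store is None or self._store.shape[0] < self._capacity

    def add_many(self, batch):
        # Append an entire batch with a single concatenation.
        self._store = batch if self._store is None else torch.cat([self._store, batch])

    def can_retrieve(self):
        return self._store is not None and self._store.shape[0] >= self._batch_size

    def retrieve(self):
        # Draw a random batch with one gather; keep the remainder buffered.
        perm = torch.randperm(self._store.shape[0])
        picked, rest = perm[:self._batch_size], perm[self._batch_size:]
        batch = self._store[picked]
        self._store = self._store[rest]
        return batch

buf = TinyBatchedShufflingBuffer(capacity=1000, batch_size=32)
buf.add_many(torch.randn(100, 8))
print(buf.retrieve().shape)  # torch.Size([32, 8])
```

A per-row buffer pays Python and `collate_fn` overhead on every sample; here that cost is amortized over the whole batch in a single thread.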
As a bonus, I set `metadata_nthreads=10` to make the dataset metadata faster to load.
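(For context: `metadata_nthreads` is an option of pyarrow's legacy `ParquetDataset` API that parallelizes reading of per-file Parquet metadata during dataset discovery; roughly, and with a hypothetical local path:)

```python
import pyarrow.parquet as pq

# metadata_nthreads parallelizes Parquet metadata reads during dataset
# discovery; '/tmp/my_dataset' is a hypothetical path.
dataset = pq.ParquetDataset('/tmp/my_dataset', metadata_nthreads=10)
print(len(dataset.pieces))
```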
EDIT: I added a `dummy_reader.py` benchmark:

```
python -m petastorm.benchmark.dummy_reader cuda 10000 32
```
The arguments are `cpu`/`cuda`, reader batch size, and data shape.
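For a quick sanity check without a real dataset, a tiny synthetic throughput loop in the same spirit (a sketch only, not the actual `dummy_reader.py`) could look like:

```python
import sys
import time

import torch

def main(device, batch_size, shape, num_batches=100):
    # Generate random batches, optionally move them to the GPU, and
    # report end-to-end samples/s.
    start = time.time()
    for _ in range(num_batches):
        batch = torch.randn(batch_size, shape).to(device)
    elapsed = time.time() - start
    print('%.0f samples/s' % (num_batches * batch_size / elapsed))

if __name__ == '__main__':
    # e.g.: python bench_sketch.py cuda 10000 32
    main(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))
```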