Codecov Report

Merging #767 (5d18c4b) into master (3f24800) will increase coverage by 0.03%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #767      +/-   ##
==========================================
+ Coverage   86.26%   86.30%   +0.03%     
==========================================
  Files          85       85              
  Lines        5081     5095      +14     
  Branches      783      786       +3     
==========================================
+ Hits         4383     4397      +14     
  Misses        559      559              
  Partials      139      139

| Impacted Files | Coverage Δ | |
|---|---|---|
| petastorm/arrow_reader_worker.py | `91.19% <100.00%> (+0.28%)` | :arrow_up: |
| petastorm/py_dict_reader_worker.py | `95.58% <100.00%> (+0.13%)` | :arrow_up: |
| petastorm/reader.py | `90.86% <100.00%> (+0.16%)` | :arrow_up: |
| petastorm/workers_pool/ventilator.py | `93.33% <100.00%> (+0.09%)` | :arrow_up: |


chongxiaoc commented 1 year ago

Attaching some benchmark results using a synthetic dataset with PyTorch:

Experiment setup: BatchedDataLoader + make_batch_reader(), batch_size=50000, shuffle_buffer_size=1000000, thread_pool, 10 workers.
Datasets ranging from 100M to 1.6B rows were tested.
The comparison is throughput with shuffling in the dataloader versus shuffling in the reader.
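
For reference, the two configurations could be set up roughly as below. This is only a sketch, not the actual benchmark script. It assumes that `shuffle_rows` is the reader-side switch this PR adds, that `shuffling_queue_capacity` is the `BatchedDataLoader` argument behind "shuffle_buffer_size", and that the dataloader buffer is turned off when the reader shuffles (the setup above does not say so). `benchmark_kwargs`, `build_loader` and `dataset_url` are hypothetical names.

```python
def benchmark_kwargs(shuffle_in_reader):
    """Return (reader_kwargs, loader_kwargs) for one of the two compared setups."""
    reader_kwargs = {
        "reader_pool_type": "thread",
        "workers_count": 10,
        # Assumed name of the per-row-group shuffle switch added by this PR.
        "shuffle_rows": shuffle_in_reader,
    }
    loader_kwargs = {
        "batch_size": 50000,
        # Assumption: shuffle in the dataloader only when the reader does not.
        "shuffling_queue_capacity": 0 if shuffle_in_reader else 1000000,
    }
    return reader_kwargs, loader_kwargs


def build_loader(dataset_url, shuffle_in_reader):
    """Build a BatchedDataLoader for the chosen setup (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without petastorm installed.
    from petastorm import make_batch_reader
    from petastorm.pytorch import BatchedDataLoader

    reader_kwargs, loader_kwargs = benchmark_kwargs(shuffle_in_reader)
    reader = make_batch_reader(dataset_url, **reader_kwargs)
    return BatchedDataLoader(reader, **loader_kwargs)
```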

Synthetic Dataset PyTorch Throughput

Shuffling in the reader is expected to yield higher throughput, since multiple workers shuffle in parallel.
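
To illustrate the idea of workers shuffling in parallel, here is a minimal standard-library sketch. It is not petastorm's implementation: `shuffle_row_group`, `shuffle_in_workers` and the per-group seeding scheme are made up for illustration. Each row group is shuffled independently on a thread pool, so rows only move within their own group.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def shuffle_row_group(rows, seed):
    """Return a shuffled copy of one row group using its own RNG."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


def shuffle_in_workers(row_groups, workers=10, seed=0):
    """Shuffle every row group independently on a thread pool."""
    # Hypothetical seeding scheme: one derived seed per row group.
    seeds = [seed + i for i in range(len(row_groups))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves row-group order; only rows inside a group move.
        return list(pool.map(shuffle_row_group, row_groups, seeds))


# Three toy row groups of five rows each.
row_groups = [list(range(start, start + 5)) for start in (0, 5, 10)]
shuffled_groups = shuffle_in_workers(row_groups, workers=3, seed=42)
```

The sketch only shows the structure. Pure-Python shuffling on threads would not show the speedup, which depends on the per-worker work running concurrently.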

fyi @selitvin

uber / petastorm

Reader: enable shuffling inside every row group #767
