uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Reader: enable shuffling inside every row group #767

Closed (chongxiaoc closed this 1 year ago)

chongxiaoc commented 1 year ago
codecov[bot] commented 1 year ago

Codecov Report

Merging #767 (5d18c4b) into master (3f24800) will increase coverage by 0.03%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #767      +/-   ##
==========================================
+ Coverage   86.26%   86.30%   +0.03%     
==========================================
  Files          85       85              
  Lines        5081     5095      +14     
  Branches      783      786       +3     
==========================================
+ Hits         4383     4397      +14     
  Misses        559      559              
  Partials      139      139              
| Impacted Files | Coverage Δ |
|---|---|
| petastorm/arrow_reader_worker.py | 91.19% <100.00%> (+0.28%) ↑ |
| petastorm/py_dict_reader_worker.py | 95.58% <100.00%> (+0.13%) ↑ |
| petastorm/reader.py | 90.86% <100.00%> (+0.16%) ↑ |
| petastorm/workers_pool/ventilator.py | 93.33% <100.00%> (+0.09%) ↑ |


chongxiaoc commented 1 year ago

Attaching some benchmark results using a synthetic dataset with PyTorch:

[Figure: Synthetic Dataset PyTorch Throughput]

Shuffling in the reader is expected to yield higher throughput because multiple workers shuffle in parallel.
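For readers unfamiliar with the idea, here is a rough, library-independent sketch of per-row-group shuffling done by a pool of workers. It does not use Petastorm's API and is not the code from this PR: `shuffle_row_group`, `read_shuffled`, and the in-memory `row_groups` lists are hypothetical stand-ins for the reader workers and Parquet row groups. A thread pool is used only to show the structure; in pure Python the GIL limits the actual speedup.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def shuffle_row_group(rows, seed):
    """Return a shuffled copy of one row group's rows (hypothetical helper)."""
    rng = random.Random(seed)  # per-group RNG keeps the result reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


def read_shuffled(row_groups, workers=4, seed=0):
    """Shuffle each row group in a worker pool and yield rows group by group."""
    # Assumption: one distinct seed per row group, derived from a base seed.
    seeds = [seed + i for i in range(len(row_groups))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves row-group order; only rows inside a group move.
        for shuffled in pool.map(shuffle_row_group, row_groups, seeds):
            yield from shuffled


# Three toy "row groups" of five rows each.
row_groups = [list(range(0, 5)), list(range(5, 10)), list(range(10, 15))]
rows = list(read_shuffled(row_groups))
```

Each worker only touches its own row group, so no row crosses a group boundary and the shuffling work is spread across the pool instead of being done in one place downstream.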

fyi @selitvin