uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Reader: shuffle row groups before sharding. #662

Status: Closed (chongxiaoc closed this 3 years ago)

chongxiaoc commented 3 years ago

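Splitting row groups across shards by index % num_shards alone is not a good way to split the data, since each shard's contents then depend entirely on the order in which the row groups are stored. This commit introduces a shard_seed parameter to randomly shuffle row groups before sharding.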
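
To illustrate the idea, here is a minimal sketch of seeded shuffling before modulo sharding. It is not the code from this PR: the function name shard_row_groups and its arguments cur_shard and shard_count are chosen for illustration, and only shard_seed is taken from the description above. It assumes every worker passes the same seed, so that all workers shuffle identically and the shards stay disjoint.

```python
import random


def shard_row_groups(row_groups, cur_shard, shard_count, shard_seed=None):
    """Return the row groups assigned to one shard (hypothetical helper).

    When shard_seed is given, the row groups are shuffled with a seeded
    generator before being split, so the assignment no longer follows the
    stored order. All workers must use the same seed.
    """
    if not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must be in the range [0, shard_count)")

    pieces = list(row_groups)
    if shard_seed is not None:
        # A private Random instance keeps the global random state untouched.
        random.Random(shard_seed).shuffle(pieces)

    # Equivalent to keeping every piece whose index % shard_count == cur_shard.
    return pieces[cur_shard::shard_count]
```

Calling the helper once per shard with the same shard_seed partitions the row groups: every row group lands in exactly one shard, and repeating a call with the same seed returns the same assignment.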

codecov[bot] commented 3 years ago

Codecov Report

Merging #662 (3fe510b) into master (10e0fc8) will increase coverage by 0.01%. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #662      +/-   ##
==========================================
+ Coverage   85.22%   85.23%   +0.01%     
==========================================
  Files          85       85              
  Lines        4981     4985       +4     
  Branches      791      792       +1     
==========================================
+ Hits         4245     4249       +4     
  Misses        596      596              
  Partials      140      140              
Impacted Files         Coverage Δ
petastorm/reader.py    89.67% <100.00%> (+0.19%) ⬆


chongxiaoc commented 3 years ago

@selitvin I addressed the comments, can you take one more look?

chongxiaoc commented 3 years ago

@selitvin please take another look; I fixed the codecov issues.

chongxiaoc commented 3 years ago

@selitvin updated release notes