chongxiaoc commented 3 years ago

Splitting row groups by `% num_shards` alone is not a good way to split the data. This commit introduces a `shard_seed` parameter to randomly shuffle row groups before sharding.
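
To illustrate the idea, here is a minimal sketch of seeded shuffling followed by modulo sharding. It is not the code from this PR: `shard_row_groups`, `cur_shard`, and `shard_count` are hypothetical names chosen for the example, and only `shard_seed` comes from the description above.

```python
import random


def shard_row_groups(row_groups, cur_shard, shard_count, shard_seed=None):
    """Return the row groups assigned to one shard (illustrative helper only).

    Assumption: every worker calls this with the same `row_groups` order and
    the same `shard_seed`, so all workers agree on the shuffled order.
    """
    if not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must be in [0, shard_count)")

    ordered = list(row_groups)  # copy so the caller's list is not mutated
    if shard_seed is not None:
        # A dedicated Random instance keeps the shuffle deterministic per seed
        # and leaves the global random state untouched.
        random.Random(shard_seed).shuffle(ordered)

    # Plain modulo assignment, applied to the (possibly shuffled) order.
    return [rg for i, rg in enumerate(ordered) if i % shard_count == cur_shard]
```

With the same seed on every worker, the shards stay disjoint and together cover all row groups; with `shard_seed=None` the sketch falls back to plain modulo splitting.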

codecov[bot] commented 3 years ago

Codecov Report

Merging #662 (3fe510b) into master (10e0fc8) will increase coverage by 0.01%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #662      +/-   ##
==========================================
+ Coverage   85.22%   85.23%   +0.01%     
==========================================
  Files          85       85              
  Lines        4981     4985       +4     
  Branches      791      792       +1     
==========================================
+ Hits         4245     4249       +4     
  Misses        596      596              
  Partials      140      140

| Impacted Files | Coverage Δ | |
|---|---|---|
| petastorm/reader.py | `89.67% <100.00%> (+0.19%)` | :arrow_up: |


chongxiaoc commented 3 years ago

@selitvin I addressed the comments, can you take one more look?

chongxiaoc commented 3 years ago

@selitvin please take a look again, fixed codecov issues.

chongxiaoc commented 3 years ago

@selitvin updated release notes

uber / petastorm

Reader: shuffle row groups before sharding. #662
