uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Ignore an invalid piece created for a subdirectory when a dataset is stored in an s3 bucket subdirectory #588

Closed selitvin closed 4 years ago

selitvin commented 4 years ago

When a Parquet store is read directly from an S3 bucket, a separate piece is created for the root directory. This is not a real "piece", so no row_groups_per_file entry is recorded for it. This change ignores such pieces.
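To illustrate the idea, here is a minimal sketch of skipping pieces that have no recorded row-group count. It is not the code from this PR: `Piece`, `relative_to`, and `expand_row_groups` are hypothetical names, and the assumption that `row_groups_per_file` maps a file path relative to the dataset root to a row-group count is made only for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for a dataset piece: only a path is modeled."""
    path: str


def relative_to(path: str, base: str) -> str:
    """Return `path` relative to `base`; the base itself maps to ''."""
    base = base.rstrip("/")
    path = path.rstrip("/")
    if path == base:
        return ""
    if path.startswith(base + "/"):
        return path[len(base) + 1:]
    return path


def expand_row_groups(
    pieces: List[Piece],
    row_groups_per_file: Dict[str, int],
    base_path: str,
) -> List[Tuple[str, int]]:
    """Expand each piece into (path, row_group_index) pairs.

    Pieces with no entry in `row_groups_per_file` (for example, a
    placeholder created for the directory itself) are skipped instead
    of raising a KeyError.
    """
    result = []
    for piece in pieces:
        key = relative_to(piece.path, base_path)
        if key not in row_groups_per_file:
            # Not a real piece: no row-group count was recorded for it.
            continue
        for row_group in range(row_groups_per_file[key]):
            result.append((piece.path, row_group))
    return result
```

With a directory placeholder such as `Piece("bucket/dataset")` mixed in among real files, only the files listed in `row_groups_per_file` contribute row groups.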

codecov[bot] commented 4 years ago

Codecov Report

Merging #588 into master will decrease coverage by 0.01%. The diff coverage is 66.66%.


@@            Coverage Diff             @@
##           master     #588      +/-   ##
==========================================
- Coverage   85.19%   85.18%   -0.02%     
==========================================
  Files          87       87              
  Lines        4993     4994       +1     
  Branches      794      795       +1     
==========================================
  Hits         4254     4254              
  Misses        592      592              
- Partials      147      148       +1     
Impacted Files                       Coverage Δ
petastorm/etl/dataset_metadata.py    87.41% <66.66%> (-0.59%)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4b21a70...d6d500c.