uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

exposed pyarrow filters in the make_reader and make_batch_reader api #564

Closed by abditag2 4 years ago

codecov[bot] commented 4 years ago

Codecov Report

Merging #564 into master will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #564   +/-   ##
=======================================
  Coverage   85.67%   85.67%           
=======================================
  Files          87       87           
  Lines        4976     4978    +2     
  Branches      794      794           
=======================================
+ Hits         4263     4265    +2     
  Misses        577      577           
  Partials      136      136           

Impacted Files                        Coverage Δ
petastorm/arrow_reader_worker.py      92.05% <100.00%> (+0.05%)
petastorm/py_dict_reader_worker.py    95.27% <100.00%> (+0.03%)
petastorm/reader.py                   90.24% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 0a7f32a...c86a101.

bobingm commented 4 years ago

I just noticed this change. Thank you for supporting this. Is it possible to support passing filters on all columns by setting use_legacy_dataset to False in pyarrow.parquet.ParquetDataset [1]?

[1] https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html?highlight=parquetdataset
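
For context on the question above: pyarrow filters are written in disjunctive normal form, either a flat list of `(column, op, value)` tuples that are ANDed together, or a list of such lists that are ORed. As the linked ParquetDataset documentation describes, the legacy dataset path only lets filters reference partition keys, while `use_legacy_dataset=False` also allows filtering on ordinary columns within files. The following is a minimal pure-Python sketch of those semantics only. It does not call pyarrow or Petastorm; `row_matches`, the sample rows, and the column names are hypothetical and exist purely for illustration.

```python
import operator

# Comparison operators accepted in pyarrow-style filter tuples.
_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def row_matches(row, filters):
    """Evaluate DNF filters against one row (a dict). Hypothetical helper."""
    if not filters:
        return True
    # A flat list of tuples is a single AND group; normalize to OR-of-ANDs.
    groups = filters if isinstance(filters[0], list) else [filters]
    return any(
        all(_OPS[op](row[column], value) for column, op, value in group)
        for group in groups
    )


# Made-up rows: "year" plays the role of a partition key,
# "label" and "score" are ordinary (non-partition) columns.
rows = [
    {"year": 2019, "label": "cat", "score": 0.9},
    {"year": 2020, "label": "dog", "score": 0.4},
    {"year": 2020, "label": "cat", "score": 0.7},
]

# The kind of filter that references only a partition key.
partition_only = [("year", "=", 2020)]

# The kind of filter the comment asks about: conditions on any column,
# here (year == 2020 AND score > 0.5) OR (label == "cat" AND year < 2020).
any_column = [
    [("year", "=", 2020), ("score", ">", 0.5)],
    [("label", "=", "cat"), ("year", "<", 2020)],
]

partition_matches = [r for r in rows if row_matches(r, partition_only)]
any_column_matches = [r for r in rows if row_matches(r, any_column)]
```

With real data, the same filter lists would be handed to the reader rather than evaluated by hand; whether conditions on non-partition columns take effect depends on which pyarrow dataset implementation is in use.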