uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0
1.78k stars, 285 forks

Added is_petastorm_compatible to make_reader #520

Closed: abditag2 closed this 4 years ago

abditag2 commented 4 years ago

This allows make_reader to read datasets that are compatible with petastorm but were not generated with it. For instance, a dataset that contains only primitive types such as ints and float arrays.

codecov[bot] commented 4 years ago

Codecov Report

Merging #520 into master will decrease coverage by 0.03%. The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master     #520      +/-   ##
==========================================
- Coverage   86.02%   85.99%   -0.04%     
==========================================
  Files          81       81              
  Lines        4402     4405       +3     
  Branches      704      705       +1     
==========================================
+ Hits         3787     3788       +1     
- Misses        504      505       +1     
- Partials      111      112       +1     
Impacted Files        Coverage Δ
petastorm/reader.py   90.18% <50.00%> (-0.81%) ↓

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 0b70510...b6aa4f9.

selitvin commented 4 years ago

@abditag2, do we want to land this?

abditag2 commented 4 years ago

@evgeny-goldin It is not needed anymore. Given that make_batch_reader is faster for the target dataset, it is probably better to keep raising the exception so that users switch to the faster kind of reader. I will close this.

In HorovodEstimators, I ended up using both types of readers. If a transform needs to be applied, I use make_reader; otherwise, I use make_batch_reader.
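
For readers of this thread, the selection rule described above could be sketched roughly as follows. This is only an illustration, not the actual HorovodEstimators code: the helper names pick_reader_kind and open_reader are hypothetical, and the sketch assumes a transform is passed as a petastorm TransformSpec through the transform_spec keyword. Note that make_batch_reader also accepts transform_spec, so routing transforms to make_reader is the choice described in the comment, not a library requirement.

```python
def pick_reader_kind(transform_spec=None):
    """Return the name of the petastorm reader factory to use.

    Hypothetical helper: mirrors the rule from the comment above
    (transform present -> make_reader, otherwise -> make_batch_reader).
    """
    return "make_reader" if transform_spec is not None else "make_batch_reader"


def open_reader(dataset_url, transform_spec=None, **reader_kwargs):
    """Open a petastorm reader chosen by pick_reader_kind (hypothetical helper).

    dataset_url is a URL such as 'file:///tmp/my_dataset'; any extra keyword
    arguments are forwarded unchanged to the selected factory.
    """
    # Imported lazily so that pick_reader_kind can be used (and tested)
    # without petastorm being installed.
    import petastorm

    factory = getattr(petastorm, pick_reader_kind(transform_spec))
    if transform_spec is not None:
        reader_kwargs["transform_spec"] = transform_spec
    return factory(dataset_url, **reader_kwargs)
```

A caller would then wrap the result in a context manager, for example `with open_reader(url) as reader:`, and iterate over it. Keep in mind that make_reader yields one row at a time while make_batch_reader yields batches of rows, so downstream code has to handle both shapes.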