uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0
1.78k stars 285 forks source link

Added additional kwargs for Spark Dataset Converter #568

Closed tgaddair closed 4 years ago

tgaddair commented 4 years ago

Adds support for:

tgaddair commented 4 years ago

@selitvin any idea what's causing these test failures? They seem to be unrelated.

selitvin commented 4 years ago

The failure is in the test_tf_make_reader_fn: ValueError: directory url must be a string - are you sure it's unrelated to your change.

selitvin commented 4 years ago

Adding @mengxr for viz.

tgaddair commented 4 years ago

@WeichenXu123, can you explain why support for make_reader was removed in https://github.com/uber/petastorm/commit/6c49ed5e6f6a193faf0701db46e886b9a5be5d5c#diff-b0f255caa777716ecea31adea665c124 ?

tgaddair commented 4 years ago

@mengxr can you comment on the reason for 6c49ed5? We want to be able to use make_reader with Spark Dataset Converter.

mengxr commented 4 years ago

https://github.com/uber/petastorm/pull/503#issue-386154352

We removed make_reader because we want to support an input URL list. @WeichenXu123 's PR only added that support to make_batch_reader.

codecov[bot] commented 4 years ago

Codecov Report

Merging #568 into master will increase coverage by 0.52%. The diff coverage is 77.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
+ Coverage   85.15%   85.67%   +0.52%     
==========================================
  Files          87       87              
  Lines        4970     4976       +6     
  Branches      793      794       +1     
==========================================
+ Hits         4232     4263      +31     
+ Misses        591      577      -14     
+ Partials      147      136      -11     
Impacted Files Coverage Δ
petastorm/spark/spark_dataset_converter.py 91.24% <77.77%> (+0.19%) :arrow_up:
petastorm/py_dict_reader_worker.py 95.23% <0.00%> (+0.79%) :arrow_up:
petastorm/arrow_reader_worker.py 92.00% <0.00%> (+2.00%) :arrow_up:
petastorm/tf_utils.py 88.65% <0.00%> (+3.54%) :arrow_up:
petastorm/compat.py 100.00% <0.00%> (+39.02%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e5d35e7...1e053e4. Read the comment docs.

tgaddair commented 4 years ago

@selitvin removed the make_reader support for now. I think tests should be passing now, though there seems to be a failure due a timeout. Is there an easy way to rerun the tests?

nightflight-dk commented 4 years ago

greetings from MSFT, early perf tests of this branch show promising results, we look forward to seeing this in master

tgaddair commented 4 years ago

Thanks for trying it out, @nightflight-dk. This has now been merged into master.