uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Add spark dataset converter mnist example scripts #530

Closed liangz1 closed 4 years ago

liangz1 commented 4 years ago

There are 2 end-to-end examples in examples/spark_dataset_converter/, one for TensorFlow and one for PyTorch.

These examples are tested in examples/spark_dataset_converter/tests/test_converter_examples.py. The dataset is MNIST in libsvm format, downloaded with the helper download_mnist_libsvm (from examples.spark_dataset_converter.utils import download_mnist_libsvm).
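
For readers unfamiliar with the libsvm text format mentioned above, here is a minimal sketch of how one line maps to a label and a dense feature vector. It only illustrates the file format and is not how the example scripts load the data; the function name parse_libsvm_line and the sample line are hypothetical.

```python
from typing import List, Tuple


def parse_libsvm_line(line: str, num_features: int) -> Tuple[int, List[float]]:
    """Parse one 'label index:value ...' line into (label, dense feature list)."""
    parts = line.split()
    # Labels may be written as '5' or '5.0', so go through float first.
    label = int(float(parts[0]))
    features = [0.0] * num_features
    for item in parts[1:]:
        index, value = item.split(":")
        # libsvm feature indices are 1-based; unlisted features stay 0.0.
        features[int(index) - 1] = float(value)
    return label, features


# Hypothetical sample line with 4 features (a real MNIST line has 784).
label, features = parse_libsvm_line("5 1:0.5 3:1.0", num_features=4)
print(label, features)
```

The format is sparse, so only non-zero pixels are stored on each line and the dense vector has to be sized by the caller.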

There are also a few fixes:

  1. Removed check_parent_url. We now read the value directly from the Spark conf.
  2. I used tensorflow==1.15.0 for the examples in order to use keras.losses.SparseCategoricalCrossentropy.
  3. Refactored the import "from tensorflow.python.framework.errors_impl import OutOfRangeError", since it is not in tensorflow==1.15.0.
  4. In _default_delete_dir_handler, do not try to delete the file if it does not exist.
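
The idea behind fix 4 can be sketched as follows. This is a local-filesystem illustration only, under the assumption that the path is a plain local directory; it is not the actual body of _default_delete_dir_handler, and the name delete_dir_if_exists is hypothetical.

```python
import os
import shutil
import tempfile


def delete_dir_if_exists(path: str) -> bool:
    """Remove a local directory tree; return False if nothing was there."""
    if not os.path.exists(path):
        # Nothing to delete, so skip the removal instead of raising.
        return False
    shutil.rmtree(path)
    return True


# Hypothetical usage: the second call is a no-op rather than an error.
cache_dir = tempfile.mkdtemp()
print(delete_dir_if_exists(cache_dir))
print(delete_dir_if_exists(cache_dir))
```

Checking for existence first makes the cleanup idempotent, which matters when the same cache directory may be cleaned up more than once.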

codecov[bot] commented 4 years ago

Codecov Report

Merging #530 into master will increase coverage by 0.21%. The diff coverage is 90.96%.

@@            Coverage Diff             @@
##           master     #530      +/-   ##
==========================================
+ Coverage   86.32%   86.53%   +0.21%     
==========================================
  Files          81       85       +4     
  Lines        4525     4694     +169     
  Branches      731      737       +6     
==========================================
+ Hits         3906     4062     +156     
- Misses        505      515      +10     
- Partials      114      117       +3     
Impacted Files                                          Coverage Δ
setup.py                                                0.00% <ø> (ø)
petastorm/spark/spark_dataset_converter.py              92.69% <57.14%> (-0.03%) ↓
..._dataset_converter/tensorflow_converter_example.py   85.10% <85.10%> (ø)
...ark_dataset_converter/pytorch_converter_example.py   94.00% <94.00%> (ø)
..._dataset_converter/tests/test_converter_example.py   100.00% <100.00%> (ø)
examples/spark_dataset_converter/utils.py               100.00% <100.00%> (ø)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a768179...0588339.

WeichenXu123 commented 4 years ago

@liangz1 I fixed the errors. Now you need to add pylint back on the example code and fix the pylint errors it reports.