uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0
1.78k stars 285 forks source link

bake mnist data into docker image #566

Closed abditag2 closed 4 years ago

abditag2 commented 4 years ago

The link to download mnist data is very slow and tests time out. this diff bakes the mnist data into the docker image.

codecov[bot] commented 4 years ago

Codecov Report

Merging #566 into master will decrease coverage by 3.01%. The diff coverage is 18.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #566      +/-   ##
==========================================
- Coverage   85.78%   82.77%   -3.02%     
==========================================
  Files          87       87              
  Lines        4968     4969       +1     
  Branches      792      793       +1     
==========================================
- Hits         4262     4113     -149     
- Misses        572      723     +151     
+ Partials      134      133       -1     
Impacted Files Coverage Δ
...ark_dataset_converter/pytorch_converter_example.py 0.00% <0.00%> (-94.00%) :arrow_down:
..._dataset_converter/tensorflow_converter_example.py 0.00% <0.00%> (-85.11%) :arrow_down:
examples/spark_dataset_converter/utils.py 31.25% <25.00%> (-68.75%) :arrow_down:
..._dataset_converter/tests/test_converter_example.py 58.33% <50.00%> (-41.67%) :arrow_down:
petastorm/spark/spark_dataset_converter.py 90.26% <0.00%> (-1.50%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 1ed3f60...1299d24. Read the comment docs.

abditag2 commented 4 years ago

Hi @abditag2 - thanks for your contribution here! However, I would prefer that we do not bake data into the main container. If downloading MNIST data is an inconvenience, can we find an alternative here? How does storing it in the container fix this issue for you? If you need to keep it in a container, would you be willing to create a second Dockerfile that inherits from the main one and then use that?

@jsgoller1 The link to download the MNIST dataset is a university link and at times it is extremely slow i.e., 3kb/s slow. This diff bakes the data into the docker and for the tests, eliminates the data download requirements. When running the tests, they read the data from local folder rather than downloading every time.

However, I have updated the tests to use the download if the data is not present in the docker image. So, it does not overwrite previous behavior.

Please look at my other diff. We built a docker image using this branch and I am using this docker in the other diff to unblock tests.