uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Make "Generate Dataset" example work with newer pyspark #676

Closed selitvin closed 3 years ago

selitvin commented 3 years ago

The current example in README.rst fails with the error _pickle.PicklingError: Could not serialize object: ValueError: Cell is empty. The root cause of the failure is not clear, but it appears to depend on the order of imports. I assume this is a pyspark pickling issue.

The new example is taken from the current version of examples/hello_world/petastorm_dataset/generate_petastorm_dataset.py, which is tested by CI.
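
For context on the error text: the inner ValueError: Cell is empty is the message CPython raises when code reads the contents of a closure cell that has not been assigned yet. The sketch below reproduces only that message in plain Python. It does not use pyspark or Petastorm, the function name read_empty_cell is hypothetical, and it is not claimed to be the root cause of the README failure, which remains unclear.

```python
def read_empty_cell():
    """Return the message raised when an unassigned closure cell is read."""
    def inner():
        return value  # free variable captured from the enclosing scope

    # The closure cell for `value` exists already, but nothing is stored in it.
    cell = inner.__closure__[0]
    try:
        cell.cell_contents
    except ValueError as exc:
        message = str(exc)
    else:
        message = None

    value = 1  # the assignment happens only after the cell was inspected
    return message


if __name__ == "__main__":
    print(read_empty_cell())
```

A pickler that walks a function's closure and touches such a cell before it is filled would surface the same ValueError, which is one plausible way import order could matter; that link is an assumption here, not something this PR establishes.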

codecov[bot] commented 3 years ago

Codecov Report

Merging #676 (244f053) into master (7731117) will not change coverage. The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #676   +/-   ##
=======================================
  Coverage   85.89%   85.89%           
=======================================
  Files          84       84           
  Lines        4956     4956           
  Branches      788      788           
=======================================
  Hits         4257     4257           
  Misses        560      560           
  Partials      139      139           
