uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Simplify data conversion from Spark: support vector type and precision cast #522

Closed liangz1 closed 4 years ago

liangz1 commented 4 years ago

This PR supports the following features in petastorm.spark.make_spark_converter():

test_vector_to_array() is skipped for pyspark<3.0.0. I tested it locally with pyspark==3.0.0.dev0, because that package is not yet available on PyPI and so cannot be exercised in CI.
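For readers unfamiliar with version-gated tests, here is a minimal sketch of the kind of check described above. The helper names `major_version` and `should_skip_vector_test` are hypothetical and are not taken from the petastorm test suite. The sketch compares only the major version, so a pre-release such as 3.0.0.dev0 counts as 3.x, which is an assumption about the intended behavior.

```python
import re


def major_version(version_string):
    """Return the leading integer of a version string such as '3.0.0.dev0'."""
    match = re.match(r"\d+", version_string)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version_string)
    return int(match.group())


def should_skip_vector_test(pyspark_version):
    """Hypothetical gate: True when the given pyspark version predates 3.x."""
    return major_version(pyspark_version) < 3


# Show the decision for a few representative version strings.
for candidate in ("2.4.5", "3.0.0.dev0", "3.0.0"):
    print(candidate, should_skip_vector_test(candidate))
```

In a pytest suite, the result could be passed to `pytest.mark.skipif(should_skip_vector_test(pyspark.__version__), reason="requires pyspark>=3.0.0")` on the test function.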

codecov[bot] commented 4 years ago

Codecov Report

Merging #522 into master will increase coverage by 0.06%. The diff coverage is 94.28%.

@@            Coverage Diff             @@
##           master     #522      +/-   ##
==========================================
+ Coverage   86.24%   86.31%   +0.06%     
==========================================
  Files          81       81              
  Lines        4471     4501      +30     
  Branches      718      726       +8     
==========================================
+ Hits         3856     3885      +29     
- Misses        503      504       +1     
  Partials      112      112              

Impacted Files | Coverage Δ
petastorm/spark/spark_dataset_converter.py | 93.30% <94.28%> (+0.48%)

Last update b425e43...9b63933.