The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0
Support reading from a partitioned dataset. Interpret types of the partition-by scalars properly. Also, remove dependency on pyspark while reading using make_batch_reader. #383
We use pyarrow resolved types for the partition-by key (in unischema.py) and propagate these types to the proper configuration of UnischemaField.
This diff also removes import pyspark statements from the make_batch_reader import/execution path, so that it can be used without pyspark installed in the Python environment.
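One way to check that an import path stays free of pyspark is to import the module in a fresh interpreter and inspect sys.modules afterwards. The sketch below is only an illustration of that idea: the helper name import_pulls_in and the exit-code convention are hypothetical and are not part of this diff or of petastorm.

```python
import subprocess
import sys

# Exit code the probe uses to signal "forbidden package was loaded".
# Chosen to differ from 1, which Python uses for an uncaught exception.
_LOADED_EXIT_CODE = 3


def import_pulls_in(module: str, forbidden: str) -> bool:
    """Return True if importing `module` in a fresh interpreter also loads `forbidden`.

    Hypothetical helper for illustration only. A subprocess is used so that
    modules already imported by the calling process do not skew the result.
    """
    probe = (
        "import importlib, sys\n"
        f"importlib.import_module({module!r})\n"
        f"prefix = {forbidden!r} + '.'\n"
        f"loaded = any(n == {forbidden!r} or n.startswith(prefix) for n in sys.modules)\n"
        f"sys.exit({_LOADED_EXIT_CODE} if loaded else 0)\n"
    )
    result = subprocess.run([sys.executable, "-c", probe], capture_output=True)
    if result.returncode not in (0, _LOADED_EXIT_CODE):
        # The import itself failed (for example, the module is not installed).
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.returncode == _LOADED_EXIT_CODE
```

Assuming petastorm is installed and that make_batch_reader is importable from petastorm.reader, a call such as import_pulls_in("petastorm.reader", "pyspark") would be expected to return False once the pyspark imports are gone from that path. The check is most meaningful in an environment where pyspark is installed, since otherwise a leftover import would surface as an ImportError instead.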