uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Support reading from a partitioned dataset. Interpret types of the partition-by scalars properly. Also, remove dependency on pyspark while reading using make_batch_reader. #383

Closed selitvin closed 5 years ago

selitvin commented 5 years ago

We take the types that pyarrow resolves for the partition-by keys (in unischema.py) and propagate them into the corresponding UnischemaField definitions, so partition columns are read with their proper types.
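To make the idea concrete, here is a minimal, dependency-free sketch of resolving partition-key types from hive-style directory names and carrying them into field definitions. It is not the code from this diff: `FieldSketch`, `partition_values`, `infer_type`, and `partition_fields` are hypothetical names, and the "int if every value parses as an int, else str" rule is an assumption for illustration; pyarrow's actual resolution logic may differ.

```python
from collections import namedtuple

# Hypothetical stand-in for a schema field; NOT petastorm's UnischemaField.
FieldSketch = namedtuple('FieldSketch', ['name', 'python_type', 'nullable'])


def partition_values(paths):
    """Collect the raw string values of each partition key from hive-style paths."""
    values = {}
    for path in paths:
        for part in path.split('/'):
            if '=' in part:
                key, _, raw = part.partition('=')
                values.setdefault(key, []).append(raw)
    return values


def infer_type(raw_values):
    """Assumed rule: int if every value parses as an int, otherwise str."""
    try:
        for raw in raw_values:
            int(raw)
        return int
    except ValueError:
        return str


def partition_fields(paths):
    """Build one typed field per partition key, sorted by key name."""
    return [FieldSketch(name, infer_type(raws), False)
            for name, raws in sorted(partition_values(paths).items())]
```

With paths such as `year=2019/country=US/part-0.parquet`, the `year` key would resolve to `int` and `country` to `str`, instead of every partition column being treated as a string.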

This diff also removes the import pyspark statements from the make_batch_reader import/execution path, so that it can be used without pyspark installed in the Python environment.