The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0
[WIP] Use pyarrow serialization with `make_reader` by default #510
PyArrow serialization can be much faster when serializing numpy arrays;
however, it does not preserve numpy scalar types (e.g. an np.int8 value
becomes an int when deserialized).
This change switches PyArrow serialization on by default and no longer
checks scalar types in the tests.
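To make the trade-off concrete, here is a minimal sketch of the scalar-type loss described above and of a comparison that ignores it. It does not call PyArrow or Petastorm: `simulate_scalar_round_trip` and `values_match` are hypothetical helpers written for this illustration, and the round trip is only modeled with numpy's `.item()`, on the assumption (taken from the description) that arrays keep their dtype while scalars come back as plain Python values.

```python
import numpy as np


def simulate_scalar_round_trip(value):
    """Hypothetical stand-in for a serialize/deserialize round trip.

    Models the behaviour described in this PR: numpy scalars come back
    as plain Python values, while numpy arrays are returned unchanged.
    """
    if isinstance(value, np.generic):
        # np.int8(5).item() is the Python int 5
        return value.item()
    return value


def values_match(expected, actual):
    """Compare two values by content only, ignoring the scalar type."""
    return bool(np.array_equal(np.asarray(expected), np.asarray(actual)))


original_scalar = np.int8(5)
restored_scalar = simulate_scalar_round_trip(original_scalar)

original_array = np.arange(3, dtype=np.int8)
restored_array = simulate_scalar_round_trip(original_array)

# The scalar's value survives, but its numpy type does not.
print(type(original_scalar).__name__, "->", type(restored_scalar).__name__)
# The array keeps its dtype in this model.
print(original_array.dtype, "->", restored_array.dtype)
# A type-insensitive check still passes for both.
print(values_match(original_scalar, restored_scalar))
print(values_match(original_array, restored_array))
```

A test that asserts on `type(value)` would fail for the scalar case here, which is why the tests in this change stop checking scalar types and compare values only.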