Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
NdarrayCodec and other codecs not implementing __str__ defeats the purpose of Unischema implementing it, because it's defined recursively. Note that UnischemaField has a string form implicitly as a subclass of namedtuple, but it's being circumvented presumably because it uses default string forms that are not human readable.
Reproduction
>>> from petastorm.codecs import NdarrayCodec
>>> codec = NdarrayCodec()
>>> str(codec)
'<petastorm.codecs.NdarrayCodec object at 0x1027e4e48>'
>>>
>>> from petastorm.unischema import UnischemaField
>>> field = UnischemaField('features', np.float32, (1000,), codec)
>>> str(field)
"UnischemaField(name='features', numpy_dtype=<class 'numpy.float32'>, shape=(1000,), codec=<petastorm.codecs.NdarrayCodec object at 0x1027e4e48>, nullable=False)"
Example
Unischema(Hdf5Schema, [
UnischemaField('features', float32, (1000,), <petastorm.codecs.NdarrayCodec object at 0x11f412b38>, False),
UnischemaField('label', float32, (1,), <petastorm.codecs.NdarrayCodec object at 0x1044286a0>, False),
])
Motivation
NdarrayCodec
and other codecs not implementing__str__
defeats the purpose of Unischema implementing it, because it's defined recursively. Note thatUnischemaField
has a string form implicitly as a subclass ofnamedtuple
, but it's being circumvented presumably because it uses default string forms that are not human readable.Reproduction
Example