uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

NdarrayCodec does not implement __str__ #559

Closed dmcguire81 closed 4 years ago

dmcguire81 commented 4 years ago

Motivation

Because NdarrayCodec and the other codecs do not implement __str__, Unischema's own __str__ loses most of its value: it is defined recursively over the schema's fields, so every codec still falls back to the default object string. Note that UnischemaField already has an implicit string form as a subclass of namedtuple, but Unischema circumvents it, presumably because that form relies on default strings that are not human readable.

Reproduction

>>> from petastorm.codecs import NdarrayCodec
>>> codec = NdarrayCodec()
>>> str(codec)
'<petastorm.codecs.NdarrayCodec object at 0x1027e4e48>'
>>>
>>> import numpy as np
>>> from petastorm.unischema import UnischemaField
>>> field = UnischemaField('features', np.float32, (1000,), codec)
>>> str(field)
"UnischemaField(name='features', numpy_dtype=<class 'numpy.float32'>, shape=(1000,), codec=<petastorm.codecs.NdarrayCodec object at 0x1027e4e48>, nullable=False)"
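
The output above follows from standard Python behavior and can be reproduced without petastorm. The sketch below uses hypothetical stand-in classes (`PlainCodec`, `StrOnlyCodec`, `ReprCodec`, and `Field` are not petastorm names). It suggests one detail worth keeping in mind: a namedtuple builds its string form from `repr()` of each field, so a codec that defines only `__str__` would still show the default form there, while one that defines `__repr__` is covered in both cases.

```python
from collections import namedtuple


# Hypothetical stand-ins for illustration only; these are not petastorm classes.
class PlainCodec:
    """Defines neither __str__ nor __repr__, like the codec in the report."""


class StrOnlyCodec:
    def __str__(self):
        return 'StrOnlyCodec()'


class ReprCodec:
    def __repr__(self):
        return 'ReprCodec()'


Field = namedtuple('Field', ['name', 'codec'])

# str() of a namedtuple falls back to its repr, which formats every field with %r.
plain = str(Field('features', PlainCodec()))
str_only = str(Field('features', StrOnlyCodec()))
with_repr = str(Field('features', ReprCodec()))

print(plain)      # default "<... object at 0x...>" form
print(str_only)   # still the default form, because __str__ is not consulted
print(with_repr)  # readable, because __repr__ is used

# object.__str__ delegates to __repr__, so defining __repr__ also fixes str(codec).
print(str(ReprCodec()))
```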

Example

Unischema(Hdf5Schema, [
  UnischemaField('features', float32, (1000,), <petastorm.codecs.NdarrayCodec object at 0x11f412b38>, False),
  UnischemaField('label', float32, (1,), <petastorm.codecs.NdarrayCodec object at 0x1044286a0>, False),
])
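
For comparison, here is a minimal sketch of what the schema string could look like once codecs have a readable string form. Everything in it is an assumption made for illustration: `SketchCodec`, `SketchField`, and `format_schema` are hypothetical stand-ins rather than petastorm's API, the `ClassName()` form is only one possible choice, and the dtype is passed as a plain string to keep the block free of dependencies.

```python
from collections import namedtuple


class SketchCodec:
    """Hypothetical stand-in for a codec such as NdarrayCodec."""

    def __str__(self):
        # Assumed format: the class name followed by empty parentheses.
        return f'{type(self).__name__}()'


# Hypothetical field type mirroring the five values shown in the example above.
SketchField = namedtuple(
    'SketchField', ['name', 'dtype_name', 'shape', 'codec', 'nullable'])


def format_schema(schema_name, fields):
    """Build a schema string recursively, calling str() on each codec."""
    lines = [f'Unischema({schema_name}, [']
    for f in fields:
        # Formatting {f.codec} in an f-string invokes SketchCodec.__str__.
        lines.append(
            f"  UnischemaField('{f.name}', {f.dtype_name}, {f.shape}, "
            f"{f.codec}, {f.nullable}),")
    lines.append('])')
    return '\n'.join(lines)


fields = [
    SketchField('features', 'float32', (1000,), SketchCodec(), False),
    SketchField('label', 'float32', (1,), SketchCodec(), False),
]
text = format_schema('Hdf5Schema', fields)
print(text)
```

With a string form like this, the memory addresses disappear from the schema output, and two schemas with the same fields print identically.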