Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0
1.76k
stars
281
forks
source link
make_batch_reader loses dtype with list-of-strings columns, causing Tensorflow error when lists contain a None value #765
The dtype is lost using make_batch_reader when a column is a list of strings. If a batch contains no None values, then Tensorflow is still able to infer the string type of the array. But if the batch contains any None values, then Tensorflow produces the following error:
Modify all TransformSpec funcs to replace list-of-string columns with any missing string values with None strings.
def is_string_list(column_type: pyarrow.DataType) -> bool:
return isinstance(column_type, pyarrow.ListType) and pyarrow.types.is_string(column_type.value_type)
fields_str_list = [f for f in table.schema.names if is_string_list(table.column(f).type)]
def transform_spec_with_workaround(rows_pd: pd.DataFrame) -> pd.DataFrame:
... # custom transformation
for f in fields_str_list:
rows_pd[f] = rows_pd[f].map(lambda a: np.ma.masked_values(a, None).filled('None'))
return rows_pd
Full Trace
---------------------------------------------------------------------------
InternalError Traceback (most recent call last)
<ipython-input-50-55cc7ab782db> in <module>
----> 1 dataset_itr.next()
~/conda/lmigpuv0_17_1/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py in next(self)
4693
4694 def next(self):
-> 4695 return self.__next__()
4696
4697
~/conda/lmigpuv0_17_1/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py in __next__(self)
4690 return numpy
4691
-> 4692 return nest.map_structure(to_numpy, next(self._iterator))
4693
4694 def next(self):
~/conda/lmigpuv0_17_1/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py in __next__(self)
759 def __next__(self):
760 try:
--> 761 return self._next_internal()
762 except errors.OutOfRangeError:
763 raise StopIteration
~/conda/lmigpuv0_17_1/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
745 self._iterator_resource,
746 output_types=self._flat_output_types,
--> 747 output_shapes=self._flat_output_shapes)
748
749 try:
~/conda/lmigpuv0_17_1/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py in iterator_get_next(iterator, output_types, output_shapes, name)
2726 return _result
2727 except _core._NotOkStatusException as e:
-> 2728 _ops.raise_from_not_ok_status(e, name)
2729 except _core._FallbackException:
2730 pass
~/conda/lmigpuv0_17_1/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
6939 message = e.message + (" name: " + name if name is not None else "")
6940 # pylint: disable=protected-access
-> 6941 six.raise_from(core._status_to_exception(e.code, message), None)
6942 # pylint: enable=protected-access
6943
~/conda/lmigpuv0_17_1/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
InternalError: Unsupported object type NoneType
[[{{node PyFunc}}]] [Op:IteratorGetNext]
The dtype is lost using
make_batch_reader
when a column is a list of strings. If a batch contains noNone
values, then Tensorflow is still able to infer the string type of the array. But if the batch contains anyNone
values, then Tensorflow produces the following error:Note that this similar but different than issue https://github.com/uber/petastorm/issues/744.
Example
Workaround
Modify all TransformSpec funcs to replace list-of-string columns with any missing string values with None strings.
Full Trace