I have a simple parquet file with the schema:

When I try to read it in with the hello world example, I get many instances of the following error:

from several workers. What could be the root cause here?

Here's the traceback using a dummy pool:
Are you using pyarrow 0.13? If so, there is #349, which fixes what I suspect is the same issue. Can you try patching in that PR and see if it helps?
I do have pyarrow 0.13. With a pull from master (which has that patched in), my error becomes:
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm-0.7.1-py3.7.egg/petastorm/arrow_reader_worker.py", line 62, in read_next
result_dict[column.name] = np.vstack(list_of_lists.tolist())
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy/core/shape_base.py", line 283, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "petastorm_genome.py", line 19, in <module>
python_hello_world()
File "petastorm_genome.py", line 12, in python_hello_world
for schema_view in reader:
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm-0.7.1-py3.7.egg/petastorm/reader.py", line 648, in __next__
return self._results_queue_reader.read_next(self._workers_pool, self.schema, self.ngram)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm-0.7.1-py3.7.egg/petastorm/arrow_reader_worker.py", line 67, in read_next
', '.join({value.shape[0] for value in list_of_lists}))
TypeError: sequence item 0: expected str instance, int found
Good. So it was the arrow 0.13 issue. We should probably emit a better error message in this case. What happens is that we try to batch samples from multiple rows to produce a matrix; when the lists are of different lengths, this is naturally not possible.
This approach worked fine for the original use case, where all lists were guaranteed to be of the same length and the data had to be consumed by TensorFlow (a batch of variable-size lists is not compatible with TF tensor data types), but it is probably a poor design choice in general.
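To illustrate the failure mode with a minimal, made-up example (not the reporter's data): np.vstack can only stack rows whose lists share a length, which is the ValueError in the first half of the traceback. The TypeError in the second half is a secondary bug in the error-message code itself, which joins the integer lengths without first converting them to strings.

```python
import numpy as np

# Lists of equal length stack into a matrix without trouble:
print(np.vstack([[1, 2, 3], [4, 5, 6]]).shape)  # (2, 3)

# Lists of different lengths cannot be stacked; this raises the same
# ValueError as the first half of the traceback above:
np.vstack([[1, 2, 3], [4, 5]])
# ValueError: all the input array dimensions except for the concatenation
# axis must match exactly
```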
Can you please provide a bit more info about your use case? How do you plan to consume the data? Are you working with TensorFlow/PyTorch/something else?
As a possible temporary solution, you could use make_batch_reader(..., transform_spec=...) to preprocess your data early into a tensor-compatible shape; see the sketch below. This is obviously a workaround until we find a more appropriate way to address this.
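A minimal sketch of that workaround. The column name 'features', the target length, and the dataset URL are all hypothetical placeholders; adapt them to your schema:

```python
import numpy as np
from petastorm import make_batch_reader
from petastorm.transform import TransformSpec

FIXED_LEN = 128  # assumed target length; pick one that fits your data


def _pad_or_truncate(df):
    # Pad with zeros (or truncate) every list in the hypothetical
    # 'features' column so all rows end up with the same length.
    df['features'] = df['features'].map(
        lambda v: np.pad(np.asarray(v, dtype=np.float32)[:FIXED_LEN],
                         (0, max(0, FIXED_LEN - len(v))), mode='constant'))
    return df


# edit_fields tells petastorm the post-transform dtype and shape of the column.
transform = TransformSpec(
    _pad_or_truncate,
    edit_fields=[('features', np.float32, (FIXED_LEN,), False)])

with make_batch_reader('file:///tmp/my_dataset', transform_spec=transform) as reader:
    for batch in reader:
        ...  # every batch.features row now has length FIXED_LEN
```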
I'm looking to work with Keras using fit_generator and the TensorFlow backend. Is there a guide on the usage of transform_spec?
There are some examples in the documentation, e.g. https://petastorm.readthedocs.io/en/latest/readme_include.html?highlight=transform_spec
You can also look at the way it is used in the tests, e.g. https://github.com/uber/petastorm/blob/ccf738e6efdc90f9643bdb6e20e064c7ba924379/petastorm/tests/test_tf_utils.py#L318
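For the fit_generator question, here is one hedged sketch, reusing the padding transform from above. The column names 'features' and 'label' are hypothetical, and a compiled Keras model named `model` is assumed to exist:

```python
from petastorm import make_batch_reader


# Wrap a petastorm reader in a plain Python generator that Keras can consume.
# Batches from make_batch_reader are namedtuples of numpy arrays, so with the
# fixed-length transform above, batch.features stacks into a regular tensor.
def batch_generator(reader):
    for batch in reader:
        yield batch.features, batch.label


# num_epochs=None makes the reader cycle indefinitely, which matches what
# fit_generator expects from its generator argument.
with make_batch_reader('file:///tmp/my_dataset',
                       transform_spec=transform,
                       num_epochs=None) as reader:
    model.fit_generator(batch_generator(reader),
                        steps_per_epoch=100, epochs=5)
```

If you would rather go through tf.data, petastorm.tf_utils.make_petastorm_dataset can wrap the reader in a tf.data.Dataset instead.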
@htokgoz1, is there anything else I can help with in the context of this issue, or can we close it?
We can close this, thanks