uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can also be used from pure Python code.
Apache License 2.0

make_batch_reader with nullable field, TypeError: an integer is required #354

Closed: working-estimate closed this issue 5 years ago

working-estimate commented 5 years ago

I have a simple parquet file with the schema:

 root
 |-- id: string (nullable = true)
 |-- feat1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- feat2: string (nullable = true)
 |-- feat3: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- feat4: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- feat5: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- feat6: integer (nullable = true) 

When I try to read it in with the hello world example, I get many instances of the following error:

Worker 2 terminated: unexpected exception:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/workers_pool/thread_pool.py", line 62, in run
    self._worker_impl.process(*args, **kargs)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 138, in process
    lambda: self._load_rows(parquet_file, piece, shuffle_row_drop_partition))
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/cache.py", line 39, in get
    return fill_cache_func()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 138, in <lambda>
    lambda: self._load_rows(parquet_file, piece, shuffle_row_drop_partition))
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 152, in _load_rows
    result = self._read_with_shuffle_row_drop(piece, pq_file, column_names, shuffle_row_drop_range)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 224, in _read_with_shuffle_row_drop
    partitions=self._dataset.partitions
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyarrow/parquet.py", line 562, in read
    table = reader.read_row_group(self.row_group, **options)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyarrow/parquet.py", line 188, in read_row_group
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 696, in pyarrow._parquet.ParquetReader.read_row_group
TypeError: an integer is required

from several workers. What could be the root cause here?
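For reference, the read loop in petastorm_genome.py is essentially the hello-world example (a minimal sketch; the dataset path below is a placeholder):

from petastorm import make_batch_reader

def python_hello_world(dataset_url='file:///tmp/my_parquet_dataset'):
    # Open the existing Parquet dataset and iterate over row-group batches.
    with make_batch_reader(dataset_url) as reader:
        for schema_view in reader:
            print(schema_view)

if __name__ == '__main__':
    python_hello_world()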

Here's the traceback using a dummy pool:

Traceback (most recent call last):
  File "petastorm_genome.py", line 19, in <module>
    python_hello_world()
  File "petastorm_genome.py", line 12, in python_hello_world
    for schema_view in reader:
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/reader.py", line 622, in __next__
    return self._results_queue_reader.read_next(self._workers_pool, self.schema, self.ngram)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 42, in read_next
    result_table = workers_pool.get_results()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/workers_pool/dummy_pool.py", line 72, in get_results
    self._worker.process(*args, **kargs)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 138, in process
    lambda: self._load_rows(parquet_file, piece, shuffle_row_drop_partition))
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/cache.py", line 39, in get
    return fill_cache_func()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 138, in <lambda>
    lambda: self._load_rows(parquet_file, piece, shuffle_row_drop_partition))
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 152, in _load_rows
    result = self._read_with_shuffle_row_drop(piece, pq_file, column_names, shuffle_row_drop_range)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py", line 224, in _read_with_shuffle_row_drop
    partitions=self._dataset.partitions
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyarrow/parquet.py", line 562, in read
    table = reader.read_row_group(self.row_group, **options)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyarrow/parquet.py", line 188, in read_row_group
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 696, in pyarrow._parquet.ParquetReader.read_row_group
TypeError: an integer is required
selitvin commented 5 years ago

Are you using pyarrow 0.13? If so, there is #349, which fixes what I suspect is the same issue. Can you try to patch in that PR and see if it helps?

working-estimate commented 5 years ago

I do have pyarrow 0.13. With a pull from master (which has that patched in), my error becomes:

  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm-0.7.1-py3.7.egg/petastorm/arrow_reader_worker.py", line 62, in read_next
    result_dict[column.name] = np.vstack(list_of_lists.tolist())
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy/core/shape_base.py", line 283, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "petastorm_genome.py", line 19, in <module>
    python_hello_world()
  File "petastorm_genome.py", line 12, in python_hello_world
    for schema_view in reader:
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm-0.7.1-py3.7.egg/petastorm/reader.py", line 648, in __next__
    return self._results_queue_reader.read_next(self._workers_pool, self.schema, self.ngram)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/petastorm-0.7.1-py3.7.egg/petastorm/arrow_reader_worker.py", line 67, in read_next
    ', '.join({value.shape[0] for value in list_of_lists}))
TypeError: sequence item 0: expected str instance, int found
selitvin commented 5 years ago

Good. So it was the pyarrow 0.13 issue. We should probably give a better error message in this case. What happens is that we try to batch samples from multiple rows into a matrix; when the lists have different lengths, this is naturally not possible.
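For illustration, the same failure reproduces with plain numpy (a minimal sketch, not petastorm code):

import numpy as np

# Lists of equal length stack into a 2-D matrix:
np.vstack([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

# Lists of different lengths cannot be stacked, which is what the reader hits
# when batching a variable-length array column:
np.vstack([[1, 2, 3], [4, 5]])      # ValueError: all the input array dimensions
                                    # except for the concatenation axis must match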

This approach worked fine for the original use case, where all lists were guaranteed to be of the same length and the data had to be consumed from Tensorflow (a batch of variable-size lists is not compatible with TF tensor data types), but it is probably a poor design choice in general.

Can you please provide a bit more info about your use case? How do you plan to consume the data? Are you working with Tensorflow/Pytorch/other?

As a possible temporary solution, you could use make_batch_reader(..., transform_spec=...) to preprocess your data early on into a tensor-compatible shape. This is obviously a workaround until we find a more appropriate way to address this.
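For example, here is a rough sketch of padding a variable-length list column to a fixed length with a TransformSpec; the column name feat1, the padding value, and MAX_LEN are placeholders, and edit_fields declares the new fixed shape so the schema matches the transformed data:

import numpy as np
from petastorm import make_batch_reader
from petastorm.transform import TransformSpec

MAX_LEN = 16  # placeholder: the fixed length you want to pad/truncate to

def _pad_lists(df):
    # Pad (or truncate) each list in `feat1` to exactly MAX_LEN elements so
    # that rows can be stacked into a single matrix per batch. Nullable rows
    # are treated as empty lists.
    df['feat1'] = df['feat1'].map(
        lambda xs: np.array(((list(xs) if xs is not None else []) + [''] * MAX_LEN)[:MAX_LEN]))
    return df

transform = TransformSpec(
    _pad_lists,
    # (name, numpy dtype, shape, nullable) for fields whose shape/type changed
    edit_fields=[('feat1', np.str_, (MAX_LEN,), False)])

with make_batch_reader('file:///path/to/dataset', transform_spec=transform) as reader:
    for batch in reader:
        ...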

working-estimate commented 5 years ago

I'm looking to work with Keras using fit_generator and the TensorFlow backend. Is there a guide on the usage of transform_spec?

selitvin commented 5 years ago

There are some examples in the documentation, e.g.: https://petastorm.readthedocs.io/en/latest/readme_include.html?highlight=transform_spec

You can also look at the way it is being used in the tests, e.g.: https://github.com/uber/petastorm/blob/ccf738e6efdc90f9643bdb6e20e064c7ba924379/petastorm/tests/test_tf_utils.py#L318
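If the end goal is Keras on the TensorFlow backend, the reader can also be wrapped into a tf.data.Dataset via petastorm.tf_utils.make_petastorm_dataset. A rough sketch, where the model and column names are placeholders and real feature handling depends on how the list columns are padded by your transform_spec:

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Placeholder model: a single dense layer over one numeric feature.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer='adam', loss='mse')

with make_batch_reader('file:///path/to/dataset') as reader:
    dataset = make_petastorm_dataset(reader)
    # Each element is a named tuple of columns (already batched per row group);
    # map it into the (features, label) pairs Keras expects.
    dataset = dataset.map(
        lambda row: (tf.reshape(tf.cast(row.feat6, tf.float32), (-1, 1)),
                     tf.reshape(tf.cast(row.feat6, tf.float32), (-1, 1))))
    model.fit(dataset, epochs=1, steps_per_epoch=10)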

selitvin commented 5 years ago

@htokgoz1, is there anything else I can help with in the context of this issue, or can we close it?

working-estimate commented 5 years ago

We can close this, thanks.