uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Pandas 1.0.0 compatibility #479

Closed: abditag2 closed this issue 4 years ago

abditag2 commented 4 years ago

A new pandas release (1.0.0) came out yesterday. When using make_batch_reader with the PyTorch DataLoader, we get this error:

  File "/usr/local/lib/python3.6/dist-packages/horovod/spark/torch/remote.py", line 274, in _train
    row = next(train_loader_iter)
  File "/usr/local/lib/python3.6/dist-packages/petastorm/pytorch.py", line 152, in __iter__
    for row in self.reader:
  File "/usr/local/lib/python3.6/dist-packages/petastorm/reader.py", line 610, in __next__
    return self._results_queue_reader.read_next(self._workers_pool, self.schema, self.ngram)
  File "/usr/local/lib/python3.6/dist-packages/petastorm/arrow_reader_worker.py", line 60, in read_next
    column_as_numpy = column_as_pandas.as_matrix()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5273, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'as_matrix'

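The root cause is that pandas.Series no longer has the as_matrix method in pandas 1.0.0. as_matrix was deprecated in pandas 0.23.0 in favor of .values and has now been removed, so the call in arrow_reader_worker.py fails.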
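
One possible workaround, sketched here as an illustration and not as the fix petastorm actually shipped: replace the as_matrix() call with a conversion that works on both older and newer pandas. The helper name series_to_numpy is hypothetical; Series.to_numpy() (available since pandas 0.24) and Series.values are existing pandas APIs.

```python
import numpy as np
import pandas as pd


def series_to_numpy(column_as_pandas):
    """Convert a pandas Series to a NumPy array without using as_matrix().

    Hypothetical helper for illustration only; it is not part of petastorm.
    """
    # Series.to_numpy() was added in pandas 0.24 and is the recommended call.
    if hasattr(column_as_pandas, "to_numpy"):
        return column_as_pandas.to_numpy()
    # On older pandas, fall back to .values, which also returns a NumPy array.
    return column_as_pandas.values


# Example: the same kind of conversion the failing line in the traceback attempts.
column_as_pandas = pd.Series([1.0, 2.0, 3.0])
column_as_numpy = series_to_numpy(column_as_pandas)
```

Until a patched petastorm release is available, pinning pandas below 1.0.0 in the training environment should also avoid the error, since as_matrix still exists there (with a deprecation warning).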