uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Pytorch DataLoader with array of structs #620

Open ramondalmau opened 3 years ago

ramondalmau commented 3 years ago

Dear all

I have a large dataset (df_train) in parquet format, where each row has several columns (integers, floats, etc.), one of which is an array of structs. The schema looks as follows:

ifplid:float
...
root
 |-- ifplid: string (nullable = true)
....
 |-- sequence: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- atfmdelay: integer (nullable = true)
 |    |    |-- taxitime: integer (nullable = true)
               .....

I am trying to generate a DataLoader to train a recurrent neural network (RNN) with PyTorch. Each example of the batch should correspond to one row of the dataset, including some 'static' features and the sequence.

Unfortunately, I have not seen any example of a similar problem. My attempt is:

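from petastorm.spark import make_spark_converter

# df_train is the Spark DataFrame with the schema shown above
converter_train = make_spark_converter(df_train)

with converter_train.make_torch_dataloader(batch_size=32) as train_dataloader:
    for d in train_dataloader:
        ...  # loop body elided in the original post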

However, petastorm drops the sequence column (the array of structs) and keeps only the other columns, with this warning:

Ignoring unsupported structure ListType(list<element: struct<atfmdelay: int32, taxitime: int32> not null>) for field 'sequence'
  % (field_type, column_name))

Any hint on how to accomplish my objective?

Many thanks in advance, Ramon

selitvin commented 3 years ago

ARROW-1644 mentioned in the message seems to be resolved. Good chance it's just a guard in our code. Will take a look.

ramondalmau commented 3 years ago

Dear @selitvin, many thanks for your reply. I just realised that the problem goes away if I change the schema to:

root
 |-- ifplid: string (nullable = true)
 |-- collect_list(atfmdelay): array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- collect_list(taxitime): array (nullable = true)
 |    |-- element: integer (containsNull = false)

That is, when I use two arrays of integers instead of one array of structs, petastorm does NOT ignore the fields anymore. Any explanation? :)
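The reshaping described above can be illustrated on a single row in plain Python. This is only a sketch: the helper name `split_struct_array`, the dict-based row layout, and the sample values are made up for illustration and are not part of petastorm or Spark. In Spark itself, selecting a struct field through the array column (for example `sequence.atfmdelay`) should give the corresponding array of integers, but that path is not exercised here.

```python
# Minimal sketch: turn one row's array of structs into parallel arrays.
# The row layout and helper name are assumptions for illustration only.

def split_struct_array(row, array_field, struct_fields):
    """Return a copy of `row` in which `array_field` (a list of dicts)
    is replaced by one flat list per struct field."""
    # Keep every column except the array of structs.
    out = {key: value for key, value in row.items() if key != array_field}
    # Treat a missing or null array as an empty sequence.
    elements = row.get(array_field) or []
    for name in struct_fields:
        out[name] = [element[name] for element in elements]
    return out


# Made-up sample row shaped like the original schema.
row = {
    "ifplid": "AA00000001",
    "sequence": [
        {"atfmdelay": 0, "taxitime": 12},
        {"atfmdelay": 5, "taxitime": 9},
    ],
}

flat = split_struct_array(row, "sequence", ["atfmdelay", "taxitime"])
print(flat)
```

The element order is preserved per field, so position i in each flat list still refers to the same step of the sequence, which is what an RNN input needs.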

selitvin commented 3 years ago

This works around a check in petastorm that guarded against an old gap in functionality (mixtures of lists of structs and structs of lists). Does it solve your issue, or do you still need support for the structure you originally asked about?