Open ramondalmau opened 3 years ago
ARROW-1644, mentioned in the message, seems to be resolved. There's a good chance it's just a guard in our code. I will take a look.
Dear @selitvin, many thanks for your reply. I just realised that by changing the schema to:
root
|-- ifplid: string (nullable = true)
|-- collect_list(atfmdelay): array (nullable = true)
| |-- element: integer (containsNull = false)
|-- collect_list(taxitime): array (nullable = true)
| |-- element: integer (containsNull = false)
That is, when I use two arrays of integers instead of an array of structs, petastorm no longer ignores the fields. Any explanation? :)
This sidesteps the petastorm check that was guarding against the old lack of support for this mixture of list-of-structs and struct-of-lists. Does it solve your issue, or do you still need support for the structure you originally asked about?
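For readers following along, the reshaping discussed here (one array of structs turned into one flat array per field) can be sketched in plain Python. This is only an illustration of the idea on a single row held as a dict, not petastorm or Spark code. The column name `sequence`, the sample values, and the helper `split_struct_array` are hypothetical; only `ifplid`, `atfmdelay` and `taxitime` come from the schema above.

```python
# Hypothetical row: the column name "sequence" and all values are made up
# for illustration; only the field names come from the schema in the thread.
row = {
    "ifplid": "AA0001",
    "sequence": [
        {"atfmdelay": 0, "taxitime": 12},
        {"atfmdelay": 5, "taxitime": 9},
    ],
}


def split_struct_array(row, array_field, struct_fields):
    """Replace one array-of-structs column with one flat array per struct field."""
    # Keep every column except the array of structs.
    flat = {key: value for key, value in row.items() if key != array_field}
    # Build one parallel list per struct field, preserving element order.
    for name in struct_fields:
        flat[name] = [item[name] for item in row[array_field]]
    return flat


flat_row = split_struct_array(row, "sequence", ["atfmdelay", "taxitime"])
print(flat_row)
```

The resulting row has only scalar and array-of-integer columns, which is the shape that was reported to work in this thread.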
Dear all,
I have a large dataset (df_train) in parquet format, where each row has several columns (integers, floats, etc.), one of which is an array of structs. The schema looks as follows:
I am trying to generate a DataLoader to train a recurrent neural network (RNN) with PyTorch. Each example of the batch should correspond to one row of the dataset, including some 'static' features and the sequence.
Unfortunately, I have not seen any example of a similar problem. My attempt is:
However, petastorm ignores the sequence of structs and keeps only the columns that are not part of the sequence.
Any hint on how to accomplish my objective?
Many thanks in advance,
Ramon
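As a rough sketch of the batching step described in this question: once each row carries its sequence as parallel integer arrays, variable-length sequences still need padding to a common length before they can be stacked into a batch for an RNN. The helper below does this with plain Python lists. The function name `pad_batch`, the `lengths` key, the padding value, and the sample rows are all assumptions for illustration, not part of petastorm or PyTorch.

```python
def pad_batch(rows, seq_fields, pad_value=0):
    """Pad the sequence fields of each row to the longest sequence in the batch."""
    # True lengths are kept so that padded steps can be masked or packed later.
    lengths = [len(row[seq_fields[0]]) for row in rows]
    max_len = max(lengths)
    batch = {"lengths": lengths}
    for name in seq_fields:
        batch[name] = [
            list(row[name]) + [pad_value] * (max_len - len(row[name]))
            for row in rows
        ]
    return batch


# Hypothetical rows with sequences of different lengths.
rows = [
    {"ifplid": "AA0001", "atfmdelay": [0, 5, 3], "taxitime": [12, 9, 7]},
    {"ifplid": "AA0002", "atfmdelay": [2], "taxitime": [15]},
]

batch = pad_batch(rows, ["atfmdelay", "taxitime"])
print(batch)
```

The padded lists are rectangular, so they could then be converted to tensors, with `lengths` available for masking the padded positions.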