Open · mossadhelali opened this issue 3 years ago
@selitvin I can confirm the PyArrow fix, and this should be a straightforward patch to remove the previous workaround, subject to bumping the minimum version of the associated dependency to `pyarrow>=2.0.0`. Is there any reason that wouldn't work for consumers (i.e., any known use cases that require earlier versions of PyArrow)?
@mossadhelali want to submit a PR, as described above?
@dmcguire81 Thanks for your reply and confirmation. I removed the PyArrow workaround and bumped the dependency to `pyarrow>=2.0.0`, but now I get the following error in `arrow_reader_worker.py`:

Length of all values in column 'col1' are expected to be the same length. Got the following set of lengths: '1000, 1000, 563, 1000, 1000, 945, 1000, 945'

It seems that lists of lists are assumed to have the same length across all samples. In my case, `col1` in the parquet file has the shape `(num_training_samples, None, 32)`.
Can you please comment on whether the fix is still simple enough for me to implement?
My guess is that the fix that actually helps you make progress is now more complex, given the extra constraints. The original goal of this issue ("Support for parquet files with nested structures") stays simple in the general case, though, whether or not that's useful for your problem.
I'm not familiar enough with `make_torch_dataloader` to venture a guess, but at least the reported dimension of `(num_training_samples, None, 32)` for `col1` coming out of the code and the dimension you thought you had in the description (`(None, 32)`) are mismatched. Is it possible that you meant to have a Parquet row group with `num_training_samples` rows of `(None, 32)`-dimensioned matrices, instead?
Thank you @mossadhelali for picking it up and @dmcguire81 for your help.
There is an assumption in the current collate implementation that all lists have the same length (the code @mossadhelali pointed to).
One way to work around it is to implement a TransformSpec that would align the lengths. That approach is geared towards the TF constraint of working only with tensors; for PyTorch users it's probably too constraining. Perhaps a better solution is to introduce a user-defined collate function that would relax this constraint?
I cannot work on this in the coming weeks, but if @mossadhelali can propose a PR with a solution, we could work together to get it in.
I wonder if we can let the user specify a collate function implementation that relaxes this constraint.
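For illustration, a user-defined collate function along these lines could pad the variable-length lists and return a mask. This is only a sketch with plain Python lists, not petastorm API; the function name, `pad_value` parameter, and batch layout are all assumptions:

```python
def pad_collate(batch, pad_value=0):
    """Collate a batch of variable-length lists by padding to the longest.

    Returns the padded rows plus a 0/1 mask marking real vs. padded entries.
    """
    max_len = max(len(row) for row in batch)
    padded = [row + [pad_value] * (max_len - len(row)) for row in batch]
    mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in batch]
    return padded, mask
```

In a real integration this would presumably be invoked per batch by the dataloader, relaxing the equal-length assumption without touching the on-disk schema.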
Any updates on this issue? Will this workaround be removed in the next version?
This issue is currently blocking a project I'm working on. Has anyone created a PR to address it? @mossadhelali, how did you address the original issue?
@baumanab I ended up padding the vectors to make them the same length and created a mask that is multiplied by the model outputs, i.e. similar to what is done in sequence models in NLP.
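The padding-and-mask idea described above could look like the following sketch. It uses plain Python lists instead of torch tensors, and the function name and argument layout are illustrative, not the commenter's actual code:

```python
def mask_outputs(outputs, lengths):
    """Zero out model outputs at padded positions, as in NLP sequence models.

    outputs: batch of equal-length (already padded) rows of floats.
    lengths: the true length of each row before padding.
    """
    masked = []
    for row, n in zip(outputs, lengths):
        mask = [1.0] * n + [0.0] * (len(row) - n)
        masked.append([o * m for o, m in zip(row, mask)])
    return masked
```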
@selitvin Unfortunately, I don't have the time currently to work on a fix.
Re-reading, it looks straightforward to address in setup.py and the unischema logic. @mossadhelali, please let me know if you plan to create a PR; otherwise I'll probably create one over the next week.
@baumanab did you end up creating a PR for this? I am also running into the same issue.
I started working on it, but put it down due to other time demands. Are you interested in collaborating on a fix? I can point you to where I am.
I found a workaround, so this is no longer a pressing issue for me. However, I am interested in collaborating on a fix since that seems like a good opportunity to dive a bit deeper into petastorm.
That’s great to hear. I started on a fix for the PR. I’ll share what I have. What was your workaround?
Sorry for the late reply. My workaround is to have multiple columns with flat arrays instead of a single column with a nested array. Then, in the loop over the dataloader I stack the flat arrays from the columns. Please let me know what I can do to help.
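The multiple-flat-columns workaround might be sketched like this; the column names and the batch dict layout are hypothetical, not the commenter's actual code:

```python
def stack_columns(batch, column_names):
    """Rebuild a nested column from several flat columns.

    batch: dict mapping column name -> list of flat arrays, one per sample.
    Returns, for each sample, the flat arrays stacked back into one list.
    """
    n_samples = len(batch[column_names[0]])
    return [[batch[c][i] for c in column_names] for i in range(n_samples)]
```

In the dataloader loop, each sample's flat arrays would then be stacked into the nested shape the model expects.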
@baumanab Where is your fix up to in terms of progress?
My first attempt is directed at surfacing the ignore flag to the top layer and updating pyarrow to a version that handles nested structures. I was pretty close but had to put it down. I'll work over the next week to complete it. I could definitely use some help with review and test cases.
No problem, sorry for mine as well. I could use help with review and test cases. I plan to have something testable no later than 1 week from today. Great workaround btw.
Sounds good. Just ping me when you have something and I can try to help.
Are there any news on supporting nested structures?
I have not heard anything since my last post.
I am trying to use Petastorm to pass a PySpark dataframe (read from a parquet file) to PyTorch.
In my case, the dataframe consists of 3 columns:
When making the Spark converter using `make_spark_converter()` and `make_torch_dataloader()`, columns 1 and 2 are ignored and I get the following warning:

UserWarning: [ARROW-1644] Ignoring unsupported structure ListType(list&lt;element: list&lt;element: int32&gt;&gt;)
As I understand it, discarding nested structures was added to Petastorm as a workaround for a pyarrow bug with parquet files containing nested structures. However, that pyarrow bug has been fixed as of arrow version 2.0.0 (see the Python notes HERE), and my parquet file can be read with pyarrow after upgrading to the latest version.
I would appreciate it if such parquet files could be supported by Petastorm.
Best, Mossad