Open · mossadhelali opened this issue 3 years ago
@selitvin I can confirm the PyArrow fix, and this should be a straightforward patch to remove the previous workaround, subject to bumping the minimum version of the associated dependency to `pyarrow>=2.0.0`. Is there any reason that wouldn't work for consumers (i.e., any known use cases that require earlier versions of PyArrow)?
@mossadhelali want to submit a PR, as described above?
@dmcguire81 Thanks for your reply and confirmation. I removed the PyArrow workaround and bumped the dependency to `pyarrow>=2.0.0`, but now I get the following error in `arrow_reader_worker.py`:

Length of all values in column 'col1' are expected to be the same length. Got the following set of lengths: '1000, 1000, 563, 1000, 1000, 945, 1000, 945'

It seems that lists of lists are assumed to have the same length across all samples. In my case, `col1` in the parquet file has the shape `(num_training_samples, None, 32)`.
Can you please comment on whether the fix is still simple enough for me to implement?
My guess is that the fix that actually helps you make progress is now more complex, given the extra constraints. The original goal of this issue ("Support for parquet files with nested structures") stays simple in the general case, though, whether or not that's useful for your problem.
I'm not familiar enough with `make_torch_dataloader` to venture a guess, but at least the reported dimension of `(num_training_samples, None, 32)` for `col1` coming out of the code and the dimension you thought you had in the description (`(None, 32)`) are mismatched. Is it possible that you meant to have a Parquet row group with `num_training_samples` rows of `(None, 32)`-dimensioned matrices, instead?
Thank you @mossadhelali for picking it up and @dmcguire81 for your help.
There is an assumption in the current collate implementation that all lists have the same length (the code @mossadhelali pointed to).
One way to work around it is to implement a TransformSpec that would align the lengths. That approach is geared towards the TF constraint of working only with tensors; for PyTorch users it's probably too constraining. Perhaps a better solution is to introduce a user-defined collate function that would relax this constraint?
I cannot work on this in the coming weeks, but if @mossadhelali can propose a PR with a solution, we could work together to get it in.
I wonder if we can let the user specify a collate function implementation that relaxes this constraint.
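For illustration, a user-defined collate function along these lines could pad the variable-length lists and return a mask. This is only a sketch with plain Python lists, not petastorm API; the function name, `pad_value` parameter, and batch layout are all assumptions:

```python
def pad_collate(batch, pad_value=0):
    """Collate a batch of variable-length lists by padding to the longest.

    Returns the padded rows plus a 0/1 mask marking real vs. padded entries.
    """
    max_len = max(len(row) for row in batch)
    padded = [row + [pad_value] * (max_len - len(row)) for row in batch]
    mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in batch]
    return padded, mask
```

In a real integration this would presumably be invoked per batch by the dataloader, relaxing the equal-length assumption without touching the on-disk schema.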
Any updates on this issue? Will this workaround be removed in the next version?
This issue is currently blocking a project I'm working on. Has anyone created a PR to address it? @mossadhelali, how did you address the original issue?
@baumanab I ended up padding the vectors to make them the same length and created a mask that is multiplied by the model outputs, i.e. similar to what is done in sequence models in NLP.
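The padding-and-mask idea described above could look like the following sketch. It uses plain Python lists instead of torch tensors, and the function name and argument layout are illustrative, not the commenter's actual code:

```python
def mask_outputs(outputs, lengths):
    """Zero out model outputs at padded positions, as in NLP sequence models.

    outputs: batch of equal-length (already padded) rows of floats.
    lengths: the true length of each row before padding.
    """
    masked = []
    for row, n in zip(outputs, lengths):
        mask = [1.0] * n + [0.0] * (len(row) - n)
        masked.append([o * m for o, m in zip(row, mask)])
    return masked
```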
@selitvin Unfortunately, I don't have the time currently to work on a fix.
Re-reading, it looks straightforward to address in setup.py and the unischema logic. @mossadhelali, please let me know if you plan to create a PR; otherwise I'll probably create one over the next week.
@baumanab did you end up creating a PR for this? I am also running into the same issue.
I started working on it, but put it down due to other time demands. Are you interested in collaborating on a fix? I can point you to where I am.
I found a workaround, so this is no longer a pressing issue for me. However, I am interested in collaborating on a fix since that seems like a good opportunity to dive a bit deeper into petastorm.
That’s great to hear. I started on a fix for the PR. I’ll share what I have. What was your workaround?
Sorry for the late reply. My workaround is to have multiple columns with flat arrays instead of a single column with a nested array. Then, in the loop over the dataloader I stack the flat arrays from the columns. Please let me know what I can do to help.
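The multiple-flat-columns workaround might be sketched like this; the column names and the batch dict layout are hypothetical, not the commenter's actual code:

```python
def stack_columns(batch, column_names):
    """Rebuild a nested column from several flat columns.

    batch: dict mapping column name -> list of flat arrays, one per sample.
    Returns, for each sample, the flat arrays stacked back into one list.
    """
    n_samples = len(batch[column_names[0]])
    return [[batch[c][i] for c in column_names] for i in range(n_samples)]
```

In the dataloader loop, each sample's flat arrays would then be stacked into the nested shape the model expects.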
@baumanab Where is your fix up to in terms of progress?
My first attempt is directed at surfacing the ignore flag to the top layer and updating pyarrow to a version that handles nested structures. I was pretty close but had to put it down. I'll work over the next week to complete it. I could definitely use some help with review and test cases.
No problem, sorry for mine as well. I could use help with review and test cases. I plan to have something testable no later than 1 week from today. Great workaround btw.
Sounds good. Just ping me when you have something and I can try to help.
Are there any news on supporting nested structures?
I have not heard anything since my last post.
I am trying to use Petastorm to pass a PySpark dataframe (read from a parquet file) to PyTorch.
In my case, the dataframe consists of 3 columns:
When making the Spark converter using `make_spark_converter()` and `make_torch_dataloader()`, columns 1 and 2 are ignored and I get the following warning:

UserWarning: [ARROW-1644] Ignoring unsupported structure ListType(list&lt;element: list&lt;element: int32&gt;&gt;)
As I understand it, discarding nested structures was added to Petastorm as a workaround for a pyarrow bug with parquet files containing nested structures. However, that pyarrow bug has been fixed as of arrow version 2.0.0 (see the Python notes HERE), and my parquet file can be read with pyarrow after upgrading to the latest version.
I would appreciate it if such parquet files could be supported by Petastorm.
Best, Mossad