Closed srowen closed 5 years ago
This does look like a bug. There was a similar issue that was fixed in 0.7.2; can you please confirm you are working with the newest version (0.7.5)? Also, which pyarrow version are you using?
Each batch corresponds to a single row group in a Parquet file. I wonder, what do these 'extra' label entries look like?
Thanks for your reply -- yes, I'm using 0.7.5. The pyarrow version is 0.12.1.
The contents of the 'image' column are binary arrays and seem intact. The 'label' values look like:
array([ 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 6, 6, 6, 7,
7, 8, 8, 8, 8, 8, 8, 9, 9, 10, 10, 11, 11, 11, 11, 11, 11,
11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 14, 14, 15, 15, 15,
15, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 18, 18, 18, 19, 19, 19,
19, 21, 21, 21, 21, 22, 22, 23, 23, 23, 23, 24, 25, 25, 25, 25, 26,
26, 26, 27, 28, 29, 29, 29, 29, 30, 30, 30, 32, 32, 32, 33],
dtype=int32)
... which are, as expected, values between 1 and 257. (The source data is just the Caltech 256 image data set.) It's just the count that's off.
I'm also setting the Parquet row-group size to 1000000 in Spark when saving, to generate somewhat smaller row groups.
I can probably reproduce it on one of the Parquet files and upload it somewhere if that would help analyze it. Thank you for any ideas you may have here.
Yes, an example Parquet file would help. I tried reproducing it here: https://github.com/selitvin/petastorm/commit/de087f6519b9fd4953db25ae4a8a06ca253cbb9b (on this branch: https://github.com/selitvin/petastorm/commits/repro_399) but was not able to observe the issue you are reporting.
I tried reproducing this after regenerating many variations of the dataset, and I can't. I'd chalk it up to some problem with how the files were written, although I didn't change anything in how they are written. Not sure what happened there, but I can't reproduce it anymore. I'll close this for now and reopen if I have a concrete reproduction that isn't the whole data set.
I feel like I'm missing something basic here, so I apologize, but: I have a Parquet file (written by Spark) that simply contains a binary-valued column "image" and an int-valued column "label". If I run...
Then I get output like:
That is, each element from the Reader seems to have a different number of image and label values within the same read. I get 62 images with 100 labels. I'm pretty puzzled, as of course each row is 1 image and 1 label.
I understand getting a different number of pairs per read; that's no problem. But mismatched counts within a single read are, and of course they make any training on this output fail.
Did I miss something fundamental about how to use this? Thank you for any pointers!
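Whatever the reader yields, the mismatch described above can be detected mechanically by comparing field lengths per batch. This is a plain-Python sketch, not petastorm code: `Batch`, `check_batch`, and the sample data are hypothetical stand-ins for the namedtuple-like batches a reader produces.

```python
from collections import namedtuple

# Hypothetical stand-in for a reader batch: one sequence per column,
# with field names matching the schema in the report.
Batch = namedtuple("Batch", ["image", "label"])

def check_batch(batch):
    """Return a {field: length} map and whether all lengths agree."""
    lengths = {f: len(getattr(batch, f)) for f in batch._fields}
    return lengths, len(set(lengths.values())) == 1

# A well-formed batch: one label per image.
good = Batch(image=[b"\x00"] * 3, label=[1, 2, 3])
# A batch exhibiting the reported mismatch: 62 images, 100 labels.
bad = Batch(image=[b"\x00"] * 62, label=list(range(100)))

print(check_batch(good)[1])  # True
print(check_batch(bad)[0])   # {'image': 62, 'label': 100}
```

Running such a check over every batch before training would pinpoint which reads (and, given the batch-per-row-group mapping, which row groups) carry the mismatched counts.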