Closed srowen closed 5 years ago
This does look like a bug. There was a similar issue that was fixed in 0.7.2; can you please confirm you are working with the newest version (0.7.5)? Also, which pyarrow version are you using?
Each batch corresponds to a single row group in a Parquet file. I wonder, what do these 'extra' label entries look like?
Thanks for your reply -- yes, I'm using 0.7.5. The pyarrow version is 0.12.1.
The contents of the 'image' column are binary arrays and seem intact. The 'label' values look like:
array([ 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 6, 6, 6, 7,
7, 8, 8, 8, 8, 8, 8, 9, 9, 10, 10, 11, 11, 11, 11, 11, 11,
11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 14, 14, 15, 15, 15,
15, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 18, 18, 18, 19, 19, 19,
19, 21, 21, 21, 21, 22, 22, 23, 23, 23, 23, 24, 25, 25, 25, 25, 26,
26, 26, 27, 28, 29, 29, 29, 29, 30, 30, 30, 32, 32, 32, 33],
dtype=int32)
... which are, as expected, values between 1 and 257. (The source data is just the Caltech 256 image data set.) It's just the count that's off.
I'm also setting the Parquet row-group size to 1000000 in Spark when saving, to generate somewhat smaller row groups.
I can probably reproduce it on one of the Parquet files and upload it somewhere if that would help analyze it. Thank you for any ideas you may have here.
Yes, an example Parquet file would help. I tried reproducing it here: https://github.com/selitvin/petastorm/commit/de087f6519b9fd4953db25ae4a8a06ca253cbb9b (on this branch: https://github.com/selitvin/petastorm/commits/repro_399) but was not able to observe the issue you are reporting.
I tried reproducing this after regenerating many variations of the dataset, and I can't. I'd chalk it up to some problem with how the files were written, although I didn't change anything in how they are written. Not sure what happened there, but I can't reproduce it anymore. I'll close this for now and reopen if I have a concrete reproduction that isn't the whole data set.
I feel like I'm missing something basic here, so I apologize, but: I have a Parquet file (written by Spark) that simply contains a binary-valued column "image" and an int-valued column "label". If I run...
Then I get output like:
That is, each element from the Reader seems to have a different number of image and label values within the same read. I get 62 images with 100 labels. I'm pretty puzzled, as of course each row is 1 image and 1 label.
I understand getting a different number of pairs per read; that's no problem. But mismatched counts within a single read are, and of course they make any training on this output fail.
Did I miss something fundamental about how to use this? Thank you for any pointers!
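Whatever the reader yields, the mismatch described above can be detected mechanically by comparing field lengths per batch. This is a plain-Python sketch, not petastorm code: `Batch`, `check_batch`, and the sample data are hypothetical stand-ins for the namedtuple-like batches a reader produces.

```python
from collections import namedtuple

# Hypothetical stand-in for a reader batch: one sequence per column,
# with field names matching the schema in the report.
Batch = namedtuple("Batch", ["image", "label"])

def check_batch(batch):
    """Return a {field: length} map and whether all lengths agree."""
    lengths = {f: len(getattr(batch, f)) for f in batch._fields}
    return lengths, len(set(lengths.values())) == 1

# A well-formed batch: one label per image.
good = Batch(image=[b"\x00"] * 3, label=[1, 2, 3])
# A batch exhibiting the reported mismatch: 62 images, 100 labels.
bad = Batch(image=[b"\x00"] * 62, label=list(range(100)))

print(check_batch(good)[1])  # True
print(check_batch(bad)[0])   # {'image': 62, 'label': 100}
```

Running such a check over every batch before training would pinpoint which reads (and, given the batch-per-row-group mapping, which row groups) carry the mismatched counts.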