jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

Report corrupt TIFF files, filter load_data where images are actually missing #76

Open hanslovsky opened 10 months ago

hanslovsky commented 10 months ago

I found a few corrupt TIFF files in the JUMP production dataset. So far, I have only seen corrupt TIFF files in sources 1 and 7 (4 files each) and in source 3 (1 file). I will report back any additional corrupt TIFF files that I find during my download/conversion.

Here is what I have so far:

s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r03c04f01p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c18f02p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c19f02p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c37f04p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_3/images/CP60/images/BR5876c3__2022-04-29T20_47_20-Measurement 1/Images/r11c22f08p01-ch3sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif

How to confirm that these files are corrupt:

$ urls=(
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r03c04f01p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c18f02p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c19f02p01-ch1sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c37f04p01-ch4sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_3/images/CP60/images/BR5876c3__2022-04-29T20_47_20-Measurement\ 1/Images/r11c22f08p01-ch3sk1fk1fl1.tiff
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif
)

$ for url in "${urls[@]}"; do aws s3 --no-sign-request cp "$url" .; done
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r03c04f01p01-ch1sk1fk1fl1.tiff to ./r03c04f01p01-ch1sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c18f02p01-ch4sk1fk1fl1.tiff to ./r04c18f02p01-ch4sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c19f02p01-ch1sk1fk1fl1.tiff to ./r04c19f02p01-ch1sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch3_20221010/images/UL000087__2022-10-11T10_57_24-Measurement1/Images/r04c37f04p01-ch4sk1fk1fl1.tiff to ./r04c37f04p01-ch4sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_3/images/CP60/images/BR5876c3__2022-04-29T20_47_20-Measurement 1/Images/r11c22f08p01-ch3sk1fk1fl1.tiff to ./r11c22f08p01-ch3sk1fk1fl1.tiff
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif to ./CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif to ./CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif to ./CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
download: s3://cellpainting-gallery/cpg0016-jump/source_7/images/20210727_Run3/images/CP3-SC1-18/CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif to ./CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif

$ du -hs *tif *tiff
2.7M    CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif
2.7M    CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif
2.7M    CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif
2.7M    CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif
3.1M    r03c04f01p01-ch1sk1fk1fl1.tiff
2.8M    r04c18f02p01-ch4sk1fk1fl1.tiff
3.1M    r04c19f02p01-ch1sk1fk1fl1.tiff
2.6M    r04c37f04p01-ch4sk1fk1fl1.tiff
0       r11c22f08p01-ch3sk1fk1fl1.tiff

$ identify *tif *tiff
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F003L01A03Z01C04.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F004L01A01Z01C01.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F004L01A01Z01C02.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `CP3-SC1-18_I22_T0001F004L01A02Z01C03.tif' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r03c04f01p01-ch1sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r04c18f02p01-ch4sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r04c19f02p01-ch1sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Not a TIFF or MDI file, bad magic number 0 (0x0). `r04c37f04p01-ch4sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.
identify: Cannot read TIFF header. `r11c22f08p01-ch3sk1fk1fl1.tiff' @ error/tiff.c/TIFFErrors/599.

Notes:

  1. Those files seem to have the expected file size (except for the one from source 3, which is empty), but the magic number is invalid: identify reports 0 (0x0).
  2. I updated the list with 1 corrupt file from source 3.
  3. I finished downloading all other sources except source 11 and have not found any other corrupt files.
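
For anyone without ImageMagick at hand, the magic-number check from note 1 can be approximated by reading the first four bytes of each file. This is a minimal sketch, not the method used above: the function name tiff_header_status is hypothetical, and it only inspects the header, so it would not catch corruption further into the file.

```python
# Valid first four bytes of a TIFF file: a byte-order mark ("II" for
# little-endian, "MM" for big-endian) followed by the version number
# 42 (classic TIFF) or 43 (BigTIFF) encoded in that byte order.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*", b"II+\x00", b"MM\x00+")


def tiff_header_status(path):
    """Return 'ok', 'empty', or 'bad-magic' for the file at `path`.

    Hypothetical helper: it checks the header only, not the image data.
    """
    with open(path, "rb") as f:
        head = f.read(4)
    if not head:
        # Zero-byte file, like the one from source 3 in the du output.
        return "empty"
    return "ok" if head in TIFF_MAGIC else "bad-magic"
```

If the identify output above is representative, the eight full-size files should come back as bad-magic (their header bytes are zero) and the source 3 file as empty.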

niranjchandrasekaran commented 10 months ago

Thanks @hanslovsky for flagging these. ccing @shntnu to bring this to his attention.

hanslovsky commented 10 months ago

I downloaded all sources except source 11 (still working on that) and found only one additional corrupt file in source 3. All other sources (except 11) did not have corrupt files.

shntnu commented 6 months ago

Thank you so much for reporting this @hanslovsky

Arkkienkeli commented 4 months ago

I ran identify on all sources and created a list of all images that the utility reports as corrupted. If a value for Channel \ Well \ Site is missing, the image is not in the metadata. For example, all corrupted images in this list from source_10 are not in the metadata (this is probably described in https://github.com/jump-cellpainting/datasets/issues/61).

@shntnu @hanslovsky

Corrupted_images.csv

hanslovsky commented 4 months ago

@Arkkienkeli your findings are consistent with mine (I did not report any corrupted images that are not in the metadata), with the exception of the one image of source 11. I did not report anything for source 11 in this issue because I was still working on it at that time. I will double-check my records to see if I have any notes on corrupted files for source 11.

I know that I reported missing images for source 11 in #78 but I don't know if that includes any corrupted images.

cc @shntnu

hanslovsky commented 4 months ago

@Arkkienkeli I just double-checked the images I reported missing in source 11 (source_11-404.txt) and found the image you reported as corrupted in there as well. I can now say conclusively that our reports are consistent.

Please note that I also found some images in source 11 that were simply not present, in plates EC000038 and EC000066.

shntnu commented 4 months ago

I will drop in some notes for now

cat ~/Downloads/source_11-404.txt |cut -d"/" -f6|sort|uniq -c
6064 EC000038__2021-06-04T17_37_00-Measurement1
   2 EC000066__2021-06-06T12_36_15-Measurement1
   1 EC000070__2021-06-09T23_50_19-Measurement1
   1 failed-paths

csvcut -c Source,Batch,Plate ~/Downloads/Corrupted_images.csv |sort|uniq|wc -l
      19

csvcut -c Source,Batch ~/Downloads/Corrupted_images.csv |sort|uniq|wc -l
      15

csvcut -c Source ~/Downloads/Corrupted_images.csv |sort|uniq|wc -l
      6

csvcut -c Source ~/Downloads/Corrupted_images.csv |sort|uniq -c
   5 1
  23 10
   1 11
   1 3
   4 7
   1 Source
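
Note that the csvcut pipelines above pass the header row through sort and uniq, so each wc -l total appears to be one higher than the number of distinct values (the "1 Source" line shows this). A small sketch of the same per-column count that skips the header is below; count_by_column is a hypothetical helper, and the only column name assumed is Source, which is taken from the csvcut calls above.

```python
import csv
from collections import Counter


def count_by_column(csv_path, column="Source"):
    """Count rows per distinct value of `column`, excluding the header row."""
    with open(csv_path, newline="") as f:
        # DictReader consumes the header line, so it is never counted.
        return Counter(row[column] for row in csv.DictReader(f))
```

Summing the returned counts gives the total number of listed images, and len() of the result gives the number of distinct sources.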

Internal notes

  1. EC000038 in batch2: this plate is missing its metadata (XML file) and a significant number of images. I checked with XXX, and she says they are also missing on the microscopy server. Should this plate be skipped?
  2. Order of magnitude, how many images are missing: 10, 100, 1000, 10000? I assume that with no Index.idx.xml file you weren't able to run pe2loaddata, but it's pretty trivial to make the load_data and load_data_with_illum CSVs from another plate in the batch with a find-and-replace on the plate name (and by removing the missing files from the load_data CSV, per above). I think that as long as at least, say, half the plate is still present, there is no reason to throw out this data.
  3. EC000038 in batch2: I checked it and found that more than 2000 image sets are usable. I copied over the XML file from another plate and processed it.
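
The find-and-replace approach from note 2 could look roughly like the sketch below. It is an illustration under assumptions, not the procedure that was actually run: load_data_from_template is a hypothetical helper, the rows are assumed to come from csv.DictReader, and file-name columns are assumed to follow the CellProfiler LoadData convention of a FileName_ prefix.

```python
def load_data_from_template(template_rows, template_plate, new_plate, missing_files):
    """Build load_data rows for `new_plate` from another plate's rows.

    template_rows: list of dicts with string values, one per image set.
    missing_files: set of file names known to be missing or corrupt
                   on the new plate.
    """
    new_rows = []
    for row in template_rows:
        # Find-and-replace the plate name in every field (paths, metadata).
        new_row = {
            key: value.replace(template_plate, new_plate)
            for key, value in row.items()
        }
        # Assumed convention: columns starting with "FileName_" hold the
        # per-channel image file names of this image set.
        files = [v for k, v in new_row.items() if k.startswith("FileName_")]
        # Drop the whole image set if any of its files is unavailable.
        if any(name in missing_files for name in files):
            continue
        new_rows.append(new_row)
    return new_rows
```

Comparing len() of the result with len(template_rows) gives the fraction of image sets that survive, which is one way to apply the "at least half the plate" rule of thumb. One caveat: measurement folder names such as EC000038__2021-06-04T17_37_00-Measurement1 also carry a timestamp, so replacing the plate name alone may not be enough to fix the paths.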

shntnu commented 4 months ago

Alright, overall

@hanslovsky @Arkkienkeli -- thank you so much for reporting this! You can proceed by simply ignoring these images. Our task is to update the load_data files to remove the discrepancy.

shntnu commented 4 months ago

Regarding the corrupted files, we should likely take the same approach: drop them from load_data. @Arkkienkeli -- you can proceed by ignoring these images, because we no longer have access to the originals (thankfully, that's only 34 images out of the gazillion).