Open hanslovsky opened 10 months ago
Thanks @hanslovsky for flagging these. ccing @shntnu to bring this to his attention.
I downloaded all sources except source 11 (still working on that) and found only one additional corrupt file in source 3. All other sources (except 11) did not have corrupt files.
Thank you so much for reporting this @hanslovsky
source_11 had any corrupt files?
images.csv to report missing/corrupt images). I ran identify on all sources and created a list of all corrupted images according to this utility.
If a value from Channel \ Well \ Site is missing, it means that the image is not in the metadata. For example, all corrupted images in this list from source_10 are actually not in the metadata (this is probably described in https://github.com/jump-cellpainting/datasets/issues/61).
@shntnu @hanslovsky
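For anyone who wants to repeat this kind of scan, here is a minimal sketch. It assumes the identify mentioned above is ImageMagick's identify, and that a non-zero exit status is a good enough signal for "corrupt"; the thread does not say which options were actually used, so a truncated file that only triggers a warning may slip through. The function name find_corrupt_images, the optional checker argument, and the .tif/.tiff extensions are illustrative choices, not something taken from the dataset tooling.

```shell
# Print the path of every TIFF under a directory that the checker rejects.
# Assumptions: images end in .tif or .tiff; the checker (default: ImageMagick's
# `identify`) exits non-zero for a file it cannot read.
find_corrupt_images() {
    root=$1
    checker=${2:-identify}   # hypothetical hook so another validator can be swapped in
    find "$root" -type f \( -name '*.tiff' -o -name '*.tif' \) | sort |
    while IFS= read -r f; do
        # Discard the checker's own output; only its exit status matters here.
        "$checker" "$f" < /dev/null > /dev/null 2>&1 || printf '%s\n' "$f"
    done
}

# Example (hypothetical path):
#   find_corrupt_images /data/jump/source_7 > corrupt_source_7.txt
```

Keeping the checker as a parameter makes it easy to cross-check the list with a second tool, since different readers can disagree about what counts as corrupt.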
@Arkkienkeli your findings are consistent with mine (I did not report any corrupted images that are not in the metadata), with the exception of the one image of source 11. I did not report anything for source 11 in this issue because I was still working on it at that time. I will double-check my records to see if I have any notes on corrupted files for source 11.
I know that I reported missing images for source 11 in #78 but I don't know if that includes any corrupted images.
cc @shntnu
@Arkkienkeli I just double-checked the images I reported missing in source 11 (source_11-404.txt) and found the image you reported as corrupted in there as well. Now I can conclusively say that our reports are consistent.
Please note that I also found some images in source 11 that were simply not present, in plates EC000038 and EC000066.
I will drop in some notes for now
cat ~/Downloads/source_11-404.txt |cut -d"/" -f6|sort|uniq -c
6064 EC000038__2021-06-04T17_37_00-Measurement1
2 EC000066__2021-06-06T12_36_15-Measurement1
1 EC000070__2021-06-09T23_50_19-Measurement1
1 failed-paths
csvcut -c Source,Batch,Plate ~/Downloads/Corrupted_images.csv |sort|uniq|wc -l
19
csvcut -c Source,Batch ~/Downloads/Corrupted_images.csv |sort|uniq|wc -l
15
csvcut -c Source ~/Downloads/Corrupted_images.csv |sort|uniq|wc -l
6
csvcut -c Source ~/Downloads/Corrupted_images.csv |sort|uniq -c
5 1
23 10
1 11
1 3
4 7
1 Source
Internal notes
Alright, overall:
- EC000038: the files are missing here because we created the load_data file by hand (see internal notes in the previous comment). We should edit the load_data to filter out the sites that have a missing image.
- EC000066 and EC000070: it turns out these two plates are also among those where we created the load_data file by hand, so we should do the same here.
- source_11 plates missing load_data files: EC000038, EC000066, EC000070, EC000156, EC000157, so we should expect similar issues with all of these.

@hanslovsky @Arkkienkeli, thank you so much for reporting this! You can proceed by simply ignoring these images. Our task is to update the load data files to remove the discrepancy.
Regarding the corrupted files, we should likely take the same strategy: drop them from load_data. @Arkkienkeli, you can proceed by ignoring these images because we no longer have access to the originals (thankfully that's only 34 images out of the gazillion).
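Dropping the affected sites from load_data could look roughly like the sketch below. It is only an illustration under several assumptions: the bad sites are listed in a header-less CSV with one plate,well,site triple per line, the load_data file is a simple CSV with no quoted fields containing commas, and its key columns are named Metadata_Plate, Metadata_Well and Metadata_Site. The function name filter_load_data and both file layouts are hypothetical; the real load_data files may need different column names, which can be passed as the third to fifth arguments.

```shell
# Print a load_data CSV without the rows whose (plate, well, site) is listed
# in a bad-sites file.
# Assumptions: plain CSVs without quoted commas; bad-sites file has no header;
# the default column names are guesses and can be overridden via $3..$5.
filter_load_data() {
    bad_sites=$1
    load_data=$2
    awk -F',' \
        -v plate_col="${3:-Metadata_Plate}" \
        -v well_col="${4:-Metadata_Well}" \
        -v site_col="${5:-Metadata_Site}" '
        # First file: remember every bad plate,well,site triple.
        FILENAME == ARGV[1] { bad[$1 FS $2 FS $3] = 1; next }
        # Header of load_data: find the key columns, then pass the header through.
        FNR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == plate_col) p = i
                if ($i == well_col)  w = i
                if ($i == site_col)  s = i
            }
            if (!p || !w || !s) { print "key column not found" > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Data rows: keep only the sites that are not listed as bad.
        { key = $p FS $w FS $s; if (!(key in bad)) print }
    ' "$bad_sites" "$load_data"
}

# Example (hypothetical file names):
#   filter_load_data bad_sites.csv load_data.csv > load_data_filtered.csv
```

Filtering on the whole site rather than on a single file name means that one missing or corrupted channel removes every channel of that site, which matches the "filter out the sites that have a missing image" plan above.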
I found a few corrupt tiff files in the JUMP production dataset. So far, I have only seen corrupt tiff files in sources 1 and 7 (4 files each). I will report back any additional corrupt tiff files that I may find during my download/conversion.
Here is what I have so far:
How to confirm that these files are corrupt:
Notes: