jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

Source of truth for all images with perturbation labels #72

Closed hanslovsky closed 3 months ago

hanslovsky commented 11 months ago

I am currently preparing JUMP for our image processing pipeline. We are mostly interested in all images plus the perturbation label for each well. What is the source of truth for all wells in the dataset? I was able to find some sort of metadata file (Index.idx.xml, indexfile.txt, MeasurementData.mlf) in the images prefix for all plates except those from sources 7 and 8. I use that file to create my own metadata table and join it with metadata/well.csv.gz to get the well treatment labels.
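The join described above could look roughly like the following minimal sketch, which uses only the Python standard library. The column names (Metadata_Source, Metadata_Plate, Metadata_Well, Metadata_JCP2022) and the sample rows are assumptions for illustration; they are not confirmed in this thread and should be checked against the actual metadata/well.csv.gz.

```python
def well_key(row):
    """Build the (source, plate, well) key used to match rows (assumed column names)."""
    return (row["Metadata_Source"], row["Metadata_Plate"], row["Metadata_Well"])


def join_wells(image_rows, well_rows):
    """Left-join per-image rows onto well-level perturbation labels."""
    labels = {well_key(r): r["Metadata_JCP2022"] for r in well_rows}
    joined = []
    for row in image_rows:
        out = dict(row)
        # None marks an image whose well has no label in the well table.
        out["Metadata_JCP2022"] = labels.get(well_key(row))
        joined.append(out)
    return joined


# Hypothetical rows, standing in for a self-built image table and well.csv.gz.
image_rows = [
    {"Metadata_Source": "source_x", "Metadata_Plate": "plate_a", "Metadata_Well": "A01"},
    {"Metadata_Source": "source_x", "Metadata_Plate": "plate_a", "Metadata_Well": "A02"},
]
well_rows = [
    {"Metadata_Source": "source_x", "Metadata_Plate": "plate_a",
     "Metadata_Well": "A01", "Metadata_JCP2022": "JCP2022_000001"},
]
joined = join_wells(image_rows, well_rows)
```

A left join keeps images without a label visible (as None) instead of silently dropping them, which makes gaps in the metadata easier to spot.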

I have now found load_data_csv, which may actually be a better source for the metadata. It covers all plates except the following (I have not checked sources 7 and 8 yet):

[('source_3', 'C13451bW'),
 ('source_3', 'C13451dW'),
 ('source_3', 'C13495dW'),
 ('source_3', 'J12440d'),
 ('source_3', 'SP16P19c'),
 ('source_3', 'SP24P27c'),
 ('source_3', 'SP24P27d')]

The sample_notebook.ipynb uses load_data_with_illum.parquet. I ran the same analysis for the parquet files and found that the same plates are missing for parquet.
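A check like the one described above can be sketched as a set difference between the plates that are expected and the plates for which a load_data file was found. The function name and the example inputs are hypothetical; only the two source_3 plate names are taken from the list above.

```python
def plates_without_load_data(all_plates, plates_with_load_data):
    """Return the (source, plate) pairs that have no load_data file, sorted."""
    return sorted(set(all_plates) - set(plates_with_load_data))


# Hypothetical inputs: PLATE_OK stands in for any plate that has a load_data file.
all_plates = [
    ("source_3", "C13451bW"),
    ("source_3", "J12440d"),
    ("source_3", "PLATE_OK"),
]
with_load_data = [("source_3", "PLATE_OK")]

missing = plates_without_load_data(all_plates, with_load_data)
```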

Now I am thinking that I should use metadata/plate.csv.gz to identify all plates, then find the corresponding load_data_with_illum.parquet file for each plate, and download the data that way. Is this the preferred way to download/process the images?
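As a rough sketch of that approach, the snippet below reads a gzipped plate table and builds one load_data_with_illum.parquet path per plate. The bucket name, the path template, and the column names (Metadata_Source, Metadata_Batch, Metadata_Plate) are assumptions that are not confirmed in this thread; verify them against the repository before relying on them. The sample table is made up.

```python
import csv
import gzip
import io

# Assumed path template; check it against the actual bucket layout.
LOAD_DATA_TEMPLATE = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)


def load_data_paths(plate_csv_gz_bytes):
    """Yield one load_data path per row of a gzipped plate table."""
    buffer = io.BytesIO(plate_csv_gz_bytes)
    with gzip.open(buffer, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            # Columns not named in the template are ignored by str.format.
            yield LOAD_DATA_TEMPLATE.format(**row)


# Hypothetical stand-in for the contents of metadata/plate.csv.gz.
sample = gzip.compress(
    b"Metadata_Source,Metadata_Batch,Metadata_Plate\n"
    b"source_x,batch_y,plate_z\n"
)
paths = list(load_data_paths(sample))
```

Each resulting path can then be handed to whatever Parquet reader the pipeline already uses.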

niranjchandrasekaran commented 11 months ago

Now I am thinking that I should use metadata/plate.csv.gz to identify all plates, then find the corresponding load_data_with_illum.parquet file for each plate, and download the data that way. Is this the preferred way to download/process the images?

Hi @hanslovsky, I believe you are on the right track. Tagging @shntnu who can confirm if this is the recommended approach.

hanslovsky commented 11 months ago

Awesome, thank you! That makes things a lot easier on my side.

Arkkienkeli commented 4 months ago

Hello, are the metadata files for the above-mentioned plates actually missing, or is there another source of metadata for those particular plates? Thank you! @niranjchandrasekaran

cp_26_all_phenix1/j12440d/
cp_28_all_phenix1/sp24p27c/
cp_25_all_phenix1/c13451bw/
cp_25_all_phenix1/c13451dw/
cp_28_all_phenix1/sp16p19c/
cp_25_all_phenix1/c13495dw/
cp_28_all_phenix1/sp24p27d/

shntnu commented 3 months ago

Based on our internal notes, these plates were dropped because they failed QC. However, we retained the images in case we wanted to use them to develop QC approaches.

From https://github.com/jump-cellpainting/aws/issues/73#issuecomment-1063006775:

I would vote for removing these plates from the bucket entirely, just to avoid future confusion. If we want to keep the images to develop QC approaches, I would just delete the <plate>.csv.gz files from their profiles/<batch>/ directory. Then it should be clear, that the augmented, normalized, etc. profiles are not missing.
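Because the images of the QC-failed plates are still present, a pipeline that enumerates image prefixes directly may want to skip them explicitly. The sketch below is one possible way to do that, assuming plate names can be matched case-insensitively (the prefixes earlier in this thread are lowercase, while the plate list uses mixed case) and that matching on the plate name alone is specific enough. The function names and the PLATE_OK entry are hypothetical.

```python
# Batch/plate prefixes of the dropped plates, as listed earlier in this thread.
DROPPED_PREFIXES = [
    "cp_26_all_phenix1/j12440d/",
    "cp_28_all_phenix1/sp24p27c/",
    "cp_25_all_phenix1/c13451bw/",
    "cp_25_all_phenix1/c13451dw/",
    "cp_28_all_phenix1/sp16p19c/",
    "cp_25_all_phenix1/c13495dw/",
    "cp_28_all_phenix1/sp24p27d/",
]


def dropped_plate_names(prefixes):
    """Extract lowercase plate names from 'batch/plate/' prefixes."""
    return {prefix.strip("/").split("/")[-1].lower() for prefix in prefixes}


def keep_qc_passed(plates, prefixes):
    """Keep only (source, plate) pairs whose plate is not in the dropped set."""
    dropped = dropped_plate_names(prefixes)
    return [(source, plate) for source, plate in plates if plate.lower() not in dropped]


# Hypothetical plate list; PLATE_OK stands in for a plate that passed QC.
plates = [
    ("source_3", "C13451bW"),
    ("source_3", "PLATE_OK"),
    ("source_3", "SP24P27d"),
]
kept = keep_qc_passed(plates, DROPPED_PREFIXES)
```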

shntnu commented 3 months ago

I've added this to the FAQ issue https://github.com/jump-cellpainting/datasets/issues/62#issuecomment-1999463070