jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

Missing labels for some wells in COMPOUND_EMPTY plates in source_1 #75

Open hanslovsky opened 11 months ago

hanslovsky commented 11 months ago

I join the metadata from the load_data_with_illum.parquet files with the data in well.csv.gz so that I can download the images for a plate together with the associated perturbations. For plate UL001661 in source_1, some of the COMPOUND_EMPTY wells listed in load_data_with_illum.parquet have no label in well.csv.gz. Note: jc.MetadataFiles.{get_well,get_plate} are convenience functions that read the metadata files at commit 4b24577c2d3228d92177b807fa53fbbc623da1cb. This is not the most recent commit on main; I will double-check against the most recent commit as well.

In [72]: import pandas as pd

In [73]: import jump_conversion as jc

In [74]: load_data = pd.read_parquet(Path.home() / 'data/jump.zarr/.cache/cpg0016-jump/source_1/workspace/load_data_csv/Batch1_20221004/UL001661/load_data_with_illum.parquet').assign(Metadata_Plate='UL001661')

In [75]: well = jc.MetadataFiles.get_well()

In [76]: plate = jc.MetadataFiles.get_plate()

In [77]: with_jcp = load_data.merge(well, how='left', on=['Metadata_Plate', 'Metadata_Well'])

In [78]: with_jcp[with_jcp.Metadata_JCP2022.isnull()]
Out[78]:
     Metadata_Source_x   Metadata_Batch Metadata_Plate Metadata_Well Metadata_Site      FileName_IllumAGP  ...                                   PathName_OrigDNA                                    PathName_OrigER                                  PathName_OrigMito                                   PathName_OrigRNA Metadata_Source_y Metadata_JCP2022
184           source_1  Batch1_20221004       UL001661           B02             1  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
185           source_1  Batch1_20221004       UL001661           B02             2  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
186           source_1  Batch1_20221004       UL001661           B02             3  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
187           source_1  Batch1_20221004       UL001661           B02             4  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
192           source_1  Batch1_20221004       UL001661           B04             1  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
...                ...              ...            ...           ...           ...                    ...  ...                                                ...                                                ...                                                ...                                                ...               ...              ...
3815          source_1  Batch1_20221004       UL001661           U35             4  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
4760          source_1  Batch1_20221004       UL001661           Z42             1  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
4761          source_1  Batch1_20221004       UL001661           Z42             2  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
4762          source_1  Batch1_20221004       UL001661           Z42             3  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
4763          source_1  Batch1_20221004       UL001661           Z42             4  UL001661_IllumAGP.npy  ...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...  s3://cellpainting-gallery/cpg0016-jump/source_...               NaN              NaN
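
A minimal sketch of the same check on toy data, assuming the column names shown in the session above. The two small frames stand in for load_data_with_illum.parquet and well.csv.gz, and the JCP2022 value is a placeholder, not a real identifier. Passing `indicator=True` to the merge marks rows that have no match in the well table, which gives one row per unlabeled well instead of one row per site:

```python
import pandas as pd

# Toy stand-ins for the real files; all values here are illustrative only.
load_data = pd.DataFrame(
    {
        "Metadata_Plate": ["UL001661"] * 4,
        "Metadata_Well": ["A01", "A01", "B02", "B02"],
        "Metadata_Site": [1, 2, 1, 2],
    }
)
well = pd.DataFrame(
    {
        "Metadata_Plate": ["UL001661"],
        "Metadata_Well": ["A01"],
        "Metadata_JCP2022": ["JCP2022_EXAMPLE"],  # placeholder label
    }
)

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join.
merged = load_data.merge(
    well, how="left", on=["Metadata_Plate", "Metadata_Well"], indicator=True
)

# Keep one row per well that has no entry in the well table.
missing_wells = (
    merged.loc[merged["_merge"] == "left_only", ["Metadata_Plate", "Metadata_Well"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(missing_wells)
```

Filtering on `_merge` rather than on a null Metadata_JCP2022 separates wells that are absent from the well table from wells that are present but carry an empty label.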

At first, I thought these might be blank images as described in https://github.com/jump-cellpainting/datasets/issues/61#issuecomment-1499094909 but plate UL001661 is not listed in that comment. I downloaded one DNA channel image for one of the wells that I identified:

s3://cellpainting-gallery/cpg0016-jump/source_1/images/Batch1_20221004/images/UL001661__2022-10-05T05_07_32-Measurement1/Images/r02c02f01p01-ch4sk1fk1fl1.tiff
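
A rough sketch of how a well name could be turned into the file name prefix seen in the path above. It assumes that row letters map to row numbers (A=1, B=2, ..., Z=26, AA=27) and that the `r..c..f..` pattern holds for other wells and sites. Both are inferred from this single path and the well names in the table, not from documentation. The function name is hypothetical, and the plane and channel suffix is deliberately left out:

```python
import re


def well_to_image_prefix(well: str, site: int = 1) -> str:
    """Convert a well name such as 'B02' to a prefix such as 'r02c02f01'.

    Assumption: row letters are numbered A=1 ... Z=26, AA=27, and so on.
    """
    match = re.fullmatch(r"([A-Z]{1,2})(\d+)", well)
    if match is None:
        raise ValueError(f"Unrecognized well name: {well!r}")
    letters, column = match.group(1), int(match.group(2))

    # Treat the letters as a base-26 number with digits A=1 ... Z=26.
    row = 0
    for letter in letters:
        row = row * 26 + (ord(letter) - ord("A") + 1)

    return f"r{row:02d}c{column:02d}f{site:02d}"


print(well_to_image_prefix("B02"))
print(well_to_image_prefix("Z42", site=4))
```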

The image is not blank, but it is very noisy, with strong artifacts and a visible well edge:

[Image: dna]

Did these wells fail QA, so that they should be excluded and were therefore left out of the metadata? Can I extrapolate that to any other well that is missing from well.csv.gz?

Thank you!

hanslovsky commented 11 months ago

I will label those as JCP2022_NAN for my own record keeping so I can easily exclude them.
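
One way this labeling could look, as a sketch on toy data. JCP2022_NAN is the sentinel named above; the frame contents and the JCP2022_EXAMPLE value are placeholders:

```python
import pandas as pd

# Toy result of the left join; None marks a well with no entry in well.csv.gz.
with_jcp = pd.DataFrame(
    {
        "Metadata_Plate": ["UL001661", "UL001661"],
        "Metadata_Well": ["A01", "B02"],
        "Metadata_JCP2022": ["JCP2022_EXAMPLE", None],  # placeholder label
    }
)

SENTINEL = "JCP2022_NAN"

# Replace missing labels with the sentinel so they are explicit in the table.
labeled = with_jcp.assign(
    Metadata_JCP2022=with_jcp["Metadata_JCP2022"].fillna(SENTINEL)
)

# Excluding the unlabeled wells later is then a plain comparison.
usable = labeled[labeled["Metadata_JCP2022"] != SENTINEL]
print(usable)
```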

niranjchandrasekaran commented 11 months ago

Thanks @hanslovsky for flagging this. QC issues could be the reason.

@shntnu, were wells left out of well.csv.gz because of QC issues?