jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License
155 stars 16 forks source link

Some wells in load_data are missing (but are present in `wells.csv.gz`) #61

Open NinoDui opened 1 year ago

NinoDui commented 1 year ago

Hi there,

I happened to find the metadata for wells of source_10, batch 2021_08_12_U2OS_48_hr_run15, and plate Dest210803-160702 may be missed. May I get your help in double-checking it? Or feel free to correct me if I am not taking it at the right pace.

A quick demo of how to revise it:

S3_LOADDATA_FORMATTER = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)

nan_index = {
    'Metadata_Source': 'source_10',
    "Metadata_Batch": '2021_08_12_U2OS_48_hr_run15',
    "Metadata_Plate": 'Dest210803-160702'
}

s3_path = S3_LOADDATA_FORMATTER.format(**nan_index)
s3_nan_meta = pd.read_parquet(s3_path, storage_options={"anon": True})
wells_from_parquet = s3_nan_meta['Metadata_Well'].unique() # <----- Here wells are enumerated from A01 to C22

wells = pd.read_csv(os.path.join(DATA_ROOT, "metadata", "well.csv.gz"))
wells_plate_info = wells.loc[(wells['Metadata_Source'] == nan_index['Metadata_Source']) & (wells['Metadata_Plate'] == nan_index['Metadata_Plate']), :]
wells_from_plate = wells_plate_info['Metadata_Well'].unique() # <----- Here wells are enumerated from A01 to P24

It seems like the well info provided by wells.csv.gz is larger in amount compared to those retrieved from load_data_with_illum.parquet. Is that a corner case that I missed? Or is that being uploaded in progress?

Thanks for your time and effort.

Best wishes, Nino

niranjchandrasekaran commented 1 year ago

cc @shntnu

NinoDui commented 1 year ago

Hey, I was wondering if there's any update about it? Thanks.

niranjchandrasekaran commented 1 year ago

Hi @NinoDui, I checked the load_data_with_illum.parquet and well.csv.gz files and found the same missing wells as you did.

@shntnu it looks like load_data.csv.gz has all 384 wells, but both load_data_with_illum.csv.gz and load_data_with_illum.parquet have only 70 wells.

NinoDui commented 1 year ago

Hi @NinoDui, I checked the load_data_with_illum.parquet and well.csv.gz files and found the same missing wells as you did.

@shntnu it looks like load_data.csv.gz has all 384 wells, but both load_data_with_illum.csv.gz and load_data_with_illum.parquet have only 70 wells.

@niranjchandrasekaran

Hey Niranj, thank you for all the support. Much appreciated it if we could figure it out. Thanks in advance :)

shntnu commented 1 year ago

Update: This comment ended up being off-topic but I am retaining it here because it is useful information


I'll copy over some notes from an internal thread and then respond in the next comment


[...]

Relatedly, there are wells for which there are images, but they don't appear in the metadata files. I think we have 9 of these wells:

             plate well
36189   BR00121426  H14
39644   BR00121427  H14
43100   BR00121428  H14
53468   BR00121436  H14
598321  BR00125181  C10
663975  BR00125638  C10
665352  BR00125638  I19
948012  BR00127147  F16
948066  BR00127147  F22

If you take that first one, for example, it is definitely the case that the files exist in the S3 buckets:

$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_4/images/2021_08_09_Batch11/images/BR00121426__2021-08-14T15_29_30-Measurement1/Images/ | grep r08c14 | head -5
2022-10-20 22:08:31    1853777 r08c14f01p01-ch1sk1fk1fl1.tiff
2022-10-20 22:05:06    1904709 r08c14f01p01-ch2sk1fk1fl1.tiff
2022-10-20 22:08:32    1712279 r08c14f01p01-ch3sk1fk1fl1.tiff
2022-10-20 22:08:32    1749803 r08c14f01p01-ch4sk1fk1fl1.tiff
2022-10-20 22:05:07    1816711 r08c14f01p01-ch5sk1fk1fl1.tiff

...but it does not exist in the metadata manifest:

In [23]: wells[(wells.Metadata_Plate == "BR00121426") & (wells.Metadata_Well == "H14")]
Out[23]: 
Empty DataFrame
Columns: [Metadata_Source, Metadata_Plate, Metadata_Well, Metadata_JCP2022]
Index: []

On further probing, it turned out that the wells were blank.

image

shntnu commented 1 year ago

As noted in https://github.com/jump-cellpainting/datasets/issues/61#issuecomment-1499094909, if a well is empty, the images will exist but no profiles will be created, and thus there will be no entry in wells.csv.gz

I'll add this issue to our FAQ #62

niranjchandrasekaran commented 1 year ago

As noted in https://github.com/jump-cellpainting/datasets/issues/61#issuecomment-1499094909, if a well is empty, the images will exist but no profiles will be created, and thus there will be no entry in wells.csv.gz

IIUC, the problem seems to be the other way around. For this plate, there are wells and in well.csv.gz and the aggregated profile parquet file also has all 384 wells. But the load_data_with_illum files seem to be missing wells.

shntnu commented 1 year ago

Sorry, I totally missed that.

it looks like load_data.csv.gz has all 384 wells, but both load_data_with_illum.csv.gz and load_data_with_illum.parquet have only 70 wells.

load_data.csv.gz seems to have the same number of rows as load_data_with_illum.csv.gz. can you verify @niranjchandrasekaran ?

aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data.csv.gz -|gunzip - |wc -l
#     417

aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_10/workspace/load_data_csv/2021_08_12_U2OS_48_hr_run15/Dest210803-160702/load_data_with_illum.csv.gz -|gunzip - |wc -l
#     417

That said, I confirmed that the images folder does have all 384 wells

aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_10/images/2021_08_12_U2OS_48_hr_run15/images/Dest210803-160702/ |gzip > ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz

gzcat ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz|grep tif$|grep Dest21|tr -s " "|cut -d" " -f4|cut -d"_" -f2|sort|uniq -c|wc -l
#      384

gzcat ~/Desktop/source_10_2021_08_12_U2OS_48_hr_run15_Dest210803-160702.txt.gz|grep tif$|grep Dest21|tr -s " "|cut -d" " -f4|cut -d"_" -f2|sort|uniq -c|cut -d" " -f3|sort |uniq -c
# 384 48
niranjchandrasekaran commented 1 year ago

load_data.csv.gz seems to have the same number of rows as load_data_with_illum.csv.gz. can you verify @niranjchandrasekaran ?

You are right. load_data.csv.gz also has the same number of wells as the two illum files.

shntnu commented 1 year ago

@NinoDui Thank you so much for flagging this! Can you help us report how prevalent this issue is?

Here's how you'd do it

use https://github.com/broadinstitute/position-effect-correction/blob/aabfac2de536447ad095489be6eca73f3a5ae026/1.load/load.py

import pandas as pd

source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]

df_all = pd.DataFrame()

for source_id in source_ids:

    df = load(
        dataset="cpg0016-jump",
        source=f"source_{source_id}}",
        component="load_data_csv",
        batch="2021_06_14_Batch6",
        plate="BR00121429",
        columns=["Metadata_Source", "Metadata_Plate", "Metadata_Well"],
    )

    # keep only distinct rows
    df = df.drop_duplicates()

    df_all = df_all.append(df)

Now report which wells are present in well.csv.gz but absent in df_all

Warning: I haven't checked the code (that was autogenerated by Copilot :D)

NinoDui commented 1 year ago

I've just checked with my own comparison script and found 125 mismatches (at plate level) between the number of wells available on a plate and those provided by corresponding load_data_with_illum.parquet.

Results

The result is attached.

well_diff.csv

Script

How the wells from parquet are counted:

S3_LOADDATA_FORMATTER = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)

def fetch_well_from_parquet(row: pd.Series):
    meta_path = S3_LOADDATA_FORMATTER.format(**row)
    meta = pd.read_parquet(meta_path, storage_options=REMOTE_STORAGE_OPTION)
    wells_from_parquet = meta['Metadata_Well'].unique()
    return wells_from_parquet.shape[0]

How the number of wells from plate & well metas are calculated:

wells_info = wells.merge(plates, on=['Metadata_Source', 'Metadata_Plate'])
wells_info = wells_info[["Metadata_Source", "Metadata_Batch", "Metadata_Plate", "Metadata_Well"]]
well_count = wells_info.groupby(['Metadata_Source', 'Metadata_Batch', 'Metadata_Plate'])\
        .agg(n_well_on_plate=('Metadata_Well', pd.Series.count))

@NinoDui Thank you so much for flagging this! Can you help us report how prevalent this issue is?

Here's how you'd do it

use https://github.com/broadinstitute/position-effect-correction/blob/aabfac2de536447ad095489be6eca73f3a5ae026/1.load/load.py

import pandas as pd

source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]

df_all = pd.DataFrame()

for source_id in source_ids:

    df = load(
        dataset="cpg0016-jump",
        source=f"source_{source_id}}",
        component="load_data_csv",
        batch="2021_06_14_Batch6",
        plate="BR00121429",
        columns=["Metadata_Source", "Metadata_Plate", "Metadata_Well"],
    )

    # keep only distinct rows
    df = df.drop_duplicates()

    df_all = df_all.append(df)

Now report which wells are present in well.csv.gz but absent in df_all

Warning: I haven't checked the code (that was autogenerated by Copilot :D)

shntnu commented 1 year ago

This is so helpful! Thank you very much.

Looking at your CSV file, I am relived to note that the only issue across the entire dataset is with that one plate you reported originally. For all others, there are more wells with images than with profiles, and this can happen, as mentioned in one of my previous comments above.

We will recreate the load data parquet for that plate.

This might take a while until we get to it. Will that block you?

Thanks again!

April 2024 Update: I took the CSV in https://github.com/jump-cellpainting/datasets/issues/61#issuecomment-1508083098 and turned it into a table below for easy searching.

`weird` means `more_wells_with_profiles_than_with_images` Metadata_Source | Metadata_Plate | n_well_on_plate | n_well_from_parquet |weird -- | -- | -- | -- | -- source_1 | UL001659 | 1471 | 1472 | FALSE source_1 | UL001661 | 1355 | 1472 | FALSE source_1 | UL001649 | 1471 | 1472 | FALSE source_1 | UL000083 | 1419 | 1472 | FALSE source_1 | UL000093 | 1471 | 1472 | FALSE source_1 | UL001769 | 1471 | 1472 | FALSE source_1 | UL001771 | 1471 | 1472 | FALSE source_1 | UL000597 | 1309 | 1472 | FALSE source_1 | UL001717 | 1471 | 1472 | FALSE source_10 | Dest210608-152610 | 383 | 384 | FALSE source_10 | Dest210608-153057 | 383 | 384 | FALSE source_10 | Dest210614-164418 | 382 | 384 | FALSE source_10 | Dest210614-164906 | 383 | 384 | FALSE source_10 | Dest210615-151336 | 383 | 384 | FALSE source_10 | Dest210622-144809 | 383 | 384 | FALSE source_10 | Dest210726-155958 | 383 | 384 | FALSE source_10 | Dest210726-160957 | 383 | 384 | FALSE source_10 | Dest210726-162012 | 383 | 384 | FALSE source_10 | Dest210803-160702 | 384 | 70 | TRUE source_10 | Dest210809-134534 | 383 | 384 | FALSE source_10 | Dest210810-173723 | 383 | 384 | FALSE source_10 | Dest210823-173617 | 383 | 384 | FALSE source_11 | EC000004 | 383 | 384 | FALSE source_11 | EC000038 | 250 | 384 | FALSE source_11 | EC000045 | 189 | 384 | FALSE source_11 | EC000048 | 126 | 384 | FALSE source_11 | EC000068 | 382 | 383 | FALSE source_11 | LM37-70_1 | 383 | 384 | FALSE source_11 | EC000137 | 250 | 384 | FALSE source_11 | EC000115 | 378 | 381 | FALSE source_11 | EC000121 | 381 | 383 | FALSE source_11 | EC000094 | 360 | 384 | FALSE source_11 | LM71-102_1 | 383 | 384 | FALSE source_11 | LM71-102_2 | 382 | 384 | FALSE source_2 | 1053601763 | 383 | 384 | FALSE source_2 | 1053601770 | 383 | 384 | FALSE source_2 | 1053601831 | 383 | 384 | FALSE source_2 | 1053601909 | 383 | 384 | FALSE source_2 | 1053600674 | 383 | 384 | FALSE source_2 | 1053600728 | 383 | 384 | FALSE source_2 | 1053600803 | 383 | 384 | FALSE source_2 | 1053600810 | 383 | 384 | FALSE source_2 | 1053600834 | 383 | 384 | FALSE source_2 | 1053600858 | 383 | 384 | FALSE source_2 | 1053600872 | 383 | 384 | FALSE source_2 | 1053599503 | 378 | 384 | FALSE source_2 | 1053599602 | 383 | 384 | FALSE source_2 | 1053599657 | 383 | 384 | FALSE source_2 | 1086293911 | 383 | 384 | FALSE source_2 | 1086293997 | 383 | 384 | FALSE source_2 | 1086293423 | 383 | 384 | FALSE source_2 | 1086293447 | 383 | 384 | FALSE source_2 | 1086293485 | 382 | 384 | FALSE source_2 | 1086293027 | 383 | 384 | FALSE source_2 | 1086293034 | 383 | 384 | FALSE source_2 | 1086293133 | 383 | 384 | FALSE source_2 | 1086293164 | 383 | 384 | FALSE source_2 | 1086293232 | 383 | 384 | FALSE source_2 | 1086292723 | 381 | 384 | FALSE source_2 | 1086292761 | 383 | 384 | FALSE source_2 | 1086292884 | 383 | 384 | FALSE source_2 | 1086292976 | 383 | 384 | FALSE source_2 | 1086292389 | 383 | 384 | FALSE source_2 | 1086292396 | 383 | 384 | FALSE source_2 | 1086292440 | 382 | 384 | FALSE source_2 | 1086292495 | 382 | 384 | FALSE source_2 | 1086291931 | 383 | 384 | FALSE source_2 | 1086291948 | 382 | 384 | FALSE source_2 | 1086291962 | 382 | 384 | FALSE source_2 | 1086291979 | 382 | 384 | FALSE source_2 | 1086291986 | 382 | 384 | FALSE source_2 | 1086292013 | 381 | 384 | FALSE source_2 | 1086292037 | 383 | 384 | FALSE source_2 | 1086292044 | 382 | 384 | FALSE source_2 | 1086292051 | 382 | 384 | FALSE source_2 | 1086292075 | 380 | 384 | FALSE source_2 | 1086292082 | 383 | 384 | FALSE source_2 | 1086292099 | 380 | 384 | FALSE source_2 | 1086292105 | 382 | 384 | FALSE source_2 | 1086292136 | 382 | 384 | FALSE source_2 | 1086292143 | 382 | 384 | FALSE source_2 | 1086292150 | 383 | 384 | FALSE source_2 | 1086289792 | 383 | 384 | FALSE source_3 | BR5874c3 | 383 | 384 | FALSE source_3 | SP01P04b | 381 | 384 | FALSE source_3 | B040203d | 382 | 384 | FALSE source_3 | A13407bW | 383 | 384 | FALSE source_3 | P24P27dW | 319 | 384 | FALSE source_3 | DMSOC45 | 383 | 384 | FALSE source_3 | JCPQC036 | 383 | 384 | FALSE source_3 | BAY5872a | 383 | 384 | FALSE source_3 | BAY5874b | 382 | 384 | FALSE source_3 | BAY5874c | 383 | 384 | FALSE source_3 | BAY5875a | 383 | 384 | FALSE source_3 | BAY5875c | 383 | 384 | FALSE source_3 | BAY5876d | 383 | 384 | FALSE source_4 | BR00121436 | 383 | 384 | FALSE source_4 | BR00121428 | 383 | 384 | FALSE source_4 | BR00125181 | 383 | 384 | FALSE source_4 | BR00125638 | 382 | 384 | FALSE source_4 | BR00121427 | 383 | 384 | FALSE source_4 | BR00121426 | 383 | 384 | FALSE source_4 | BR00127147 | 382 | 384 | FALSE source_5 | ACPJUM032 | 382 | 384 | FALSE source_5 | APTJUM230 | 383 | 384 | FALSE source_5 | ATSJUM107 | 383 | 384 | FALSE source_6 | 1.1E+11 | 383 | 384 | FALSE source_6 | 1.1E+11 | 383 | 384 | FALSE source_6 | 1.1E+11 | 383 | 384 | FALSE source_6 | 1.1E+11 | 383 | 384 | FALSE source_6 | 1.1E+11 | 383 | 384 | FALSE source_7 | CP1-SC1-08 | 381 | 384 | FALSE source_7 | CP1-SC1-09 | 383 | 384 | FALSE source_8 | A1170444 | 383 | 384 | FALSE source_8 | A1170509 | 383 | 384 | FALSE source_8 | A1170531 | 383 | 384 | FALSE source_8 | A1170533 | 382 | 384 | FALSE source_8 | A1170538 | 383 | 384 | FALSE source_8 | A1170542 | 382 | 384 | FALSE source_8 | A1170543 | 383 | 384 | FALSE source_8 | A1170544 | 383 | 384 | FALSE source_8 | A1166130 | 383 | 384 | FALSE source_8 | A1166132 | 382 | 384 | FALSE source_8 | A1166134 | 383 | 384 | FALSE source_8 | A1166137 | 383 | 384 | FALSE
NinoDui commented 1 year ago

Cool! Glad to know the issue led to limited effect. Thanks for all the checking and effort behind.

I am taking on DL Model experiments based on the images and I could exclude the issued one manually. That's not blocking for short.

Still hope to hear the issue be settled and benefit a larger group of researchers. You're actually doing something not only meaningful but cool.

Best regards, NinoDui

This is so helpful! Thank you very much.

Looking at your CSV file, I am relived to note that the only issue across the entire dataset is with that one plate you reported originally. For all others, there are more wells with images than with profiles, and this can happen, as mentioned in one of my previous comments above.

We will recreate the load data parquet for that plate.

This might take a while until we get to it. Will that block you?

Thanks again!