How does one identify replicates?

williamdee1 commented 1 year ago

Hi, am I correct in thinking that if two datapoints in the well.csv.gz share the same 'Metadata_JCP2022' identifier then they are replicates of one another? I guess this excludes 'JCP2022_999999' which I think represents non-Compound perturbations.

If this is the case, most compound IDs seem to have 3, 4 or 5 replicates within the well-level data. However, some have orders of magnitude more than that: for example, 'JCP2022_037716' appears to have 9,099 associated datapoints as per the latest 'well.csv.gz' file (see image below).

Is this the correct way to think about the identifiers, and if so why do some have so many replicates within the dataset?

Thank you in advance for your help, Will

niranjchandrasekaran commented 1 year ago

Hi Will,

Thanks for your interest in our dataset! My answers are below.

> am I correct in thinking that if two datapoints in the well.csv.gz share the same 'Metadata_JCP2022' identifier then they are replicates of one another?

That's correct.

> I guess this excludes 'JCP2022_999999' which I think represents non-Compound perturbations.

JCP2022_999999 denotes untreated wells (these wells contain only cells).
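For anyone who wants to check replicate counts on their own copy of the file, here is a minimal sketch using only the Python standard library. It groups wells by 'Metadata_JCP2022' and drops the untreated identifier. The helper names (`read_wells`, `count_replicates`) and the toy identifiers are made up for illustration; only 'Metadata_JCP2022' and 'JCP2022_999999' come from this thread.

```python
import csv
import gzip
from collections import Counter

UNTREATED_ID = "JCP2022_999999"  # untreated wells, as described in this thread


def read_wells(path):
    """Yield one dict per row of a gzipped CSV such as well.csv.gz."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle)


def count_replicates(rows, id_column="Metadata_JCP2022"):
    """Count wells per identifier, leaving out untreated wells."""
    counts = Counter(row[id_column] for row in rows)
    counts.pop(UNTREATED_ID, None)  # untreated wells are not compound replicates
    return counts


# Toy rows standing in for the real file; these identifiers are hypothetical.
toy_rows = [
    {"Metadata_JCP2022": "JCP2022_000001"},
    {"Metadata_JCP2022": "JCP2022_000001"},
    {"Metadata_JCP2022": "JCP2022_000001"},
    {"Metadata_JCP2022": "JCP2022_000002"},
    {"Metadata_JCP2022": "JCP2022_999999"},
]
counts = count_replicates(toy_rows)
for identifier, n_wells in counts.most_common():
    print(identifier, n_wells)
```

On the real data, `count_replicates(read_wells("well.csv.gz"))` should give the same kind of tally, assuming the file is in the working directory.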

> If this is the case, most compound IDs seem to have 3, 4 or 5 replicates within the well-level data. However, some have orders of magnitude more than that: for example, 'JCP2022_037716' appears to have 9,099 associated datapoints as per the latest 'well.csv.gz' file (see image below). Is this the correct way to think about the identifiers, and if so why do some have so many replicates within the dataset?

There are four replicates of eight positive control compounds on every compound plate, and JCP2022_037716 is one of them, which explains the large number of replicates. You will also find these compounds in the ORF and CRISPR plates (but the number of replicates and the number of compounds will be different).
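One rough way to spot such controls is to look at how many distinct plates each identifier appears on, since a control placed on every plate will stand out from ordinary compounds. The sketch below assumes the well table also has a plate column, here called 'Metadata_Plate' (an assumption, not something stated in this thread); the function names, plate names, and 'JCP2022_000001' are hypothetical. Note that untreated wells may show up as widely plated too.

```python
from collections import defaultdict


def plate_coverage(rows, id_column="Metadata_JCP2022", plate_column="Metadata_Plate"):
    """Map each identifier to (number of wells, number of distinct plates)."""
    wells = defaultdict(int)
    plates = defaultdict(set)
    for row in rows:
        wells[row[id_column]] += 1
        plates[row[id_column]].add(row[plate_column])
    return {cid: (wells[cid], len(plates[cid])) for cid in wells}


def widely_plated(rows, min_fraction=1.0,
                  id_column="Metadata_JCP2022", plate_column="Metadata_Plate"):
    """Return identifiers present on at least min_fraction of all plates.

    `rows` must be a list (it is traversed twice), not a one-shot iterator.
    """
    n_plates = len({row[plate_column] for row in rows})
    coverage = plate_coverage(rows, id_column, plate_column)
    return sorted(cid for cid, (_, n) in coverage.items()
                  if n >= min_fraction * n_plates)


# Toy data: one control with four wells on each of three plates,
# plus an ordinary compound that appears on only two plates.
toy_rows = []
for plate in ("plate_A", "plate_B", "plate_C"):
    toy_rows += [{"Metadata_Plate": plate, "Metadata_JCP2022": "JCP2022_037716"}] * 4
toy_rows += [{"Metadata_Plate": "plate_A", "Metadata_JCP2022": "JCP2022_000001"}] * 2
toy_rows += [{"Metadata_Plate": "plate_B", "Metadata_JCP2022": "JCP2022_000001"}]

coverage = plate_coverage(toy_rows)
controls = widely_plated(toy_rows)
print(coverage)
print(controls)
```

Because the real file mixes compound, ORF and CRISPR plates, a threshold below 1.0 for `min_fraction`, or filtering to compound plates first, is probably needed there.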

We will soon have a preprint on bioRxiv which will provide more details about the experiment and plate design.

williamdee1 commented 1 year ago

Hi, thanks very much for the quick and detailed response! I am eagerly awaiting the pre-print :)

Is there a rough timeline for that and for the 'Curated annotations' for the compounds (I can start a new issue if you'd prefer I ask this separately?)

shntnu commented 1 year ago

> Is there a rough timeline for that and for the 'Curated annotations' for the compounds (I can start a new issue if you'd prefer I ask this separately?)

@williamdee1

We are tracking that question here: https://github.com/jump-cellpainting/datasets/issues/13#issuecomment-1315434338. As Niranj mentioned, a PubChem query using the InChIKey would already work.

Also, the file inchikey_chembl.csv.gz in https://github.com/jump-cellpainting/compound-annotator has ChEMBL IDs for ~30k of the compounds.
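As a rough illustration of the PubChem route mentioned above, the sketch below builds a PUG REST lookup URL from an InChIKey and, optionally, fetches the matching compound IDs. The function names are made up for this example, the InChIKey shown is aspirin's rather than one taken from the dataset, and the response parsing assumes the usual `IdentifierList` JSON layout, so treat it as a starting point rather than a tested client.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_cid_url(inchikey):
    """Build the PUG REST URL that looks up PubChem CIDs for an InChIKey."""
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/cids/JSON"


def fetch_cids(inchikey, timeout=10):
    """Query PubChem over the network and return a list of CIDs (may be empty)."""
    with urlopen(pubchem_cid_url(inchikey), timeout=timeout) as response:
        payload = json.load(response)
    # Assumed response layout: {"IdentifierList": {"CID": [...]}}
    return payload.get("IdentifierList", {}).get("CID", [])


# Example InChIKey (aspirin), used only to show the URL shape.
example_url = pubchem_cid_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
print(example_url)
```

Calling `fetch_cids(...)` needs network access, and for many compounds it is worth pausing between requests to stay within PubChem's usage limits.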

We are hoping to wrap this up by mid-March 🤞

Do you have any specific annotations in mind that you'd like this dataset to have?

williamdee1 commented 1 year ago

Ah that would be fantastic, thanks for letting me know! :)

MOA (mechanism of action) annotations for the compounds would be great, if possible.

jump-cellpainting/datasets, issue #49