jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

How to define the label in the ORF dataset (cpg0016) #70

Closed liujunhongznn closed 1 year ago

liujunhongznn commented 1 year ago

Thank you for your great work on this dataset! We would like to pre-train an image classification model on cpg0016, but I am unsure how to define the label for the classification task. Is the column "Metadata_JCP2022" the label? In the COMPOUND and CRISPR parts of the dataset, "Metadata_JCP2022" maps one-to-one to "Metadata_InChIKey" and "Metadata_NCBI_Gene_ID" respectively, so "Metadata_JCP2022" seems usable as the label there. In the ORF part, however, the mapping is one-to-many: a single Metadata_NCBI_Gene_ID can have more than one Metadata_JCP2022 ID. How should the label be defined for a classification task on the ORF dataset: Metadata_NCBI_Gene_ID or Metadata_JCP2022? Thanks a lot!

niranjchandrasekaran commented 1 year ago

Hi @liujunhongznn, my recommendation would be to use Metadata_JCP2022 as the label. As you observed, the mapping from Metadata_JCP2022 to Metadata_NCBI_Gene_ID can be many-to-one because, in the ORF dataset, multiple ORF reagents target the same gene. We don't have information about how the ORF reagents that target the same gene differ from one another (we used a pre-built ORF library for our data production); therefore, it would be better to treat these ORF reagents as distinct.
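For readers following along, here is a minimal sketch of that recommendation using pandas. It is not code from the JUMP repository: the table is a toy stand-in with made-up IDs, and the names `reagents_per_gene`, `genes_with_multiple_reagents`, and `label` are hypothetical. It first shows which genes are targeted by more than one reagent, then assigns one integer class per Metadata_JCP2022 value.

```python
import pandas as pd

# Toy stand-in for the ORF metadata table. The IDs are invented for illustration;
# with the real file one would load it instead, e.g. pd.read_csv("orf.csv.gz")
# (the local path is an assumption).
orf = pd.DataFrame({
    "Metadata_JCP2022": ["JCP2022_900001", "JCP2022_900002", "JCP2022_900003"],
    "Metadata_NCBI_Gene_ID": ["1111", "1111", "2222"],
})

# Count distinct reagents per gene to see where the mapping is many-to-one.
reagents_per_gene = orf.groupby("Metadata_NCBI_Gene_ID")["Metadata_JCP2022"].nunique()
genes_with_multiple_reagents = reagents_per_gene[reagents_per_gene > 1].index.tolist()

# Treat every reagent as its own class: one integer label per Metadata_JCP2022.
codes, classes = pd.factorize(orf["Metadata_JCP2022"], sort=True)
orf["label"] = codes

print(genes_with_multiple_reagents)
print(orf)
```

With this labeling, two reagents that target the same gene end up in different classes, which matches the advice to treat them as distinct.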

liujunhongznn commented 1 year ago

OK. I also found that some rows in orf.csv.gz have a "Metadata_JCP2022" but no corresponding "Metadata_NCBI_Gene_ID", i.e. some entries in the "Metadata_NCBI_Gene_ID" column are NaN. Can these rows be used when constructing the dataset for my classification task?

niranjchandrasekaran commented 1 year ago

We are unsure what genes these ORFs correspond to, so I think it is better to remove these from your task, just to be on the safer side.
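A minimal sketch of that filtering step with pandas, again on a toy table with made-up IDs rather than the real orf.csv.gz. The names `has_gene`, `orf_labeled`, and `n_dropped` are hypothetical. Rows whose Metadata_NCBI_Gene_ID is missing are dropped before labels are assigned.

```python
import pandas as pd

# Toy stand-in for the ORF metadata table; the second reagent has no gene ID.
orf = pd.DataFrame({
    "Metadata_JCP2022": ["JCP2022_900001", "JCP2022_900002", "JCP2022_900003"],
    "Metadata_NCBI_Gene_ID": ["1111", None, "2222"],
})

# Keep only reagents whose target gene is known.
has_gene = orf["Metadata_NCBI_Gene_ID"].notna()
orf_labeled = orf[has_gene].reset_index(drop=True)

# Report how many rows were excluded so the filtering is visible.
n_dropped = int((~has_gene).sum())
print(f"dropped {n_dropped} row(s) without a gene ID")
print(orf_labeled)
```

Filtering before building the class list keeps the unannotated reagents from becoming classes of their own.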

liujunhongznn commented 1 year ago

I see, thank you very much!