Closed. liujunhongznn closed this issue 1 year ago.
Hi @liujunhongznn, my recommendation would be to use Metadata_JCP2022 as the label. As you observed, the mapping between Metadata_JCP2022 and Metadata_NCBI_Gene_ID can be many-to-one because, in the ORF dataset, multiple ORF reagents target the same gene. We don't have information about how the ORF reagents that target the same gene differ from one another (we used a pre-built ORF library for our data production); therefore, it would be better to treat these ORF reagents as distinct.
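To make the many-to-one mapping concrete, here is a minimal sketch of how one might check which genes are targeted by more than one reagent and then assign one class per Metadata_JCP2022 value. The toy rows and their ID values are made up for illustration, and the helper names `reagents_per_gene` and `build_label_index` are hypothetical, not part of any JUMP tooling.

```python
from collections import defaultdict

# Toy rows standing in for ORF metadata; the ID values are made up.
rows = [
    {"Metadata_JCP2022": "JCP2022_900001", "Metadata_NCBI_Gene_ID": "1111"},
    {"Metadata_JCP2022": "JCP2022_900002", "Metadata_NCBI_Gene_ID": "1111"},
    {"Metadata_JCP2022": "JCP2022_900003", "Metadata_NCBI_Gene_ID": "2222"},
]


def reagents_per_gene(rows):
    """Map each gene ID to the set of JCP2022 IDs (reagents) that target it."""
    by_gene = defaultdict(set)
    for row in rows:
        by_gene[row["Metadata_NCBI_Gene_ID"]].add(row["Metadata_JCP2022"])
    return dict(by_gene)


def build_label_index(rows):
    """Assign one integer class per distinct Metadata_JCP2022 value.

    IDs are sorted first so the class numbering is stable across runs.
    """
    ids = sorted({row["Metadata_JCP2022"] for row in rows})
    return {jcp: i for i, jcp in enumerate(ids)}


by_gene = reagents_per_gene(rows)
# Genes targeted by more than one reagent (the many-to-one case).
multi = {gene: ids for gene, ids in by_gene.items() if len(ids) > 1}
label_index = build_label_index(rows)
```

With labels keyed on Metadata_JCP2022, two reagents for the same gene end up in different classes, which matches the suggestion to treat them as distinct.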
OK. I also found that some rows in orf.csv.gz have a "Metadata_JCP2022" value but no corresponding "Metadata_NCBI_Gene_ID", i.e. the entry in the "Metadata_NCBI_Gene_ID" column is NaN. Can these rows be used to construct the dataset for my classification task?
We are unsure which genes these ORFs correspond to, so I think it is better to remove them from your task, just to be on the safe side.
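A minimal sketch of that filtering step, assuming the metadata has been read into a list of dicts with Python's `csv` module. The toy CSV text and its ID values are made up; the `MISSING` set and the helpers `has_gene_id` and `drop_unannotated` are hypothetical names. Which strings mark a missing value in the real file is also an assumption, so check it against orf.csv.gz before relying on this.

```python
import csv
import io

# Toy CSV text standing in for orf.csv.gz; values are made up.
toy_csv = """Metadata_JCP2022,Metadata_NCBI_Gene_ID
JCP2022_900001,1111
JCP2022_900004,
JCP2022_900005,nan
JCP2022_900003,2222
"""

# Assumed spellings of a missing gene ID (empty cell or a NaN-like string).
MISSING = {"", "nan", "na"}


def has_gene_id(row):
    """Return True if the row carries a usable Metadata_NCBI_Gene_ID."""
    value = (row.get("Metadata_NCBI_Gene_ID") or "").strip().lower()
    return value not in MISSING


def drop_unannotated(rows):
    """Keep only rows whose gene ID is present."""
    return [row for row in rows if has_gene_id(row)]


rows = list(csv.DictReader(io.StringIO(toy_csv)))
kept = drop_unannotated(rows)
```

For the real file, the same reader could be fed from `gzip.open(path, "rt", newline="")`, where `path` points at a local copy of orf.csv.gz. If the table is already in a pandas DataFrame, `df.dropna(subset=["Metadata_NCBI_Gene_ID"])` should achieve the same result.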
I see, thank you very much!
Thank you for your great work on this dataset! We would like to pre-train an image classification model on the cpg0016 dataset, but I am confused about how to define the label for the classification task correctly. Is the column "Metadata_JCP2022" the label? I found that in the COMPOUND and CRISPR parts of the dataset, "Metadata_JCP2022" and "Metadata_NCBI_Gene_ID"/"Metadata_InChIKey" map one-to-one, so it seems that "Metadata_JCP2022" could be the label. In the ORF part of the dataset, however, the mapping from "Metadata_NCBI_Gene_ID" to "Metadata_JCP2022" is one-to-many: one Metadata_NCBI_Gene_ID can have more than one Metadata_JCP2022 ID. My question is how to define the label for the classification task in the ORF dataset: Metadata_NCBI_Gene_ID or Metadata_JCP2022? Thanks a lot!