snap-stanford / GEARS

GEARS is a geometric deep learning model that predicts outcomes of novel multi-gene perturbations
MIT License

Duplicated gene names in replogle_k562_essential #68

Closed: zhan8855 closed this issue 3 months ago

zhan8855 commented 3 months ago

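Hi, I get the following error when checking that the gene names in this dataset are unique:

from gears import PertData

pert_data = PertData('./data')
pert_data.load(data_name='replogle_k562_essential')
gene_name = pert_data.adata.var['gene_name'].tolist()

# Fails if any gene name appears more than once
assert len(gene_name) == len(set(gene_name))
# raises AssertionError

yhr91 commented 3 months ago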

Yes, I believe one gene name appears twice in this dataset. This doesn't affect the model, because it mainly relies on gene indices rather than gene names during training and prediction.
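
To see which name is duplicated and where it sits, something like the following might help. This is only a sketch, not part of GEARS: `duplicated_name_positions` is a hypothetical helper, and the toy list stands in for the `gene_name` list built in the snippet above (the names are placeholders, not real genes).

```python
from collections import defaultdict


def duplicated_name_positions(gene_names):
    """Map each repeated gene name to the list of positions where it occurs.

    Hypothetical helper for illustration; it is not a GEARS function.
    """
    positions = defaultdict(list)
    for i, name in enumerate(gene_names):
        positions[name].append(i)
    # Keep only names that occur at more than one position.
    return {name: idx for name, idx in positions.items() if len(idx) > 1}


# Toy stand-in for pert_data.adata.var['gene_name'].tolist()
toy_gene_names = ["GENE_A", "GENE_B", "GENE_A", "GENE_C"]
result = duplicated_name_positions(toy_gene_names)
print(result)
```

Each occurrence of the repeated name has its own position, which is why index-based lookups stay unambiguous even though name-based lookups do not.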

zhan8855 commented 3 months ago

But is it biologically sound?

yhr91 commented 3 months ago

Yes, gene names are arbitrary labels assigned to segments of DNA. For example, multiple Ensembl IDs often map to the same gene name.
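
If downstream code does need unique labels, one option is to append an identifier only to the ambiguous names. The sketch below assumes that a list of unique gene identifiers is available alongside the names (for example, from the index of `adata.var`, which this thread does not confirm); `make_unique_labels` and the `ID_...` values are hypothetical placeholders, not real Ensembl IDs.

```python
from collections import Counter


def make_unique_labels(gene_ids, gene_names):
    """Return one label per gene, suffixing the ID only where the name repeats.

    Hypothetical helper for illustration; assumes gene_ids are unique and
    aligned one-to-one with gene_names.
    """
    counts = Counter(gene_names)
    return [
        f"{name}_{gene_id}" if counts[name] > 1 else name
        for gene_id, name in zip(gene_ids, gene_names)
    ]


# Toy data: two made-up identifiers share the name "GENE_A".
toy_ids = ["ID_0001", "ID_0002", "ID_0003"]
toy_names = ["GENE_A", "GENE_B", "GENE_A"]
labels = make_unique_labels(toy_ids, toy_names)
print(labels)
```

Names that are already unique are left untouched, so only the repeated entry changes.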

zhan8855 commented 3 months ago

Thank you very much!!