mahmoodlab / HEST

HEST: Bringing Spatial Transcriptomics and Histopathology together - NeurIPS 2024

HESTData: provide util to map Ensembl ID to gene name #71

Closed konst-int-i closed 1 week ago

konst-int-i commented 2 weeks ago

This PR

To dos

Run instructions

from hest import iter_hest, ensembleID_to_gene

# Three samples whose var_names are Ensembl gene IDs
id_list = ['SPA118', 'SPA117', 'SPA116']

for st in iter_hest('/home/iain/kh/ssd/hest_data/', id_list=id_list):
    # Before mapping: var_names should start with "ENSG"
    print(any(var_name.startswith("ENSG") for var_name in st.adata.var_names))
    print(st.adata.var_names[:5])

    st_updated = ensembleID_to_gene(st)

    # After mapping: var_names should be gene names
    print(any(var_name.startswith("ENSG") for var_name in st_updated.adata.var_names))
    print(st_updated.adata.var_names[:5])

Expected output:

True
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')
False
Index(['TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM'], dtype='object', name='gene_name')
True
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')
False
Index(['TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM'], dtype='object', name='gene_name')
True
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')
False
Index(['TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM'], dtype='object', name='gene_name')

pauldoucet commented 2 weeks ago

Hi @konst-int-i, Have you tried on TENX24? It gives me 0 valid genes, weird

konst-int-i commented 2 weeks ago

> Hi @konst-int-i, Have you tried on TENX24? It gives me 0 valid genes, weird

Yes, that's because TENX24 doesn't contain any Ensembl IDs, and the default behavior was to invalidate genes without a mapping. I changed it to keep genes without a mapping in 4e369f9.
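
To make the difference between the two behaviors concrete, here is a minimal standalone sketch. It is not the code from 4e369f9: `map_var_names`, `keep_unmapped`, and the two-entry lookup table are hypothetical names made up for illustration, and the real `ensembleID_to_gene` may be structured differently. The two ID-to-name pairs are taken from the expected output above.

```python
# Hypothetical lookup table for illustration only; a real one would be
# built from a full Ensembl annotation.
ENSEMBL_TO_NAME = {
    "ENSG00000000003": "TSPAN6",
    "ENSG00000000005": "TNMD",
}


def map_var_names(var_names, mapping, keep_unmapped=True):
    """Rename IDs found in `mapping`.

    Names without a mapping are kept unchanged when `keep_unmapped` is True
    and dropped otherwise (the earlier "invalidate" behavior).
    """
    renamed = []
    for name in var_names:
        if name in mapping:
            renamed.append(mapping[name])
        elif keep_unmapped:
            renamed.append(name)
    return renamed


# A sample with Ensembl IDs, and a stand-in for a sample such as TENX24
# whose var_names are already gene names.
ensembl_sample = ["ENSG00000000003", "ENSG00000000005"]
symbol_sample = ["TSPAN6", "TNMD"]

mapped = map_var_names(ensembl_sample, ENSEMBL_TO_NAME)
dropped = map_var_names(symbol_sample, ENSEMBL_TO_NAME, keep_unmapped=False)
kept = map_var_names(symbol_sample, ENSEMBL_TO_NAME, keep_unmapped=True)

print(mapped)        # ['TSPAN6', 'TNMD']
print(len(dropped))  # 0, which matches the "0 valid genes" symptom
print(kept)          # ['TSPAN6', 'TNMD']
```

With dropping enabled, a sample that has no Ensembl IDs loses every gene, whereas keeping unmapped names leaves such a sample unchanged.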

pauldoucet commented 1 week ago

looks good to me, thanks for the addition!