saezlab / scverse_hackathon

MIT License

User-story: gene ID remapping #1

Open grst opened 1 year ago

grst commented 1 year ago

When working with single-cell datasets from different sources, a common task is to harmonize gene identifiers.

Let's assume we have an AnnData object with gene symbols as var_names:

>>> adata.var_names
Index(['MIR1302-10', 'FAM138A', 'OR4F5', 'RP11-34P13.7', 'RP11-34P13.8',
       'AL627309.1', 'RP11-34P13.14', 'RP11-34P13.9', 'AP006222.2',
       'RP4-669L17.10',
       ...
       'KIR3DL2-1', 'AL590523.1', 'CT476828.1', 'PNRC2-1', 'SRSF10-1',
       'AC145205.1', 'BAGE5', 'CU459201.1', 'AC002321.2', 'AC002321.1'],
      dtype='object', name='index', length=32738)

and I want to map them to the latest Ensembl IDs.

It would be great to have a function that works roughly like this:

>>> remap_gene_ids(adata, source="hgnc", target="ensembl", key_added="ensg")
>>> adata.var["ensg"]
index
MIR1302-10      ENSG00000243485
FAM138A         ENSG00000237613
OR4F5           ENSG00000186092
RP11-34P13.7    ENSG00000238009
RP11-34P13.8    ENSG00000239945
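
A minimal sketch of the lookup step such a function might perform, assuming a symbol-to-Ensembl table is already available locally. The name `remap_ids`, its `mapping` argument, and the hard-coded table are hypothetical and only for illustration. A real implementation would load the table from an annotation resource. The five pairs are copied from the example output above.

```python
from typing import Dict, Iterable, List, Optional


def remap_ids(ids: Iterable[str], mapping: Dict[str, str]) -> List[Optional[str]]:
    """Map each identifier through `mapping`; unmapped identifiers become None."""
    return [mapping.get(i) for i in ids]


# Hypothetical lookup table; pairs copied from the example output above.
SYMBOL_TO_ENSG = {
    "MIR1302-10": "ENSG00000243485",
    "FAM138A": "ENSG00000237613",
    "OR4F5": "ENSG00000186092",
    "RP11-34P13.7": "ENSG00000238009",
    "RP11-34P13.8": "ENSG00000239945",
}

symbols = ["MIR1302-10", "FAM138A", "OR4F5", "NOT_A_GENE"]
ensg = remap_ids(symbols, SYMBOL_TO_ENSG)

# Collect symbols without a match so they can be reported instead of silently dropped.
unmapped = [s for s, e in zip(symbols, ensg) if e is None]
```

With an AnnData object, the result could then be stored as `adata.var[key_added] = remap_ids(adata.var_names, mapping)`. Keeping unmapped entries as missing values preserves the length and order of the gene index.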

ivirshup commented 1 year ago

I would add here that you probably need to be able to specify the version of Ensembl. E.g. you don't want to be getting hg38 identifiers if you're trying to integrate with hg19 data.
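
One way to make the version explicit, sketched under the assumption that one lookup table per release is available. The function name `remap_ids_for_release`, the `TABLES` dictionary, the release numbers used as keys, and all symbols and IDs are placeholders, not real annotation data.

```python
from typing import Dict, List, Optional

# Hypothetical per-release lookup tables; all symbols and IDs are placeholders.
TABLES: Dict[int, Dict[str, str]] = {
    75: {"GENE_A": "ID_A_OLD", "GENE_B": "ID_B"},
    110: {"GENE_A": "ID_A_NEW", "GENE_B": "ID_B"},
}


def remap_ids_for_release(ids: List[str], release: int) -> List[Optional[str]]:
    """Map identifiers using the table for one explicitly requested release."""
    if release not in TABLES:
        raise ValueError(
            f"no mapping table for release {release}; available: {sorted(TABLES)}"
        )
    table = TABLES[release]
    # Identifiers missing from this release map to None.
    return [table.get(i) for i in ids]


old = remap_ids_for_release(["GENE_A", "GENE_B"], release=75)
new = remap_ids_for_release(["GENE_A", "GENE_B"], release=110)
```

Requiring the caller to pass `release`, rather than silently defaulting to the latest one, makes it harder to mix annotations from different assemblies by accident.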

slobentanzer commented 1 year ago

pretty close to https://pypath.omnipathdb.org/#id-conversion functionality. could probably be wrapped without too much of a hassle. may make sense to do the module extraction we had planned in the context of the hackathon, though. perhaps we can also include a little refactor for performance. @deeenes, what do you think?