ENSEMBL ID version conversion

grabear commented 4 years ago

Consider adding functionality to EnrichmentBrowser::idMap so that it automatically validates/converts ENSEMBL ids from id.version to id (e.g. ENSG00000002919.14 to ENSG00000002919). Try to conserve id.version by adding another column to rowData. This is really more of an issue with AnnotationDbi, but it couldn't hurt.

```
gsub("\\..*", "", row.names(ens_table))
```

Originally posted by @grabearummc in https://github.com/lgeistlinger/EnrichmentBrowser/issues/23#issuecomment-678355126

grabear commented 4 years ago

@lgeistlinger Good idea on the new issue.

AnnotationDbi::mapIds is used in 3 internal functions in mapIds.R, but it looks like it might only be relevant here:

https://github.com/lgeistlinger/EnrichmentBrowser/blob/4357b8004cdcd70093ec69b5b1448bd019fad77c/R/mapIds.R#L201-L219
https://github.com/lgeistlinger/EnrichmentBrowser/blob/4357b8004cdcd70093ec69b5b1448bd019fad77c/R/mapIds.R#L272-L290

grabear commented 4 years ago

@lgeistlinger

https://github.com/grabearummc/EnrichmentBrowser/commit/dbe316a84f258fe76d633e92356be2501683a784

Here's my fix. If you are happy with it, then I will create a PR. Other solutions might involve:

- detecting ENSEMBL ids with version info and then making a change.
- same as the previous, but also conserving the original ids in another column in the original object.
  - for SE objects that might look like this in the idMap function:
```
# keep the original (versioned) ids in a new rowData column
rowData(SE)[["ENSEMBL.id"]] <- names(SE)
# drop everything from the first dot onward, i.e. the version suffix
names(SE) <- gsub("\\..*", "", names(SE))
```
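
The first option, detecting versioned ids before changing anything, could look roughly like the sketch below. It uses base R only. The helper names `has_ensembl_version` and `strip_ensembl_version` and the regular expression are assumptions made for illustration; they are not part of EnrichmentBrowser or AnnotationDbi.

```r
# Hypothetical helpers (not part of EnrichmentBrowser or AnnotationDbi).
# Assumed id pattern: "ENS", optional species/feature letters,
# 11 digits, then ".<version>".
has_ensembl_version <- function(ids) {
  grepl("^ENS[A-Z]*[0-9]{11}\\.[0-9]+$", ids)
}

# Remove the ".<version>" suffix only from ids that match the pattern,
# leaving every other id (e.g. gene symbols) untouched.
strip_ensembl_version <- function(ids) {
  versioned <- has_ensembl_version(ids)
  ids[versioned] <- sub("\\.[0-9]+$", "", ids[versioned])
  ids
}

ids <- c("ENSG00000002919.14", "ENSG00000002919", "SNX11")
has_ensembl_version(ids)
strip_ensembl_version(ids)
```

Compared with a blanket `gsub("\\..*", "", ...)`, this only touches ids that look like versioned ENSEMBL ids, so other id types containing a dot would be left alone.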

lgeistlinger commented 4 years ago

Thanks. Can you provide an example where the mapping results in such versioned ENSEMBL gene ids? If that's caused by outdated mappings in the corresponding org.db package, then it is worth fixing it directly there instead of working around it downstream.

grabearummc commented 4 years ago

I was removing the ENSEMBL versioning information in my commit before doing the mapping with AnnotationDbi. AnnotationDbi::mapIds will break if your keys/ids are ENSEMBL and have versioning.

I don't think that the org.db packages use the versioning information (which is the issue), but I could be wrong. Is that what you mean?

For me, the version info is introduced way before my R pipeline. For this instance specifically, I was using salmon/gencode for quantification.

lgeistlinger commented 4 years ago

I see we are talking here about providing versioned IDs to the ID mapping. Well, although I can see that this might be handy to have, I think in this case, it's best to leave it up to the user to provide valid (here: unversioned) gene IDs that are compatible with mapping via AnnotationDbi::mapIds. Good thing is, here it seems to be just a gsub command to have the IDs ready for the mapping.
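
As a rough illustration of that user-side step, the sketch below strips the version and then looks the ids up in a toy named vector. The vector `toy_map` is only a stand-in for an org.db lookup and does not reproduce how `AnnotationDbi::mapIds` actually behaves on invalid keys. The ENSG00000002919 / SNX11 pairing is the one mentioned in this thread.

```r
# Toy stand-in for an ENSEMBL -> SYMBOL lookup (NOT a real org.db query).
toy_map <- c(ENSG00000002919 = "SNX11")

versioned_ids <- c("ENSG00000002919.14")

# Versioned keys find no match in the unversioned lookup.
unname(toy_map[versioned_ids])

# Stripping everything from the first dot onward makes the keys match.
clean_ids <- gsub("\\..*", "", versioned_ids)
unname(toy_map[clean_ids])
```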

grabearummc commented 4 years ago

Ok, thanks for the response @lgeistlinger. When I have some extra time, I will get some feedback from the AnnotationDbi repository, and link back to this issue.

lgeistlinger commented 4 years ago

It might even be worth understanding why your GENCODE reference would include versioned gene IDs in the first place.

grabearummc commented 4 years ago

You got me curious, @lgeistlinger. I definitely had to google some of this, so let me know if you have some insight.

ENSEMBL ids contain a version (ENS***.Version), so that when things change......

- Genes: increments when the set of transcripts linked to a gene changes
- Transcripts: increments when there is a change in a transcript's splicing pattern, chromosome location or a sequence change in the cDNA
- Proteins: increments when there is a sequence change in the peptide sequence
- Exons: increments when there is a sequence change in the exon genomic sequence

......the older references can be preserved.
https://m.ensembl.org/Help/Faq?id=488
http://uswest.ensembl.org/info/genome/stable_ids/index.html
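
To make the id.version structure concrete, here is a small base R sketch that splits an id into its stable part and an integer version. The column names are assumptions chosen for illustration.

```r
# Split ENSEMBL ids into the stable id and an integer version.
# Ids without a ".<version>" suffix get version NA.
ids <- c("ENSG00000002919.14", "ENSG00000002919")

has_version <- grepl("\\.[0-9]+$", ids)
stable_id <- sub("\\.[0-9]+$", "", ids)

version <- rep(NA_integer_, length(ids))
version[has_version] <- as.integer(sub("^.*\\.", "", ids[has_version]))

data.frame(id = ids, stable_id = stable_id, version = version)
```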

GENCODE is a project to create super accurate mouse/human genetic data from ENSEMBL. So they should have the versioning info. http://uswest.ensembl.org/Help/Faq?id=303 https://www.gencodegenes.org/pages/faq.html

My question is why don't the OrgDbs contain the versioning information? Is it just because OrgDbs primarily map to Entrez IDs?

lgeistlinger commented 4 years ago

I think it reflects the scope of the two different applications (read mapping vs gene ID mapping).

For read mapping, different versions of a gene ID can result in updates to the genomic coordinates / chromosomal location of the gene (e.g. when a novel transcript is annotated to the gene). This, in turn, can also result in a different read count for that gene, with e.g. more reads falling onto the updated coordinates.

For gene ID mapping, however, the version does not matter: when e.g. mapping from ENSEMBL IDs to gene symbols, ENSG00000002919 maps to SNX11, and so do ENSG00000002919.1, ENSG00000002919.2, ..., ENSG00000002919.14. Therefore AnnotationDbi also doesn't care about the versions. At least this is how I understand it.

lgeistlinger / EnrichmentBrowser

ENSEMBL ID version conversion #24