Closed grabear closed 4 years ago
@lgeistlinger Good idea on the new issue. AnnotationDbi::mapIds is used in 3 internal functions in mapIds.R, but it looks like it might only be relevant here:
https://github.com/lgeistlinger/EnrichmentBrowser/blob/4357b8004cdcd70093ec69b5b1448bd019fad77c/R/mapIds.R#L201-L219
https://github.com/lgeistlinger/EnrichmentBrowser/blob/4357b8004cdcd70093ec69b5b1448bd019fad77c/R/mapIds.R#L272-L290
@lgeistlinger
https://github.com/grabearummc/EnrichmentBrowser/commit/dbe316a84f258fe76d633e92356be2501683a784
Here's my fix. If you are happy with it, then I will create a PR. Other solutions might involve modifying the SE objects, which might look like this in the idMap function:
rowData(SE)[["ENSEMBL.id"]] <- names(SE)
names(SE) <- gsub("\\..*", "", names(SE))
Thanks. Can you provide an example where the mapping results in such versioned ENSEMBL gene ids? If that's caused by outdated mappings in the corresponding org.db package, then it is worth fixing it directly there instead of working around it downstream.
In my commit, I was removing the ENSEMBL versioning information before doing the mapping with AnnotationDbi. AnnotationDbi::mapIds will break if your keys/ids are ENSEMBL and have versioning.
I don't think that the org.db packages use the versioning information (which is the issue), but I could be wrong. Is that what you mean?
For me, the version info is introduced way before my R pipeline. For this instance specifically, I was using salmon/gencode for quantification.
I see, we are talking here about providing versioned IDs to the ID mapping. Well, although I can see that this might be handy to have, I think in this case it's best to leave it up to the user to provide valid (here: unversioned) gene IDs that are compatible with mapping via AnnotationDbi::mapIds. The good thing is that here it seems to take just a gsub command to have the IDs ready for the mapping.
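As a minimal sketch of that gsub-style step, assuming the IDs sit in a plain character vector (the variable names are illustrative and not part of EnrichmentBrowser):

```r
# Example IDs: one with a GENCODE-style version suffix, one without.
ids <- c("ENSG00000002919.14", "ENSG00000141510")

# Drop the first dot and everything after it; IDs without a dot are unchanged.
unversioned <- sub("\\..*$", "", ids)
print(unversioned)
```

Since each ID contains at most one version suffix, sub and gsub behave the same here.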
Ok, thanks for the response @lgeistlinger. When I have some extra time, I will get some feedback from the AnnotationDbi repository, and link back to this issue.
It might even be worth understanding why your GENCODE reference would include versioned gene IDs in the first place?
You got me curious @lgeistlinger. I definitely had to google some of this, so let me know if you have some insight.
ENSEMBL ids contain a version (ENS***.Version), so that when things change, the older references can be preserved. https://m.ensembl.org/Help/Faq?id=488 http://uswest.ensembl.org/info/genome/stable_ids/index.html
GENCODE is a project to create super accurate mouse/human genetic data from ENSEMBL. So they should have the versioning info. http://uswest.ensembl.org/Help/Faq?id=303 https://www.gencodegenes.org/pages/faq.html
My question is: why don't the OrgDbs contain the versioning information? Is it just because OrgDbs primarily map to the Entrez IDs?
I think it reflects the scope of the two different applications (read mapping vs gene ID mapping).
For read mapping, different versions of a gene ID can result in updates to the genomic coordinates / chromosomal location of the gene (eg when a novel transcript is annotated to the gene). This, in turn, can result also in a different read count for that gene, with eg more reads falling onto the updated coordinates.
For gene ID mapping, however, the version does not matter, as, when eg mapping from ENSEMBL IDs to gene symbols, ENSG00000002919 maps to SNX11, and thus so does ENSG00000002919.1, ENSG00000002919.2, ..., ENSG00000002919.14. Therefore AnnotationDbi also doesn't care about the versions. At least this is how I understand it.
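To illustrate the point, here is a small sketch with a toy lookup table standing in for the ENSEMBL-to-symbol mapping. The named vector is an assumption made for illustration; it is not a real OrgDb query and does not reproduce how AnnotationDbi handles invalid keys.

```r
# Toy stand-in for an ENSEMBL -> SYMBOL table keyed by unversioned IDs.
ensembl2symbol <- c(ENSG00000002919 = "SNX11")

# A few versioned forms of the same gene ID.
versioned <- paste0("ENSG00000002919.", c(1, 2, 14))

# Versioned keys are not present in the table, so the lookup yields NA.
direct <- unname(ensembl2symbol[versioned])

# After stripping the version, every key resolves to the same symbol.
stripped <- sub("\\.[0-9]+$", "", versioned)
mapped <- unname(ensembl2symbol[stripped])

print(direct)
print(mapped)
```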
Consider adding in functionality for EnrichmentBrowser::idMap so that it automatically validates/converts ENSEMBL ids from id.version to id (e.g. ENSG00000002919.14 to ENSG00000002919). Try to conserve id.version by adding another column to rowData. This is really more of an issue with AnnotationDbi, but it couldn't hurt.
Originally posted by @grabearummc in https://github.com/lgeistlinger/EnrichmentBrowser/issues/23#issuecomment-678355126
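A rough sketch of what such a validate-and-convert step could look like, using base R only. The helper name strip_ensembl_version is hypothetical and not part of EnrichmentBrowser, the data.frame merely stands in for rowData, and the pattern for versioned gene IDs (ENS, optional species letters, G, 11 digits, a dot, a version number) is an assumption about the ID format.

```r
# Hypothetical helper, not part of EnrichmentBrowser.
strip_ensembl_version <- function(ids) {
  # Assumed shape of a versioned ENSEMBL gene ID, e.g. ENSG00000002919.14
  versioned <- grepl("^ENS[A-Z]*G[0-9]{11}\\.[0-9]+$", ids)
  data.frame(
    # Unversioned ID where a version was detected, otherwise the input as-is.
    id = ifelse(versioned, sub("\\.[0-9]+$", "", ids), ids),
    # Original versioned ID, kept so the information is not lost.
    id.version = ifelse(versioned, ids, NA_character_),
    stringsAsFactors = FALSE
  )
}

res <- strip_ensembl_version(c("ENSG00000002919.14", "ENSG00000002919", "SNX11"))
print(res)
```

Only IDs that match the assumed pattern are touched, so gene symbols or other ID types that happen to contain a dot pass through unchanged.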