idMap/.idmap not giving the correct numbers of genes lost during conversion

jdrnevich commented 6 years ago

Running through your BioC2018 workshop:

> airSE2 <- idMap(airSE, org="hsa", from="ENSEMBL", to="ENTREZID")
'select()' returned 1:many mapping between keys and columns
Excluded 1133 genes without a corresponding to.ID
Encountered 8 from.IDs with >1 corresponding to.ID (a single to.ID was chosen for each of them)

The first message about excluding genes without a to.ID is correct. But the second message is not correct. The number 8 is actually the number of to.IDs with > 1 from.IDs (e.g. multiple ensembl mapping to same entrezid). All from.IDs that map to > 1 to.ID already had the multiple values dropped in the call in .idmap to mapIDs:

> temp1 <- select(org.Hs.eg.db, keys = rownames(airSE), keytype = "ENSEMBL",columns = "ENTREZID")
'select()' returned 1:many mapping between keys and columns
> #'select()' returned 1:many mapping between keys and columns
> temp2 <- mapIds(org.Hs.eg.db, keys = rownames(airSE), keytype = "ENSEMBL",column = "ENTREZID")
'select()' returned 1:many mapping between keys and columns
> dim(temp1)
[1] 13043     2
> # 13043     2
> length(unique(temp1$ENSEMBL))
[1] 12937
> length(temp2)
[1] 12937

The default for mapIds's multiVals argument is "first" so only the first entry is used, which is probably fine. You could try to add the multiVal argument to idMap to push to mapIds, but you'd have to restrict it so that only 1 to.ID was returned. If you want to report the number of from.IDs that had > 1 to.ID, you could add something like:

> y <- mapIds(org.Hs.eg.db, keys = rownames(airSE), keytype = "ENSEMBL",
+        column = "ENTREZID", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> sum(vapply(y, length, numeric(1)) > 1)
[1] 104

Finally, to resolve cases where >1 from.ID map to the same to.ID, your current implementation is to use the first one. However, if the rowData of the summarizedExperiment has p-values, I'd like the option of keeping the from.ID that had the lowest p-value. Thanks!!

lgeistlinger commented 6 years ago

Thanks for reporting + I'll be looking into this!

jdrnevich commented 6 years ago

I just learned about lengths()!! No need for the vapply:

sum(lengths(y) > 1)

lgeistlinger commented 6 years ago

Short follow-up:

I confirm that this is a bug and has been fixed in devel (EnrichmentBrowser 2.11.8) and release (EnrichmentBrowser 2.10.5).

Thanks for providing illustrative code - that made it easy!

> airSE <- idMap(airSE, org="hsa", from="ENSEMBL", to="ENTREZID")
'select()' returned 1:many mapping between keys and columns
Excluded 1133 genes without a corresponding to.ID
Encountered 104 from.IDs with >1 corresponding to.ID (a single to.ID was chosen for each of them)
Encountered 8 to.IDs with >1 from.ID (a single from.ID was chosen for each of them)

lgeistlinger commented 6 years ago

I'm actually also following up on your suggestions on selecting IDs data-driven (eg lowest p-value). This is going to be introduced as a new feature in the devel version these days and will contain the following options:

to can also be a rowData column of se to support user-curated mappings in which conflicts have been resolved manually
additional argument(s) to idMap that allow user-defined strategies for dealing with conficts - instances include selecting ids with lowest p-value, max(abs(fc)), etc ... comparable to the argument FUN in AnnotationDbi::mapIds
synchronization with the behavior of probe2gene conversion - as probe2gene and idMap essentially embody the same problem as you correctly pointed out

Stay tuned.

lgeistlinger commented 6 years ago

Hi @jdrnevich

I have accordingly extended the idMap function in the new devel version 2.11.10:

(already available from Github and presumably available from Bioconductor tomorrow)

https://bioconductor.org/packages/devel/bioc/html/EnrichmentBrowser.html

Let's consider a dummy dataset for demonstration:

# create an expression dataset with 3 genes and 3 samples
se <- makeExampleData("SE", nfeat=3, nsmpl=3)
names(se) <- paste0("ENSG00000000", c("003","005", "419"))

# standard usage as before
mse <- idMap(se, org="hsa", from="ENSEMBL", to="ENTREZID")

New features:

to can also be a rowData column of se to support user-curated mappings in which conflicts have been resolved manually:

# user-defined mapping
rowData(se)$MYID <- c("g1", "g1", "g2")
mse <- idMap(se, to="MYID")

additional argument(s) to idMap that allow user-defined strategies for dealing with conficts - instances include selecting ids with lowest p-value, max(abs(fc)), etc ... comparable to the argument FUN in AnnotationDbi::mapIds

The idMap function has two new arguments:

multi.to: determines how to resolve 1:many mappings, i.e. multiple to.IDs for a single from.ID. This is passed on to the multiVals argument of AnnotationDbi::mapIds and can thus take several pre-defined values, but also the form of a user-defined function. However, note that this requires that a single to.ID is returned for each from.ID. Default is "first", which accordingly returns the first to.ID mapped onto the respective from.ID.
multi.from: determines how to resolve many:1 mappings, i.e. multiple from.IDs mapping to the same to.ID. Pre-defined options include:
- 'first' (Default): returns the first from.ID for each to.ID with multiple from.IDs,
- 'minp': selects the from.ID with minimum p-value (according to the rowData column PVAL of se),
- 'maxfc': selects the from.ID with maximum absolute log2 fold change (according to the rowData column FC of se).

Note that a user-defined function can also be supplied for custom behaviors. This will be applied for each case where there are multiple from.IDs for a single to.ID, and accordingly takes the arguments ids and se. The argument ids corresponds to the multiple from.IDs from which a single ID should be chosen, e.g. via information available in argument se.

## select from.ID with lowest p-value
pcol <- configEBrowser("PVAL.COL")
rowData(se)[[pcol]] <- c(0.001, 0.32, 0.15)
mse <- idMap(se, to="MYID", multi.from="minp")

## ... or using a customized function
maxScore <- function(ids, se)
{
    scores <- rowData(se, use.names=TRUE)[ids, "SCORE"]
    ind <- which.max(scores)
    return(ids[ind])
}
rowData(se)$SCORE <- c(125.7, 33.4, 58.6)
mse <- idMap(se, to="MYID", multi.from=maxScore)

lgeistlinger commented 6 years ago

Note that I have also accordingly refactored probe2gene to be based on idMap. As such all demonstrated features above are also available when resolving eg multiple probes mapping to the same gene.

Demonstrated here for the ALL dataset, where we select the probe with lowest p-value.

library(EnrichmentBrowser)
library(hgu95av2.db)
data(ALL, package="ALL")

# select patients
ind.bs <- grep("^B", ALL$BT)
ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG"))
sset <- intersect(ind.bs, ind.mut)
all.eset <- ALL[, sset]

# DE analysis on probe level
all.eset$GROUP <- ifelse(all.eset$mol.biol == "BCR/ABL", 1, 0)
allSE <- deAna(all.eset)

# convert to gene, keeping the most significantly DE probe
allSE <- probe2gene(allSE, from="PROBEID", to="ENTREZID", multi.from="minp")

jdrnevich commented 6 years ago

Looks great! I'll try them out tomorrow. Thanks so much

lgeistlinger commented 6 years ago

Closing this - feel free to reopen or follow-up if needed.

lgeistlinger / EnrichmentBrowser

idMap/.idmap not giving the correct numbers of genes lost during conversion #6