Closed jdrnevich closed 6 years ago
Thanks for reporting + I'll be looking into this!
I just learned about lengths()!! No need for the vapply:
sum(lengths(y) > 1)
Short follow-up:
I confirm that this is a bug and has been fixed in devel (EnrichmentBrowser 2.11.8) and release (EnrichmentBrowser 2.10.5).
Thanks for providing illustrative code - that made it easy!
> airSE <- idMap(airSE, org="hsa", from="ENSEMBL", to="ENTREZID")
'select()' returned 1:many mapping between keys and columns
Excluded 1133 genes without a corresponding to.ID
Encountered 104 from.IDs with >1 corresponding to.ID (a single to.ID was chosen for each of them)
Encountered 8 to.IDs with >1 from.ID (a single from.ID was chosen for each of them)
I'm actually also following up on your suggestions on selecting IDs data-driven (eg lowest p-value). This is going to be introduced as a new feature in the devel version these days and will contain the following options:
to
can also be a rowData
column of se
to support user-curated mappings in which conflicts have been resolved manuallyidMap
that allow user-defined strategies for dealing with conficts - instances include selecting ids with lowest p-value, max(abs(fc)), etc ... comparable to the argument FUN
in AnnotationDbi::mapIds
probe2gene
conversion - as probe2gene
and idMap
essentially embody the same problem as you correctly pointed out Stay tuned.
Hi @jdrnevich
I have accordingly extended the idMap
function in the new devel version 2.11.10:
(already available from Github and presumably available from Bioconductor tomorrow)
https://bioconductor.org/packages/devel/bioc/html/EnrichmentBrowser.html
Let's consider a dummy dataset for demonstration:
# create an expression dataset with 3 genes and 3 samples
se <- makeExampleData("SE", nfeat=3, nsmpl=3)
names(se) <- paste0("ENSG00000000", c("003","005", "419"))
# standard usage as before
mse <- idMap(se, org="hsa", from="ENSEMBL", to="ENTREZID")
New features:
to
can also be a rowData
column of se
to support user-curated mappings
in which conflicts have been resolved manually:# user-defined mapping
rowData(se)$MYID <- c("g1", "g1", "g2")
mse <- idMap(se, to="MYID")
idMap
that allow user-defined strategies for
dealing with conficts - instances include selecting ids with lowest p-value,
max(abs(fc)), etc ... comparable to the argument FUN
in AnnotationDbi::mapIds
The idMap
function has two new arguments:
multi.to
: determines how to resolve 1:many mappings, i.e. multiple to.IDs
for a single from.ID. This is passed on to the multiVals
argument of
AnnotationDbi::mapIds
and can thus take several pre-defined values, but also
the form of a user-defined function. However, note that this requires that a
single to.ID is returned for each from.ID. Default is "first"
,
which accordingly returns the first to.ID mapped onto the respective from.ID.
multi.from
: determines how to resolve many:1 mappings, i.e. multiple
from.IDs mapping to the same to.ID. Pre-defined options include:
rowData
column PVAL
of se
),rowData
column FC
of se
).Note that a user-defined function can also be supplied for custom behaviors.
This will be applied for each case where there are multiple from.IDs for a
single to.ID, and accordingly takes the arguments ids
and se
.
The argument ids
corresponds to the multiple from.IDs from which a
single ID should be chosen, e.g. via information available in argument
se
.
## select from.ID with lowest p-value
pcol <- configEBrowser("PVAL.COL")
rowData(se)[[pcol]] <- c(0.001, 0.32, 0.15)
mse <- idMap(se, to="MYID", multi.from="minp")
## ... or using a customized function
maxScore <- function(ids, se)
{
scores <- rowData(se, use.names=TRUE)[ids, "SCORE"]
ind <- which.max(scores)
return(ids[ind])
}
rowData(se)$SCORE <- c(125.7, 33.4, 58.6)
mse <- idMap(se, to="MYID", multi.from=maxScore)
Note that I have also accordingly refactored probe2gene
to be based on idMap
.
As such all demonstrated features above are also available when resolving eg
multiple probes mapping to the same gene.
Demonstrated here for the ALL
dataset, where we select the probe with lowest
p-value.
library(EnrichmentBrowser)
library(hgu95av2.db)
data(ALL, package="ALL")
# select patients
ind.bs <- grep("^B", ALL$BT)
ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG"))
sset <- intersect(ind.bs, ind.mut)
all.eset <- ALL[, sset]
# DE analysis on probe level
all.eset$GROUP <- ifelse(all.eset$mol.biol == "BCR/ABL", 1, 0)
allSE <- deAna(all.eset)
# convert to gene, keeping the most significantly DE probe
allSE <- probe2gene(allSE, from="PROBEID", to="ENTREZID", multi.from="minp")
Looks great! I'll try them out tomorrow. Thanks so much
Closing this - feel free to reopen or follow-up if needed.
Running through your BioC2018 workshop:
The first message about excluding genes without a to.ID is correct. But the second message is not correct. The number 8 is actually the number of to.IDs with > 1 from.IDs (e.g. multiple ensembl mapping to same entrezid). All from.IDs that map to > 1 to.ID already had the multiple values dropped in the call in .idmap to mapIDs:
The default for mapIds's multiVals argument is "first" so only the first entry is used, which is probably fine. You could try to add the multiVal argument to idMap to push to mapIds, but you'd have to restrict it so that only 1 to.ID was returned. If you want to report the number of from.IDs that had > 1 to.ID, you could add something like:
Finally, to resolve cases where >1 from.ID map to the same to.ID, your current implementation is to use the first one. However, if the rowData of the summarizedExperiment has p-values, I'd like the option of keeping the from.ID that had the lowest p-value. Thanks!!