Closed: millerh1 closed this issue 3 years ago
Hi,
Thanks for the message. We provide all the symbol information returned by AnnotationDbi::mapIds(), as shown at https://github.com/leekgroup/recount/blob/master/R/reproduce_ranges.R#L111-L114, which lets users choose which symbol they want to use for each gene. You can coerce it back to a regular character vector by, for example, choosing the first symbol for each gene with rowRanges(rse)$symbol <- sapply(rowRanges(rse)$symbol, "[", 1), which is what AnnotationDbi::mapIds() does by default (multiVals = "first") when you want just 1 symbol per gene (it has several other options).
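As a rough, self-contained illustration of that coercion, here is a minimal sketch on a toy CharacterList. The gene IDs and symbols are made up for the example and are not taken from recount; only the sapply(..., "[", 1) idiom comes from the text above.

```r
library(IRanges)

# Toy stand-in for rowRanges(rse)$symbol: one element per gene, each holding
# zero or more symbols. IDs and symbols here are hypothetical.
symbol <- CharacterList(
    geneA = c("SYM1", "SYM1B"),  # gene with two symbols
    geneB = "SYM2",              # gene with one symbol
    geneC = NA_character_        # gene with no known symbol
)

# Keep only the first symbol per gene, giving a plain character vector
first_symbol <- sapply(symbol, "[", 1)

print(class(symbol))
print(first_symbol)
```

Picking the first element is only one possible choice; if a different symbol per gene is preferred, the same pattern works with another selection function.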
> head(sapply(rowRanges(rse_gene)$symbol, "[", 1))
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
       "TSPAN6"          "TNMD"          "DPM1"         "SCYL3"      "C1orf112"
ENSG00000000938
          "FGR"
I adapted your code a bit (note the res$title vs rse$title difference too), and both versions run at comparable speeds on my laptop, which is slower than your Ubuntu machine. This is based on 2 timed tests for each, which is not enough for proper benchmarking, but we just wanted a general idea.
library(recount)
library(DESeq2)
library(SummarizedExperiment)

## Example gene-level RSE object shipped with recount
rse_gene <- rse_gene_SRP009615

## Timer with ranges: coerce symbol to character
system.time({
    rse <- SummarizedExperiment(
        assays = assays(rse_gene),
        colData = colData(rse_gene),
        rowRanges = rowRanges(rse_gene)
    )
    ## Derive the condition from the sample title
    rse$condition <- gsub(rse$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
    ## Keep only the first symbol per gene
    rowRanges(rse)$symbol <- sapply(rowRanges(rse)$symbol, "[", 1)
    dds <- DESeqDataSet(rse, design = ~condition)
    dds <- DESeq(dds)
})

## Timer without ranges
system.time({
    rse <- SummarizedExperiment(
        assays = assays(rse_gene),
        colData = colData(rse_gene)
        # rowRanges = rowRanges(rse_gene)
    )
    rse$condition <- gsub(rse$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
    dds <- DESeqDataSet(rse, design = ~condition)
    dds <- DESeq(dds)
})
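Since two timed runs are too few for a real benchmark, a small helper along these lines could repeat a timing and summarise it. This is only a sketch in base R: time_runs() is a hypothetical helper name, and the workload in the usage example is a cheap placeholder rather than the DESeq2 pipeline.

```r
# Hypothetical helper: run a zero-argument function `f` `n` times and
# summarise the elapsed wall-clock time (in seconds) of each run.
time_runs <- function(f, n = 5) {
    elapsed <- vapply(
        seq_len(n),
        function(i) system.time(f())[["elapsed"]],
        numeric(1)
    )
    c(median = median(elapsed), min = min(elapsed), max = max(elapsed))
}

# Usage with a placeholder workload; swap in the code being compared
timing <- time_runs(function() sum(sqrt(seq_len(1e5))), n = 3)
print(timing)
```

Wrapping each of the two system.time() bodies in a function and passing it to such a helper would give a somewhat more stable comparison than a single run.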
So here it's not really a recount bug. It's just a matter of how you use the information provided and of us providing more than you might need.
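For anyone who wants to check what their own object carries, the sketch below finds list-like metadata columns on a toy GRanges and then either simplifies or drops them. The ranges, gene IDs and symbols are invented for the example; with a real object, rowRanges(rse) would take the place of gr.

```r
library(GenomicRanges)

# Toy stand-in for rowRanges(rse); all values below are hypothetical
gr <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(1, 101, 201), width = 50)
)
gr$gene_id <- c("geneA", "geneB", "geneC")
gr$symbol <- CharacterList(c("SYM1", "SYM1B"), "SYM2", NA_character_)

# Which metadata columns are list-like rather than plain vectors?
is_list_col <- vapply(
    as.list(mcols(gr)),
    function(col) is(col, "List"),
    logical(1)
)
print(is_list_col)

# Option 1: keep the column but reduce it to the first symbol per gene
gr_first <- gr
gr_first$symbol <- sapply(gr_first$symbol, "[", 1)

# Option 2: drop the column entirely if it is not needed downstream
gr_dropped <- gr
gr_dropped$symbol <- NULL
```

Option 1 keeps a symbol available for labelling results, while option 2 is the simpler choice when only the gene IDs are needed.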
Best, Leo
Thank you for the update! I will do this in the future.
Hello,
I am usually a happy camper with recount2 and very thankful for all the work you all have put into this tool! That being said, I recently noticed a bug in which using the RangedSummarizedExperiment objects from recount in R causes severe slowdowns. I have tested this on multiple machines now and find that it only happens with the RSE objects from recount and not with the typical seq datasets that I analyze.
After a little while of digging into the objects, I noticed that the slowdown appears to be caused by the rowRanges of the recount2 objects, specifically the symbol column. Please see my example here:
And here is the output of running this in the console:
As you can see, with the rowRanges included, the code took ~25x longer to finish running. However, simply NULLing the symbol column of the rowRanges was able to prevent this slowdown. I think the issue is that the symbol column was a CharacterList, which may perform inefficiently for these applications. Anyways, hopefully this helps -- thanks again for all the work you and your team do!!
Session info: