hbc / bcbioRNASeq

R package for bcbio RNA-seq analysis.
https://bioinformatics.sph.harvard.edu/bcbioRNASeq
GNU Affero General Public License v3.0
58 stars 21 forks source link

Adding multible bcbio run in bcbio.rda object #141

Closed kokyriakidis closed 3 years ago

kokyriakidis commented 4 years ago

@mjsteinbaugh

Having 199 samples to process and limited amount of storage 6TB, I would like to know if I can run bcbio several times and load and merge all bcbio runs in a single rda file?

mjsteinbaugh commented 4 years ago

Yes, there is a way to do this that works with DESeq2. Start your bcbio runs in batches, and I'll work on adding an example to the bcbioRNASeq vignette. In the meantime, here's a rough draft of how you can do it:

library(bcbioRNASeq)
library(DESeq2)

## This will return a RangedSummarizedExperiment.
## Repeat stepwise in a loop for each of your bcbioRNASeq runs.
rse_c <- combine(bcb_1, bcb_2)
## You may need to round your counts to integer values before this step works (see the minimal example below).
dds <- DESeqDataSet(rse_c)

The code that handles this is defined in my basejump package, (see combine() for details):

selectMethod(
    f = "combine",
    signature = signature(
        x = "SummarizedExperiment",
        y = "SummarizedExperiment"
    )
)
mjsteinbaugh commented 4 years ago

I'll work on seeing if I can improve the combine() method to support an unlimited number of objects. Currently it only works for merging 2 datasets at a time.

kokyriakidis commented 4 years ago

Thank you very much for the info!

I face another issue pointed out here: https://github.com/bcbio/bcbio-nextgen/issues/3033

Maybe do you have any clue?

mjsteinbaugh commented 4 years ago

Here's a minimal example that should work that you can test out:

library(bcbioRNASeq)
library(DESeq2)

data(bcb, package = "bcbioRNASeq")

bcb_x <- bcb
colnames(bcb_x) <- paste0(colnames(bcb_x), "_x")

bcb_y <- bcb
colnames(bcb_y) <- paste0(colnames(bcb_y), "_y")

rse_c <- combine(x = bcb_x, y = bcb_y)

class(rse_c)
# [1] "RangedSummarizedExperiment"
# attr(,"package")
# [1] "SummarizedExperiment"

dim(rse_c)
# [1] 100  12

# DESeq2 currently only supports integer counts.
counts <- counts(rse_c)
counts <- round(counts, digits = 0L)
mode(counts) <- "integer"
counts(rse_c) <- counts

dds <- DESeqDataSet(rse_c, design = ~ 1L)
class(dds)
# [1] "DESeqDataSet"
# attr(,"package")
# [1] "DESeq2"

# Next set your design formula with `design()` and eventually run `DESeq()`.
kokyriakidis commented 4 years ago

Do I lose something if I add the option tools_off: [upload_alignment]? Does bcbioRNASeq uses the bam files somehow?

mjsteinbaugh commented 4 years ago

@kokyriakidis No the bcbioRNASeq R package doesn't parse the BAM files currently. I tend to only use those for creating an IGV session.

kokyriakidis commented 4 years ago

@mjsteinbaugh I see that bcbioRNAseq has 3 templates QC, DE. Functional. WIll running fastrna-seq mode makes me be able to run only DE and Functional templates? Running in RNA-seq mode will just give info for the QC template, or it helps in the DE and Functional templates also?

mjsteinbaugh commented 4 years ago

Correct, the QC template only works for bcbio runs with alignment via STAR or HISAT2, since this outputs additional QC metrics. You can see these metrics in the colData() of the object, when applicable. For differential expression, you're good to go. Just run this step:

# Assuming your dataset is named "bcbioRNASeq" here.
dds <- as(bcbioRNASeq, "DESeqDataSet")
mjsteinbaugh commented 3 years ago

@kokyriakidis I'm going to take a stab at improved support for combining bcbio-nextgen runs this month in bcbioRNASeq. I'll report back with an update.

mjsteinbaugh commented 3 years ago

Closing this issue for now, as this feature request has been added to the TODO list. Will try to add this in a future update.