nf-core / differentialabundance

Differential abundance analysis for feature/ observation matrices from platforms such as RNA-seq
https://nf-co.re/differentialabundance
MIT License
64 stars 37 forks source link

Continuous covariates #247

Open BEFH opened 8 months ago

BEFH commented 8 months ago

Description of feature

Many users including us would like to be able to include continuous covariates for the differential expression tests.

They could be added to the contrasts sheet like so:

id,variable,reference,target,blocking,continuous
condition_control_treated,condition,control,treated,,age;rin
condition_control_treated_blockrep,condition,control,treated,replicate;batch,

Then in the modules/nf-core/deseq2/differential/templates/deseq_de.R:

At the beginning:

opt <- list(
    output_prefix = ifelse('$task.ext.prefix' == 'null', '$meta.id', '$task.ext.prefix'),
    count_file = '$counts',
    sample_file = '$samplesheet',
    contrast_variable = '$contrast_variable',
    reference_level = '$reference',
    target_level = '$target',
    blocking_variables = NULL,
    continuous_variables = NULL,
    control_genes_file = '$control_genes_file',
    transcript_lengths_file = '$transcript_lengths_file',
    sizefactors_from_controls = FALSE,
    gene_id_col = "gene_id",
    sample_id_col = "experiment_accession",
    subset_to_contrast_samples = FALSE,
    exclude_samples_col = NULL,
    exclude_samples_values = NULL,
    test = "Wald",
    fit_type = "parametric",
    sf_type = 'ratio',
    min_replicates_for_replace = 7,
    use_t = FALSE,
    lfc_threshold = 0,
    alt_hypothesis = 'greaterAbs',
    independent_filtering = TRUE,
    p_adjust_method = 'BH',
    alpha = 0.1,
    minmu = 0.5,
    vs_method = 'vst', # 'rlog', 'vst', or 'rlog,vst'
    shrink_lfc = TRUE,
    cores = 1,
    vs_blind = TRUE,
    vst_nsub = 1000
)

Then for building the variable list:

contrast_variable <- make.names(opt\$contrast_variable)
blocking.vars <- c()

if (!contrast_variable %in% colnames(sample.sheet)) {
    stop(
        paste0(
        'Chosen contrast variable \"',
        contrast_variable,
        '\" not in sample sheet'
        )
    )
} else if (any(!c(opt\$reflevel, opt\$treatlevel) %in% sample.sheet[[contrast_variable]])) {
    stop(
        paste(
        'Please choose reference and treatment levels that are present in the',
        contrast_variable,
        'column of the sample sheet'
        )
    )
} else if (is_valid_string(opt\$blocking_variables)) {
    blocking.vars = make.names(unlist(strsplit(opt\$blocking_variables, split = ';')))
    if (!all(blocking.vars %in% colnames(sample.sheet))) {
        missing_block <- paste(blocking.vars[! blocking.vars %in% colnames(sample.sheet)], collapse = ',')
        stop(
            paste(
                'Blocking variables', missing_block,
                'do not correspond to sample sheet columns.'
            )
        )
    }
}

if (is_valid_string(opt\$continuous_variables)) {
    continuous.vars = make.names(unlist(strsplit(opt\$continuous_variables, split = ';')))
    if (!all(continuous.vars %in% colnames(sample.sheet))) {
        missing_continuous <- paste(continuous.vars[! continuous.vars %in% colnames(sample.sheet)], collapse = ',')
        stop(
            paste(
                'Continuous variables', missing_continuous,
                'do not correspond to sample sheet columns.'
            )
        )
    }
}

And finally for building the model:

# Now specify the model. Use cell-means style so we can be explicit with the
# contrasts

model <- '~ 0'

if (is_valid_string(opt\$blocking_variables)) {
    model <- paste(model, paste(blocking.vars, collapse = ' + '), sep=' + ')
}

if (is_valid_string(opt\$continuous_variables)) {
    model <- paste(model, paste(continuous.vars, collapse = ' + '), sep=' + ')
}

# Make sure all the appropriate variables are factors

for (v in c(blocking.vars, contrast_variable)) {
    sample.sheet[[v]] <- as.factor(sample.sheet[[v]])
}

for (v in continuous.vars) {
    sample.sheet[[v]] <- as.numeric(sample.sheet[[v]])
}

# Variable of interest goes last, see
# https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs

model <- paste(model, contrast_variable, sep = ' + ')

However, there is still clearly work to be done in the Nextflow pipeline and documentation, and I don't know what to change there. If anyone could provide guidance on what to edit and what testing to do, I could do a pull request, but I don't know what to do at this point.

WackerO commented 8 months ago

Hey there @BEFH, it would indeed be great if you could open some PRs for this!

The first thing would be to change the deseq2 module.

I think if possible, you should also provide a test for the new functionality. The current deseq2 tests are done with the mus_musculus dataset, you can find the links to the files here. If the samplesheet does not contain any info that is suitable to use as a covariate, you could check the test-datasets repo; maybe there is some dataset that already works, otherwise you can possibly provide something suitable?

BEFH commented 8 months ago

I think there needs to be some discussion over how to do this. My suggestion of an additional contrast sheet column is one way to do it, or we could automatically detect if the column looks like a float (double in R) before turning it into a factor and instead make sure it's coerced to a double, or there could be a pipeline option to list continuous variables (probably preferable).

WackerO commented 8 months ago

Hmm, I'm not sure if adding another param would be the best solution. As this concerns the contrasts, I think probably your other two suggestions make more sense. The check for numeric values is the easiest solution I think (but must make sure R understands negatives and decimals; yml sometimes struggles with that). The additional column in the contrasts file is of course more specific, meaning that it would allow people to also have factor variables that are numeric...However, @pinin4fjords, I think for this one would also need to change the validator, no?

pinin4fjords commented 8 months ago

This really just needs to be done in more or less the same way as the blocking. So a column for continuous covariates in the contrasts file. Yes, we'd have to make sure the validator understood about the new column, and verified that the listed covariates were present etc.

There's no reason the existing test data can't be adapted, we can add a new fake continuous covariate column- the test data doesn't need to have scientific validity.

BEFH commented 8 months ago

My one concern with the design here is whether DESeq2 uses a Type I SS or a Type II SS. If it uses a Type II model, then linear covariates and blocking factors can be in different columns with no issues because order doesn't matter. If it's type I, I don't actually know.

WackerO commented 8 months ago

I'm not quite sure what you mean by Type I or II SS, but according to the DESeq2 vignette, the order of variables in the design does not matter (search for order of, second result).

ctuni commented 2 months ago

Hello! I came across this issue because I was searching for exactly this implementation, and I have a question before opening some PRs: I have not checked all the datasets available yet (the mus_musculus samplesheet does not have a variable that could be used as continuous, but I'll keep looking). If the strategy to follow is similar to the "blocking" logic, the contrasts file of the test dataset, should also include a new column with the "continuous" field right? Like this:

id,variable,reference,target,blocking,continuous
treatment_mCherry_hND6_,treatment,mCherry,hND6,,
treatment_mCherry_hND6_sample_number,treatment,mCherry,hND6,sample_number,
treatment_mCherry_hND6_sample_number,treatment,mCherry,hND6,,age

And as @pinin4fjords mentioned, the corresponding samplesheet should have the corresponding column filled with integer data. This means that implementing the use of continuous variables in this pipeline needs at least a PR on the test-datasets repo, another one on the modules repo to change the DESeq2 module and its tests, and finally another one on this repo, to add the new "continuous" option on the modules.conf file (and maybe other changes).

Let me know if I got the logic correctly and I can start with some PRs :) Thanks!

WackerO commented 2 months ago

Hey @ctuni, yes, that sounds about right. If you find a suitable dataset in the test-datasets repo, I think you'll only need to open a PR for the contrasts file, otherwise maybe you can modify an existing dataset on there (or provide an own dataset). And then indeed next is the modules PR and finally the change here in the pipeline :)