nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License

Add deseq2 differential expression analysis #409

Closed: ggabernet closed this issue 3 years ago

ggabernet commented 4 years ago

Hi all,

We have discussed several times on Slack whether it would be a good idea to add a differential expression analysis to the RNAseq pipeline. This issue is meant to formalize that discussion a bit, with pros and cons.

From our experience at QBiC: we have this deseq2 pipeline, which we built to streamline this process as well. I would in no way add the whole pipeline directly to nf-core/RNAseq, as it is very QBiC-specific, but I could adapt and polish the R script that performs the differential expression analysis (and additionally the pathway analysis, if there is interest) for use as part of nf-core/RNAseq, and add its results to the MultiQC report.

The discussion is now open! @drpatelh @apeltzer @ewels

ewels commented 4 years ago

Hooray! I can finally resurrect the issue I started about this topic in 2016 😆 https://github.com/SciLifeLab/NGI-RNAseq/issues/31

apeltzer commented 4 years ago

Providing at least a very basic version of this is not a problem. Of course you can do things incorrectly, and some people will then do things incorrectly (e.g. not using appropriate assumptions about groupings), but that will happen to them either way.

If we put a "big fat" disclaimer on top of the generated results / report / output, saying that this is automatically generated data that should be treated with some critical thinking, but could e.g. serve as a starting point - why not add it?

Regarding the pipeline you mentioned: I think generalizing it should be feasible without having to redo the entire thing - at least renaming certain bits here and there wouldn't require as much effort as restarting from scratch?

The only con I see is that it might be used incorrectly, with people then blaming us here - but we can counter that with proper statements / docs and by referring everyone to the downstream tools.

+1 for the pathways, potentially even GSEA ?

ewels commented 4 years ago

My conclusion years ago was that the safest method was to provide a script instead of results. This could be an Rmarkdown / Jupyter notebook etc., with lots of built-in comments and annotation. The pipeline could auto-fill placeholders in the script, such as filenames and some metadata. But it would require the end user to actually view the code and therefore take more responsibility for the results. DE analysis is also a lot less likely to be routine than the rest of the pipeline, so providing a "starter" script is perhaps more useful in some cases.

The downside is that people have to be able to run the given script, which adds complexity for novice users. But this is kind of by design 😄

ggabernet commented 4 years ago

Providing an Rmarkdown / Jupyter notebook seems like a good idea for the end user. For a bioinformatics facility like ours, we need to provide end results, and it is nice to have this run together with the pipeline. I guess we can just continue using our own pipeline for this purpose, though.

One disadvantage I see for the notebooks is the dependency problem, but I guess we can show how to create a conda environment from the environment.yml file; the R/Python packages would then also need to be added there.
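For illustration, an environment.yml along these lines might look like the following sketch (the name, channels, and package pins here are purely illustrative, not the pipeline's actual dependencies):

```yaml
# Hypothetical example: R/Python packages for a DE notebook, alongside
# the usual bioconda channels. Versions are placeholders.
name: rnaseq-de-notebook
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base=4.1
  - bioconductor-deseq2
  - r-rmarkdown
  - jupyter
```

A user could then recreate the notebook's environment with `conda env create -f environment.yml`.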

ewels commented 4 years ago

Yes - the above comes with the caveat that I last thought about this years ago when it was only us using the pipeline. So there were other considerations, like the fact that we don't routinely collect the metadata required to do the analysis. I'm not saying that it's the best way to go, just noting down my previous thoughts.

What do you think about having this as a second, separate pipeline? Then we avoid adding more complexity to the rnaseq pipeline and it keeps its simple input requirements. This new pipeline could handle the extra metadata required with a sample sheet etc. Further down the line when everything moves to DSL 2 we can always revisit if necessary.

drpatelh commented 4 years ago

I think it would be fairly straightforward to have a "simple" script do the differential analysis. We could just make the grouping assumptions based on the input design file, as I have been doing with the atacseq pipeline: https://github.com/nf-core/atacseq/blob/master/docs/usage.md#--input

Note that this also caters for samples re-sequenced multiple times. The information we can get from this will be enough for most differential analyses, which in this case use this script: https://github.com/nf-core/atacseq/blob/master/bin/featurecounts_deseq2.r

The key with the design input here is to have complete control over the input data and how it is named within the pipeline. That way these sorts of analyses become feasible and the code stays legible 😅
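As a rough sketch of that naming control (the column names and file names below are hypothetical, not the actual atacseq design format), a design file with group and replicate columns can fully determine the sample IDs used inside the pipeline, while collapsing re-sequenced FASTQ files into one replicate-level sample:

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical design file: group + replicate determine the sample naming.
design = """group,replicate,fastq
control,1,ctr_rep1.fq.gz
control,1,ctr_rep1_reseq.fq.gz
control,2,ctr_rep2.fq.gz
treated,1,trt_rep1.fq.gz
"""

# Collapse re-sequenced FASTQ files into one replicate-level sample ID,
# e.g. control_R1, so downstream naming is fully under pipeline control.
samples = defaultdict(list)
for row in csv.DictReader(StringIO(design)):
    sample_id = f"{row['group']}_R{row['replicate']}"
    samples[sample_id].append(row["fastq"])

for sample_id, fastqs in sorted(samples.items()):
    print(sample_id, fastqs)
```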

Agree with the comments about having a big warning with this sort of analysis though: https://github.com/nf-core/atacseq/blob/master/docs/output.md

"This pipeline uses a standardised DESeq2 analysis script to get an idea of the reproducibility within the experiment, and to assess the overall differential binding. Please note that this will not suit every experimental design, and if there are other problems with the experiment then it may not work as well as expected."

Probably not big enough 🤷‍♂️ I also agree with adding pathway analysis and GSEA, and maybe even having a Shiny app to load up the results and work with them interactively? And/or generating fully documented Rmarkdown reports, and/or using Jupyter.

PS: The strategy above has also been working quite well for the chipseq pipeline.

ggabernet commented 4 years ago

It sounds good to me to start this as a separate pipeline for now, and to add it as a sub-workflow to RNAseq in DSL2.

I agree with @drpatelh that it should be kept rather simple by default.

A shiny app to interactively analyze the results is also a good idea. I know there were similar initiatives at QBiC as well, but it was decided that it required too much maintenance from our side. Would you see this as part of nf-core as well?

For a start, though, a nice MultiQC report would maybe suffice.

apeltzer commented 4 years ago

> It sounds good to me to start it as a separate pipeline for now, and adding it as a sub-workflow to RNAseq in DSL2.

Sounds reasonable to me 👍

> A shiny app to interactively analyze the results is also a good idea. I know there were similar initiatives at QBiC as well, but it was decided that it required too much maintenance from our side. Would you see this as part of nf-core as well?

Not really - Shiny apps must be hosted somewhere, e.g. on the public shinyapps instance or on a separate server. Having something that plays nicely with nf-core/rnaseq (and potentially others?) would be cool, but it is a lot of effort to maintain properly! We have something along these lines internally, but a lot of effort went into its development and maintenance, and it only works with one specific pipeline...

I'm also not sure how we would host this ourselves rather than using the services that already exist?

> For a start, though, a nice MultiQC report would maybe suffice.

Yes! And adding some custom extra content would be super nice already 👍

ewels commented 4 years ago

Could we not just provide a Docker container for a Shiny app? But to be fair, it wouldn't surprise me if someone has already made a tool like this somewhere..

apeltzer commented 4 years ago

Quick search already revealed quite a number of them 😓

- http://nasqar.abudhabi.nyu.edu/deseq2shiny/
- https://yanli.shinyapps.io/DEApp/
- https://schultzelab.shinyapps.io/Shiny-Seq/
- http://shiny.imbei.uni-mainz.de:3838/ideal/
- https://infinityloop.shinyapps.io/TCC-GUI/

chlazaris commented 4 years ago

> Hi all,
>
> we discussed several times in slack if it would be a good idea or not to add a differential expression analysis to the RNAseq pipeline. This issue would be to formalize a bit the discussion with pros and cons.
>
> As for our experience at QBiC, we have this deseq2 pipeline as we needed to streamline this process as well. In no way I would directly add the whole pipeline as part of nf-core/RNAseq as it is very QBiC-specific, but I could adapt and polish the R script performing the differential expression analysis (and additionally pathway analysis, if there is interest) to be used as part of nf-core/RNAseq, and add its results to the MultiQC report.
>
> Now discussion is open! @drpatelh @apeltzer @ewels

Totally agree on this @ggabernet. This would be a very nice addition @drpatelh @apeltzer @ewels. It would be nice to perform differential expression based on parameter values specified by the user. Downstream pathway analysis or GSEA would be very nice too. The user should definitely be aware of, and responsible for, the parameter values, inputs, comparisons made, etc. Thanks!

lconde-ucl commented 4 years ago

We have a very basic DGE+GSEA nextflow pipeline (https://github.com/UCL-BLIC/DGE) that we made with the idea of doing quick basic DGE analysis after running the nf-core rnaseq pipeline. It just needs two mandatory arguments:

--inputdir

This specifies the location of the output directory from nf-core/rnaseq. You don't necessarily have to run nf-core/rnaseq; the pipeline just looks for the merged_gene_counts.txt file, so as long as you have an [inputdir]/featureCounts/merged_gene_counts.txt file, it will work.

--metadata

This should be a txt file where the first column contains the sample IDs (matching the ones in the featureCounts file) and the other [1 or more] columns are the conditions for each sample, for example:

SampleID  Status  Levels
s1        ctr     high
s2        ctr     high
s3        ctr     med
s4        case    low
s5        case    low
s6        case    low

By default, the pipeline will run deseq2 on each possible combination of conditions, using a design with all the conditions. For example, for the metadata.txt file above, the pipeline will run the following analysis:

Design: ~ Status + Levels
Comparisons:
    cases vs. controls (status)
    high vs. medium (levels)
    high vs. low (levels)
    medium vs. low (levels)
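That default enumeration of pairwise comparisons can be sketched in a few lines (a hypothetical illustration of the behaviour described above, not the pipeline's actual code):

```python
import itertools

# Hypothetical metadata matching the example above: sample -> condition levels.
metadata = {
    "s1": {"Status": "ctr",  "Levels": "high"},
    "s2": {"Status": "ctr",  "Levels": "high"},
    "s3": {"Status": "ctr",  "Levels": "med"},
    "s4": {"Status": "case", "Levels": "low"},
    "s5": {"Status": "case", "Levels": "low"},
    "s6": {"Status": "case", "Levels": "low"},
}

conditions = ["Status", "Levels"]
design = "~ " + " + ".join(conditions)  # "~ Status + Levels"

# All pairwise level combinations within each condition, as in the default.
comparisons = []
for cond in conditions:
    levels = sorted({meta[cond] for meta in metadata.values()})
    for a, b in itertools.combinations(levels, 2):
        comparisons.append((cond, a, b))

print(design)
print(comparisons)  # 4 comparisons for this metadata, as listed above
```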

That's the default behaviour (running all possible combinations), but you can override it using the --design, --condition, --treatment, and --control arguments to set the specific design/comparisons you are interested in.

For each comparison above, it will run a GSEA analysis using a geneset specified with the --gmt argument; if none is provided, it will use the Hallmarks geneset from MSigDB.

The output of the pipeline is described here: https://github.com/UCL-BLIC/DGE/blob/master/docs/output.md

This is an extremely simple pipeline for basic DGE. When I made it, I wasn't sure I would use it much, to be honest, because there is no one-size-fits-all when it comes to DGE analyses. Anyway, I enjoyed building it when I started using Nextflow, and maybe you'll find some of it useful (if not, just ignore this comment :) )

Regards Lucia

ewels commented 4 years ago

Super nice @lconde-ucl! How does this compare to what you've done / what you have in mind @ggabernet / @drpatelh?

chlazaris commented 4 years ago

Thank you very much @lconde-ucl! It looks very useful. What are the requirements (installed packages and configuration files) for running it on systems other than 'legion'? Thank you!

lconde-ucl commented 4 years ago

Hi @chlazaris, yes, sorry - it cannot be run straight away unless you have access to our clusters (legion and myriad), but you can still use it if you create your own config file. You can do this in different ways; one option is:

1- Add a profile in the nextflow.config file:

profiles {
  YOURPROFILE {
    includeConfig 'conf/YOURPROFILE.config'
  }
}

2- then create the 'conf/YOURPROFILE.config'.
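A minimal conf/YOURPROFILE.config might look something like this sketch (the module names, paths, and the gmx geneset location are placeholders to adapt to your own site):

```groovy
// Sketch of a minimal site config for an SGE cluster.
// All values here are placeholders, not a working UCL configuration.
process {
  executor = 'sge'
  penv = 'smp'
  beforeScript = 'source $HOME/.bashrc; module load blic-modules bioconda'
}

params {
  // Default geneset for the GSEA step (path is a placeholder)
  gmx = '/path/to/hallmarks.gmt'
}
```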

You can take a look at the config files I have there as an example, e.g.: https://github.com/UCL-BLIC/DGE/blob/master/conf/legion.config

That config file is quite basic. The only things to mention are that we have an SGE cluster, so I'm setting executor='sge' and penv='smp', and that all the packages the pipeline needs were available as modules on our cluster, so I'm loading them with "beforeScript = 'source $HOME/.bashrc; module load blic-modules bioconda'" (you will obviously need to change this; perhaps put all the necessary packages into a Singularity image and point to it using "containerPath").

There you will also see a gmx param, which points to the hallmarks geneset that I downloaded from MSigDB and which is used as the default geneset for the GSEA analysis, and some tx2gene (transcript2gene) files for different assemblies, downloaded from iGenomes, that you most likely won't need. I have them there because we have a modified version of the nf-core/rnaseq pipeline that runs kallisto (https://github.com/UCL-BLIC/rnaseq), so the DGE pipeline has an option to use transcript abundance files from kallisto instead of the default gene counts from featureCounts, if desired. If you use that option, you then need the tx2gene files.

Then you run it like this:

nextflow run $path_to_DGE_pipeline -profile YOURPROFILE --inputdir inputdir --metadata metadata.txt --outputfile results

The pipeline itself is also quite simple and has only two processes: one runs DESeq2 and the other runs GSEA.

Sorry for the long comment; I think the above covers the basics of what you would need to do to make it work outside our clusters. Please let me know if it's confusing, and sorry if it is - I didn't think anybody except us would use it, otherwise I would have tried to containerise everything (although I'm not sure you can do that with GSEA, as I believe that, like with GATK, you need to register to be able to download and use it).

I'm attaching some results I got after running this pipeline on a test dataset, specifying --design "CD+Tregs" --condition "Tregs" --treatment "high" --control "ctr" --pval 1e-6 --fc 0.8, in case you want to look at them. The metadata for this dataset looks something like this:

Samples  CD   Tregs
s1       CD4  low
s2       CD8  low
s3       CD4  high
s4       CD8  ctr
s5       CD4  high
[...]

output_userdefineddesign2.zip

ewels commented 4 years ago

@lconde-ucl - Side note, but the way we handle institutional configs is now quite a bit nicer, see https://nf-co.re/usage/configuration. I guess this pipeline was built with a fairly old version of the nf-core template; if you sync it with the latest version (nf-core sync) you'll get all of this functionality for free. Once it's set up this way, anyone can run the pipeline with any nf-core institutional config.

lconde-ucl commented 4 years ago

@ewels Yes, this was done quite a long time ago when I was starting to use Nextflow, so it probably looks a bit amateur :) I will check out nf-core sync, thanks!

ggabernet commented 4 years ago

Hi Lucia @lconde-ucl,

thanks for sharing your pipeline - it looks very nice indeed, and very close to what we want to achieve here, I think. I have some questions regarding the setup:

If that is possible, then in my opinion it is quite similar to the one we have, and it wouldn't matter much which one we use as a starter.

The metadata for our pipeline is also provided via a metadata sheet (TSV file). The sample IDs currently need to match the IDs in the featureCounts table from the RNAseq run, but we could also think of something smarter there.

Here is an example with 2 conditions, but it supports any number of them:

SampleID  condition_treatment  condition_genotype
Sample1   treated              wt
Sample2   control              wt
Sample3   treated              KO
Sample4   control              KO

It handles samples with the same levels in each condition as replicates.

You provide the feature count table via --rawcounts 'path/to/raw_count_table.tsv'.

The linear model for modeling the gene expression can be specified with the --model param, e.g.:

~ condition_treatment + condition_genotype

Then, by default, all pairwise combinations of the levels in each condition are used as contrasts. If you need more complex contrasts, the most straightforward way to provide them is via the --contrast_list param, which expects a TSV file:

factor               numerator  denominator
condition_treatment  treated    control
condition_genotype   KO         WT

But it is also possible to input more complex contrasts via the --contrast_matrix param, in which interaction terms can also be provided in a TSV file. E.g.:

coefficient                  treatment_treated_vs_control  genotype_KO_vs_WT
intercept                    0                             0
condition_treatment treated  1                             0
condition_genotype KO        0                             1

LogFC thresholds and p-values can also be provided as parameters to set the thresholds for calling DEGs. And there is even a --batch_effect option to correct for batch effects if needed.

The pathway analysis is performed with gprofiler2 and currently supports mouse and human, which can be specified via --species.

The rest of the params are QBiC-related things for our reporting (we generate a report from an Rmarkdown template), which I would, of course, trim out and adapt for MultiQC.

lconde-ucl commented 4 years ago

Hi Gisela @ggabernet, thanks - yes, the pipelines are very similar, although I think ours is a bit simpler because it does not accept interactions at the moment. In answer to your questions:

1 - Yes, it accepts any number of conditions. By default, if the user does not specify a design, it will use a design with all the conditions (design = ~ condition1 + condition2 + condition3 + ...) and, like yours, all pairwise combinations of the levels in each condition are used as contrasts. I personally would not use a design like that with more than 3 or 4 conditions, but in any case, yes, it's possible. And if the user is interested in just one specific design/contrast, that can also be specified via the --design parameter.

2 - Yes, it's also possible to have several levels per condition, but it does NOT support interactions because the code uses the original ("normal") DESeq2 shrinkage estimator (resNorm <- lfcShrink(dds, contrast = contrast, res = res, type = "normal")), which I believe does not work for designs with interactions. This could be changed (for example, by using a different shrinkage method when the user specifies a design with interactions), but at the moment it will not allow it.

I think you already have a great pipeline to start from, but if there is anything in this pipeline that you think might be useful and want to reuse, that would be great. I did this some time ago and didn't make much use of it, so I'm glad to see that something similar (and better!) might be coming soon!

Best Lucia

ggabernet commented 4 years ago

Sure, thanks for sharing your workflow and all the information - we can certainly mix the best of both pipelines, and it's great to see a second implementation!

We'll stay in contact, then, regarding this pipeline :)

drpatelh commented 4 years ago

That mini-workflow does look really cool @lconde-ucl 😎 I'm cc'ing @macroscian here, who is our in-house stats guru. He has written a similar script for use across our group, with the idea of generalising the provision of designs for performing analysis with DESeq2. It would be really cool if we could come to a consensus on how to do this sort of analysis, and generate some nice documentation and downstream assessments too.

Stefan in our group has also been working on a web-based exploratory tool for differential analysis for a while now. I can't find his GitHub handle but can ask him to comment here too.

gavin-kelly-1 commented 4 years ago

Just a quick comment. At the Crick we've developed a fairly concise way of specifying the models and contrasts that we'd like to run on an expression dataset. For one-way designs there's little harm in doing all pairwise comparisons by default (although even there, I'd want to be able to do an LRT test across the levels as well), but we mainly see more complex designs than that, at which point it's hard to get away from some human specification of the hypotheses.

For example, in a 2x3 design we get asked for any of:

If we try to automate 'do everything', it gets combinatorially confusing for the scientist; and if we try to second-guess the 'obvious' thing, we find that the scientist will assume it is the only thing possible. Which is why I've basically abstracted out the 'design' and 'reduced' parts of the DESeq fitting, and the 'contrast' argument of the results, so we can choose as few or as many of the above hypotheses in our analysis (ideally when the experiment is being designed).

I've mainly been doing this with a skeleton script containing boilerplate for the different options above, which can be commented out or expanded as required, but it wouldn't be much work to make the model/contrast specification another dependency of the pipeline. The reason we keep it separate is that even if we pre-register an analysis plan, we quite often adapt it to take account of unforeseen circumstances.

The 'API' that abstracts the DESeq functionality is an R list - it would probably be possible to put it into a more markup-language-like format and parse it into R, but for the moment an example is:

mdlList <- list(
  "Standard" = list(
    design = ~ treatment,
    comparison = list(
      # Standard contrasts: c(factor, numerator, denominator)
      "DPN" = c("treatment", "DPN", "Control"),
      "OHT" = c("treatment", "OHT", "Control")
    )
  ),
  "Patient Adjusted" = list(
    design = ~ patient + treatment,
    comparison = list(
      # A numeric combination of coefficients, via listValues
      "Treated" = list(c("treatment_DPN_vs_Control", "treatment_OHT_vs_Control"), listValues = c(.5, -1)),
      "DPN" = c("treatment", "DPN", "Control"),
      "OHT" = c("treatment", "OHT", "Control"),
      # A formula here denotes an LRT against this reduced model
      "Patient" = ~treatment
    )
  )
)

So yes, it still demands an understanding of contrasts, model formulae, etc., but without wanting to sound too gate-keeper-ish, I think that for any experiment beyond a one-way design this is always going to be the case.
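The driver pattern here - a plain data structure of designs and contrasts that a generic fitting loop walks over - is language-agnostic. A simplified Python sketch of the spec above (omitting the listValues combination and the LRT entry for brevity; names mirror the R example):

```python
# Simplified model/contrast specification as plain data, mirroring mdlList.
model_list = {
    "Standard": {
        "design": "~ treatment",
        "comparison": {
            "DPN": ("treatment", "DPN", "Control"),
            "OHT": ("treatment", "OHT", "Control"),
        },
    },
    "Patient Adjusted": {
        "design": "~ patient + treatment",
        "comparison": {
            "DPN": ("treatment", "DPN", "Control"),
            "OHT": ("treatment", "OHT", "Control"),
        },
    },
}

# The analysis driver stays generic: one loop fits each design and pulls
# out every requested contrast.
planned = [
    (model_name, spec["design"], cmp_name, contrast)
    for model_name, spec in model_list.items()
    for cmp_name, contrast in spec["comparison"].items()
]
for item in planned:
    print(item)
```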

chlazaris commented 4 years ago

Hello everyone. How can I track the progress of adding this functionality to the existing RNA-seq pipeline? Do you have any timeline or estimate for integrating these features? Thank you @ewels @ggabernet @lconde-ucl @drpatelh

ewels commented 4 years ago

@chlazaris - from the above discussion we decided to make this into a new pipeline, so it will never be added to the existing RNA-seq pipeline. The timeline for the new pipeline is very difficult to estimate as it depends entirely on who has time to build it and when. But I suspect that it will be quite a while yet.

Phil

chlazaris commented 4 years ago

Thank you @ewels. Please, keep me posted on this one. It will be a very helpful addition. Thank you! @drpatelh @ewels @ggabernet @lconde-ucl

drpatelh commented 3 years ago

Given that we have decided to create a stand-alone workflow for this rather than incorporating it directly within nf-core/rnaseq, I have closed this issue for now. However, I have pinned it in the issue section so it remains findable for those looking for this functionality.