Closed ggabernet closed 3 years ago
Hooray! I can finally resurrect the issue I started about this topic in 2016 😆 https://github.com/SciLifeLab/NGI-RNAseq/issues/31
Providing at least a very basic version of this is not a problem. Of course you can do things incorrectly, and some people might then do things incorrectly (e.g. not using appropriate assumptions on groupings etc.), but that will happen to them either way.
If we put a "big fat" disclaimer on top of the generated results / report / output, stating that this is automatically generated output that should be treated with some critical thinking but could e.g. serve as a starting point - why not add it?
Regarding the pipeline you mentioned: I think generalizing this should be feasible without having to redo the entire thing - at least renaming certain bits here and there wouldn't require as much effort as restarting from scratch.
The only con I see is that it might be used incorrectly, with the blame then landing on us - but we can counter that with proper statements / docs and by referring everyone to the downstream tools.
+1 for the pathways, potentially even GSEA ?
My conclusion years ago was that the safest method was to provide a script instead of results. This could be an Rmarkdown / Jupyter notebook etc., with lots of built-in comments and annotation. The pipeline could auto-fill placeholders in the script, such as filenames and some metadata. But it would require the end user to actually view the code and therefore take more responsibility for the results. DE analysis is also a lot less likely to be routine than the rest of the pipeline, so providing a "starter" script is perhaps more useful in some cases.
The downside is that people have to be able to run the given script, which adds complexity for novice users. But this is kind of by design 😄
Providing an Rmarkdown / Jupyter notebook seems like a good idea for the end user. For a bioinformatics facility like ours, we need to provide end results, and it is nice to have it run together with the pipeline. I guess we can just continue using our own pipeline for this purpose, though.
One disadvantage I see for the notebooks is the dependency problem, but I guess we can show how to create a conda environment from the environment.yml file; the R/Python packages would then also need to be added there.
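For example, a minimal environment.yml along these lines could capture the notebook's dependencies (the environment name and package list below are purely illustrative, not an agreed set):

```
# Illustrative only - the actual package list would depend on the notebook
name: dge-notebook
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base
  - bioconductor-deseq2
  - r-ggplot2
  - jupyter
```

Users could then create the environment with `conda env create -f environment.yml` before running the notebook.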
Yes - the above comes with the caveat that I last thought about this years ago when it was only us using the pipeline. So there were other considerations, like the fact that we don't routinely collect the metadata required to do the analysis. I'm not saying that it's the best way to go, just noting down my previous thoughts.
What do you think about having this as a second, separate pipeline? Then we avoid adding more complexity to the rnaseq pipeline and it keeps its simple input requirements. This new pipeline could handle the extra metadata required with a sample sheet etc. Further down the line when everything moves to DSL 2 we can always revisit if necessary.
I think it would be fairly straightforward to have a "simple" script to do the differential analysis. We could just make the grouping assumptions based on the input design file, as I have been doing with the atacseq pipeline:
https://github.com/nf-core/atacseq/blob/master/docs/usage.md#--input
Note this also caters for samples re-sequenced multiple times. The information we can get from this will be enough for most differential analyses which in this case uses this script: https://github.com/nf-core/atacseq/blob/master/bin/featurecounts_deseq2.r
The key with the design input here is to have complete control over the input data and how it is named within the pipeline. That way it makes these sorts of analyses possible and the code legible 😅
Agree with the comments about having a big warning with this sort of analysis though: https://github.com/nf-core/atacseq/blob/master/docs/output.md
"This pipeline uses a standardised DESeq2 analysis script to get an idea of the reproducibility within the experiment, and to assess the overall differential binding. Please note that this will not suit every experimental design, and if there are other problems with the experiment then it may not work as well as expected."
Probably not big enough 🤷♂️ Also agree with adding pathway analysis, GSEA, and maybe even having a Shiny app to load up the results and work with them interactively, and/or generating fully documented Rmarkdown reports, and/or using Jupyter.
PS: The strategy above has also been working quite well for the chipseq pipeline.
It sounds good to me to start this as a separate pipeline for now, and to add it as a sub-workflow to RNAseq in DSL2.
I agree with @drpatelh that it should be kept rather simple by default.
A shiny app to interactively analyze the results is also a good idea. I know there were similar initiatives at QBiC as well, but it was decided that it required too much maintenance from our side. Would you see this as part of nf-core as well?
For a start, though, a nice MultiQC report would maybe suffice.
> It sounds good to me to start it as a separate pipeline for now, and adding it as a sub-workflow to RNAseq in DSL2.
Sounds reasonable to me 👍
> A shiny app to interactively analyze the results is also a good idea. I know there were similar initiatives at QBiC as well, but it was decided that it required too much maintenance from our side. Would you see this as part of nf-core as well?
Not really - Shiny Apps must be hosted somewhere, e.g. on either the public shinyapps instance or a separate server. Having something that plays nicely with nf-core/rnaseq (and others potentially?) would be cool, but it is a lot of effort to maintain it properly! We have something along these lines internally, but a lot of effort went into the development and maintenance and it only works with a special pipeline...
Also, I'm not sure how we would host this ourselves rather than using the existing services already available.
> For a start, though, a nice MultiQC report would maybe suffice.
Yes! And adding some custom extra content would be super nice already 👍
Could we not just provide a docker container for a shiny app? But to be fair, it wouldn’t surprise me if someone else has made a tool like this somewhere already..
Quick search already revealed quite a number of them 😓
http://nasqar.abudhabi.nyu.edu/deseq2shiny/ https://yanli.shinyapps.io/DEApp/ https://schultzelab.shinyapps.io/Shiny-Seq/ http://shiny.imbei.uni-mainz.de:3838/ideal/ https://infinityloop.shinyapps.io/TCC-GUI/
Hi all,
we have discussed several times on Slack whether or not it would be a good idea to add a differential expression analysis to the RNAseq pipeline. This issue is to formalize that discussion a bit, with pros and cons.
As for our experience at QBiC, we have this deseq2 pipeline, as we needed to streamline this process as well. I would in no way directly add the whole pipeline as part of nf-core/RNAseq, as it is very QBiC-specific, but I could adapt and polish the R script that performs the differential expression analysis (and additionally the pathway analysis, if there is interest) to be used as part of nf-core/RNAseq, and add its results to the MultiQC report.
Now the discussion is open! @drpatelh @apeltzer @ewels
Totally agree on this @ggabernet. This would be a very nice addition @drpatelh @apeltzer @ewels. It would be nice to perform differential expression based on parameter values specified by the user. Downstream pathway analysis or GSEA would be very nice too. The user should definitely be aware and responsible for parameter values, inputs, comparisons made, etc. Thanks!
We have a very basic DGE+GSEA nextflow pipeline (https://github.com/UCL-BLIC/DGE) that we made with the idea of doing quick basic DGE analysis after running the nf-core rnaseq pipeline. It just needs two mandatory arguments:
--inputdir
This is used to specify the location of the outputdir from nf-core/rnaseq. You don't necessarily have to run nf-core/rnaseq; it just looks for the merged_gene_counts.txt file, so as long as you have an [inputdir]/featureCounts/merged_gene_counts.txt file, it will work.
--metadata
This should be a txt file where the first column contains the sample IDs (matching the ones in the featureCounts file), and the other [1 or more] columns are the conditions for each sample, for example:
SampleID Status Levels
s1 ctr high
s2 ctr high
s3 ctr med
s4 case low
s5 case low
s6 case low
By default, the pipeline will run deseq2 on each possible combination of conditions, using a design with all the conditions. For example, for the metadata.txt file above, the pipeline will run the following analysis:
Design: ~ Status + Levels
Comparisons:
cases vs. controls (status)
high vs. medium (levels)
high vs. low (levels)
medium vs. low (levels)
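That default behaviour (every pairwise combination of levels, per condition) can be sketched in a few lines. This is an illustrative Python snippet, not the pipeline's actual code:

```python
from itertools import combinations

# Example metadata matching the table above: sample ID -> condition levels
metadata = {
    "s1": {"Status": "ctr",  "Levels": "high"},
    "s2": {"Status": "ctr",  "Levels": "high"},
    "s3": {"Status": "ctr",  "Levels": "med"},
    "s4": {"Status": "case", "Levels": "low"},
    "s5": {"Status": "case", "Levels": "low"},
    "s6": {"Status": "case", "Levels": "low"},
}

def default_contrasts(metadata):
    """For each condition column, every pair of levels becomes one contrast."""
    conditions = sorted({c for conds in metadata.values() for c in conds})
    contrasts = []
    for condition in conditions:
        levels = sorted({conds[condition] for conds in metadata.values()})
        for a, b in combinations(levels, 2):
            contrasts.append((condition, a, b))
    return contrasts

for contrast in default_contrasts(metadata):
    print(contrast)
```

For the example metadata this yields exactly the four comparisons listed above: one for Status (case vs. ctr) and three for Levels (high/med, high/low, med/low).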
That's the default behaviour (running all possible combinations), but you can override this by using the --design, --condition, --treatment, and --control arguments to set the specific design/comparisons that you are interested in.
For each comparison above, it will run a GSEA analysis using a geneset specified with the --gmt argument; if not provided, it will use the Hallmarks from MSigDB.
The output of the pipeline is described here: https://github.com/UCL-BLIC/DGE/blob/master/docs/output.md
This is an extremely simple pipeline for basic DGE. When I made it, I wasn't sure I would use it much, to be honest, because there is no one-size-fits-all when it comes to DGE analyses. But anyway, I enjoyed the process of building it when I started using Nextflow, and maybe you'll find some of it useful (if not, just ignore this comment :) )
Regards Lucia
Super nice @lconde-ucl! How does this compare to what you've done / what you have in mind @ggabernet / @drpatelh?
Thank you very much @lconde-ucl! It looks very useful. What are the requirements (installed packages and configuration files) to run it on systems other than 'legion'? Thank you!
Hi @chlazaris, yes, sorry, it cannot be run straight away unless you have access to our clusters (legion and myriad), but you can still use it if you create your own config file. You can do this in different ways; one option is:
1- Add a profile in the nextflow.config file:
profiles {
YOURPROFILE {
includeConfig 'conf/YOURPROFILE.config'
}
}
2- then create the 'conf/YOURPROFILE.config'.
You can take a look at the config files I have there as an example, e.g.: https://github.com/UCL-BLIC/DGE/blob/master/conf/legion.config
That config file is quite basic. The only thing to mention is that we have an SGE cluster, so I'm setting executor='sge' and penv = 'smp', and all the packages the pipeline needs were available as modules on our cluster, so I'm loading them using "beforeScript = 'source $HOME/.bashrc;module load blic-modules bioconda'" (obviously you will need to change this - perhaps put all the necessary packages into a Singularity image and point to it using "containerPath").
There you will also see a gmx param, which points to the hallmarks geneset that I downloaded from MSigDB and which is used as the default geneset for the GSEA analysis, as well as some tx2gene (transcript2gene) files for different assemblies, downloaded from iGenomes, which you will most likely not need. I have them there because we have a modified version of the nf-core/rnaseq pipeline that runs kallisto (https://github.com/UCL-BLIC/rnaseq), so the DGE pipeline has an option to use transcript abundance files from kallisto instead of the default gene counts from featureCounts, if desired. If you use this option, you then need the tx2gene files.
Then you run it like this:
nextflow run $path_to_DGE_pipeline -profile YOURPROFILE --inputdir inputdir --metadata metadata.txt --outputfile results
The pipeline itself is also quite simple; it has only 2 processes: one runs DESeq2 and the other runs GSEA.
The DESeq2 process simply calls the https://github.com/UCL-BLIC/DGE/blob/master/bin/run_deseq2.R R script, and to run it successfully you will have to install the packages loaded there (DESeq2; general packages like dplyr, ggplot2, ini, ggrepel, png...; rhdf5 and tximport if using kallisto; and ReportingTools and hwriter for the output HTML report). This script runs DESeq2 and generates an HTML report with the results.
The GSEA process calls two perl scripts: one that ranks the genes (-log of the p-val * sign of the FoldChange): https://github.com/UCL-BLIC/DGE/blob/master/bin/get_metric.pl, and one that runs GSEA and generates a summary plot of the results (https://github.com/UCL-BLIC/DGE/blob/master/bin/run_gsea.pl). To run this successfully you will need to download GSEA and change the path that points to the GSEA.jar file.
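For reference, that ranking metric could be sketched as follows. This assumes a base-10 log, which the actual get_metric.pl script may or may not use:

```python
import math

# Sketch of the ranking metric described above: -log(p-value) multiplied
# by the sign of the fold change. Base-10 log is an assumption here.
def gsea_rank(p_value, log2_fc):
    return -math.log10(p_value) * math.copysign(1.0, log2_fc)

print(gsea_rank(1e-6, 2.3))   # strongly up-regulated -> large positive rank
print(gsea_rank(1e-6, -2.3))  # strongly down-regulated -> large negative rank
```

Genes are then sorted by this metric to produce the ranked list that GSEA consumes, so significant up-regulated genes land at one end and significant down-regulated genes at the other.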
Sorry for the long comment; I think the above covers the basics of what you would need to do to make it work outside our clusters. Please let me know if it's confusing, and sorry if it is - I didn't think anybody except us would use it, otherwise I would have tried to containerise everything (although I'm not sure you can do that with GSEA, as I believe that, like with GATK, you need to register to be able to download and use it).
I'm attaching some results that I got after using this pipeline with a test dataset and specifying --design "CD+Tregs" --condition "Tregs" --treatment "high" --control "ctr" --pval 1e-6 --fc 0.8, in case you want to look at them. The metadata for this dataset looks something like this:
Samples CD Tregs
s1 CD4 low
s2 CD8 low
s3 CD4 high
s4 CD8 ctr
s5 CD4 high
[...]
@lconde-ucl - Side note, but the way we handle institutional configs is now quite a bit nicer, see https://nf-co.re/usage/configuration. I guess that this pipeline was built with a fairly old version of the nf-core template; if you sync it with the latest version (nf-core sync) then you'll get all of this functionality for free. If it's set up this way then anyone can run the pipeline with any nf-core institutional config.
@ewels Yes, this was done quite a long time ago when I was starting to use Nextflow, so it probably looks a bit amateur :) I will check out nf-core sync, thanks!
Hi Lucia @lconde-ucl,
thanks for sharing your pipeline, it looks very nice indeed and very close to what we want to achieve here, I think. I have some questions regarding the setup:
If that is possible, then in my opinion it is quite similar to the one we have, and it would not matter much which one we use as a starter.
The way to input the metadata for our pipeline is also via a metadata sheet (TSV file). The sample IDs need to match the IDs in the featureCounts table from the RNAseq run so far, but we can also think of something smarter there.
Here is an example with 2 conditions, but it supports any number of them:
SampleID condition_treatment condition_genotype
Sample1 treated wt
Sample2 control wt
Sample3 treated KO
Sample4 control KO
It handles samples with the same levels in each condition as replicates.
You provide the feature count table via --rawcounts 'path/to/raw_count_table.tsv'.
The linear model for modeling the gene expression can be specified with the --model param, e.g.:
~ condition_treatment + condition_genotype
Then by default all pairwise combinations of the levels in each condition are used as contrasts. If you need more complex contrasts, the most straightforward way to provide them is via the --contrast_list param, which expects a TSV file:
factor numerator denominator
condition_treatment treated control
condition_genotype KO WT
But it is also possible to input more complex contrasts via the --contrast_matrix param, in which interaction terms can also be provided in a TSV file. E.g.:
coefficient treatment_treated_vs_control genotype_KO_vs_WT
intercept 0 0
condition_treatment treated 1 0
condition_genotype KO 0 1
LogFC thresholds and p-values can also be provided as parameters to set the thresholds for DEGs. And there is even a --batch_effect option to correct for batch effects if needed.
The pathway analysis is performed with gprofiler2, and currently supports pathway analysis for mouse and human, which can be specified via --species.
The rest of the params are QBiC-related things for our reporting (we generate a report starting from an Rmarkdown template), which I would, of course, trim out and adapt for MultiQC.
Hi Gisela (@ggabernet), thanks - yes, the pipelines are very similar, although I think ours is a bit simpler because it does not accept interactions at the moment. In answer to your questions:
1 - Yes, it accepts any number of conditions. By default, if the user does not specify a design, it will use a design with all the conditions (design = condition1 + condition2 + condition3 + ...), and, like yours, all pairwise combinations of the levels in each condition are used as contrasts. I personally would not use a design like that with more than 3 or 4 conditions, but in any case, yes, it's possible. And if the user is interested in just one specific design/contrast, it can also be specified via the --design parameter.
2 - Yes, it's also possible to have several levels per condition, but it does NOT support interactions because the code uses the original ("normal") DESeq2 shrinkage estimator (resNorm <- lfcShrink(dds, contrast = contrast, res = res, type = "normal")), which I believe does not work for designs with interactions. This could be changed (for example, by using a different shrinkage method when the user specifies a design with interactions), but at the moment it will not allow it.
I think you already have a great pipeline to start with, but if there is anything from this pipeline that you think might be useful and want to reuse, that would be great. I did this some time ago and didn't make much use of it, and I'm glad to see that something similar (and better!) might be coming soon!
Best Lucia
Sure, thanks for sharing your workflow and all the information; we can certainly combine the best of both pipelines, and it's great to see a second implementation!
We'll stay in contact regarding this pipeline, then :)
That mini-workflow does look really cool @lconde-ucl 😎 I am cc'ing in @macroscian here, who is our in-house stats guru. He has written a similar script for use across our group, with the idea of trying to generalise the provision of designs in order to perform analysis with DESeq2. It would be really cool if we could come up with a consensus on how to do this sort of analysis, and generate some nice documentation and downstream assessments too.
Stefan in our group has also been working on a web-based exploratory tool for differential analysis for a while now. Can't find his GitHub handle but I can ask him to comment here too.
Just a quick comment. At the Crick we've developed a fairly concise way of specifying the models and contrasts that we'd like to run on an expression dataset. For one-way designs there's little harm in doing all pairwise comparisons by default (although even there, I'd want to be able to do an LRT test across the levels as well), but we mainly see more complex designs than that, at which point it's hard to get away from some human specification of the hypotheses.
For example, in a 2x3 design we get asked for any of:
If we try to automate the 'do everything' approach, it gets combinatorially confusing for the scientist; and if we try to second-guess what the 'obvious' thing is, we find that the scientist will think it is the only thing possible. Which is why I've basically abstracted out the 'design' and 'reduced' parts of the DESeq fitting, and the 'contrast' argument of the results, so we can choose as few or as many of the above hypotheses in our analysis (ideally when the experiment is being designed).
I've mainly been doing this with a skeleton script that contains boilerplate for the different options above, which can be commented out or expanded as required, but it wouldn't be much work to add the model/contrast specification as another input to the pipeline. The reason we keep it separate is that even if we pre-register an analysis plan, quite often we adapt it to take account of unforeseen circumstances.
The 'API' that abstracts the DESeq functionality is an R list - it's probably possible to put it into a more markup-language-like format and parse it into R, but for the moment an example is:
mdlList <- list(
"Standard" = list(
design = ~ treatment,
comparison = list(
"DPN" = c("treatment","DPN","Control"),
"OHT" = c("treatment","OHT","Control")
)
),
"Patient Adjusted" = list(
design = ~ patient + treatment,
comparison = list(
"Treated" = list(c("treatment_DPN_vs_Control", "treatment_OHT_vs_Control"), listValues=c(.5, -1)),
"DPN" = c("treatment","DPN","Control"),
"OHT" = c("treatment","OHT","Control"),
"Patient" = ~treatment
)
)
)
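To show the shape of that abstraction, here is a rough Python mirror of the R list above (a hypothetical translation, not part of any pipeline). The point is that one outer loop fits each design and an inner loop extracts each named comparison:

```python
# Hypothetical Python mirror of the R mdlList above; names and values copy
# the R example, only the container syntax changes.
models = {
    "Standard": {
        "design": "~ treatment",
        "comparisons": {
            "DPN": ("treatment", "DPN", "Control"),
            "OHT": ("treatment", "OHT", "Control"),
        },
    },
    "Patient Adjusted": {
        "design": "~ patient + treatment",
        "comparisons": {
            "Treated": {"contrast": ["treatment_DPN_vs_Control",
                                     "treatment_OHT_vs_Control"],
                        "listValues": [0.5, -1]},
            "DPN": ("treatment", "DPN", "Control"),
            "OHT": ("treatment", "OHT", "Control"),
            "Patient": "~ treatment",  # reduced formula, e.g. for an LRT
        },
    },
}

for model_name, spec in models.items():
    # a real runner would fit the model once per design here ...
    for comparison_name, contrast in spec["comparisons"].items():
        # ... and extract results once per named comparison
        print(model_name, comparison_name, contrast)
```

A driver like this keeps the fitting code generic while all experiment-specific decisions live in one declarative structure, which is the separation the comment above describes.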
So, yes, it still demands an understanding of contrasts, model formulae etc., but without wanting to sound too gate-keeper-ish, I think for any experiment beyond a one-way design this is always going to be the case.
Hello everyone. How could I possibly track the progress of adding this functionality to the existing RNA-seq pipeline? Do you have any timeline, estimate for integrating this feature(s)? Thank you @ewels @ggabernet @lconde-ucl @drpatelh
@chlazaris - from the above discussion we decided to make this into a new pipeline, so it will never be added to the existing RNA-seq pipeline. The timeline for the new pipeline is very difficult to estimate as it depends entirely on who has time to build it and when. But I suspect that it will be quite a while yet.
Phil
Thank you @ewels. Please, keep me posted on this one. It will be a very helpful addition. Thank you! @drpatelh @ewels @ggabernet @lconde-ucl
Given that we have decided to create a stand-alone workflow for this and not to incorporate it directly within nf-core/rnaseq, I have closed this issue for now. However, I have pinned it in the issues section so it is findable for those looking for this functionality.