nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License
919 stars 708 forks source link

PCA in deseq2_qc centers but does not scale #687

Closed dangeles closed 3 years ago

dangeles commented 3 years ago

Hi!

First of all, thanks to all the developers and the community for this really beautiful pipeline. I am really impressed with how well the pipeline functions, and especially with the modularization that has recently taken place.

I was browsing the deseq2_qc.r script, and I noticed that in line 136, the pca decomposition is called with center = TRUE and scale = FALSE. In the past, I have always used scale = TRUE, and I was wondering what the rationale is for not scaling? I was wondering if plotting PCA using centering but not scaling helps reveal some pathologies among samples that scaled PCA does not?

I guess the feature request would be to title PCA plot output as centered, non-scaled PCA or something informative like that. I found out the PCA was not scaled after comparing my PCA plots to the plots from the pipeline output, and was confused by a bit until I found the script PCA call. It would also be reasonable to make the scale and center into nextflow parameters so users can specify their PCA at will.

Again, thanks all for this great pipeline. All the scripts are really well written, really easy to troubleshoot, and an enormous service to the community.

Best, David

drpatelh commented 3 years ago

Hi @dangeles ! Thank you for the kinds words and now becoming a contributing member of the community! 😎

I will defer this question to @macroscian who wrote this script for us and will defo be able to come up with a more informed answer.

gavin-kelly-1 commented 3 years ago

Two reasons: primarily to maintain concordance with DESeq2::plotPCA which also goes with hard-wired center=TRUE, scale=FALSE. But also personal preference - the usual rule of thumb for PCA is that if the features are on a commensurate scale (e.g. same units of measurement) then scaling might disrupt that natural correspondence.

Seq assays are a bit of an odd case here, in that we're measurement the same sort of thing on different entities, so there's a case to be made either way, but I tend to find that scaling over-emphasises the lowly expressed genes (noise?). We've also gone with not filtering on variability (in addition to taking the traditional top 500 variable genes), which gives lowly expressed (which are typically the low variance) genes the possibility of reasserting themselves in the PC weighting if enough of them are acting in a correlated way.

It might well be useful to make the plot legends/documentation more explicit on what's behind the plots; I hope people aren't reading too much into the PCA's though, as my belief is that they really should be done as part of the more 'interactive' analysis where the analyst is making decisions such as these informed by the research questions being asked. Personally, I'd rather encourage people to ignore the PCAs and instead generate their own, rather than try to encapsulate a set of further options into the nf-core pipeline (and I think the general mood is not to increase options more than is strictly necessary?) But that is my very opinionated perspective, and happy to hear other opinions.

drpatelh commented 3 years ago

Thanks @macroscian ! Yes, I agree that I am actively trying to minimise the addition of parameters if possible. Mainly because we already have quite a few options for other, core components of the pipeline. The PCA generated at the end of the pipeline is more of a perk and should be taken with a pinch of salt - we tried to emphasise this in bold in the docs

Hopefully, this answer's your question @dangeles ?

drpatelh commented 3 years ago

Ok. Will close this now assuming all questions have been answered :) Feel free to re-open if required.

dangeles commented 3 years ago

thanks for the thorough answer! that was helpful!