mhorlbeck / ScreenProcessing


Comparison to MAGeCK #26

Open keenhl opened 1 year ago

keenhl commented 1 year ago

Thanks for making this tool available. It was recommended to me by a colleague. I'm new to this kind of analysis and was curious about the advantages and disadvantages of this tool compared to the MAGeCK software.

Thanks for the help.

mhorlbeck commented 1 year ago

Good question. At least compared to the original version of MAGeCK (there have been a few updates and I haven't kept up to date), I'd say there are three differences:

  1. sgRNA counts -> phenotype scores: The approach MAGeCK uses is more sophisticated but may be harder to directly interpret. It uses modeling of dispersion to correct for noise at lowly-represented sgRNAs, similar to DESeq if you're familiar with RNA-seq analysis, whereas this pipeline just measures log2 fold-enrichment without correction and applies a counts threshold to exclude very lowly-represented sgRNAs.

  2. sgRNA-level -> gene-level phenotypes: This pipeline uses two partially orthogonal metrics to score genes based on the sgRNAs targeting that gene:

    • it computes a Mann-Whitney p-value, which tests whether the sgRNAs targeting a gene look like a random sample from the negative controls. The test is non-parametric (it relies only on guide ranking), so one strong outlier sgRNA will not have a large effect on the p-value.
    • it uses the average of the top 3 sgRNAs by absolute value (by default, can be adjusted in the settings). This provides an estimate of the actual effect size of the gene, implicitly assuming that the sgRNAs below the top 3 are less effective at repression/activation/cutting. This results in a volcano plot that reveals hits with a strong effect and high significance, hits with a weak effect but high significance, and hits with a strong effect but marginal significance (i.e. you'll want to look at those manually to see if they are driven by just 1-2 active sgRNAs or by low counts and noise).

MAGeCK uses just a rank-based p-value, which in my opinion is less interpretable. But there is no reason you couldn't take the MAGeCK results from sgRNA counts->phenotypes and apply whatever statistical tests you prefer to get gene scores.

  3. Related to 2, gene-level scoring in MAGeCK is done agnostic of negative controls. There's a lot that has been debated and written about controls in CRISPR nuclease/i/a screening libraries, but I think it is very important not to assume that the median gene has no phenotype, because some screens (like essential gene screens) can have a very skewed distribution and would cause negative controls and genes with no phenotype to appear enriched relative to essential genes. The Kampmann lab has a version of MAGeCK that fixes this: https://kampmannlab.ucsf.edu/mageck-inc
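To make the "sgRNA counts -> phenotype scores" step concrete, here is a minimal sketch of an uncorrected log2 fold-enrichment with a counts threshold. It is not the pipeline's actual code: the function name `log2_enrichment`, the `min_count` and `pseudocount` parameters, the normalization by total read count, and the rule of dropping an sgRNA only when it is below the threshold in both samples are all assumptions made for illustration.

```python
import math


def log2_enrichment(counts_before, counts_after, min_count=50, pseudocount=1):
    """Return {sgRNA: log2 fold-enrichment}, skipping low-count sgRNAs.

    Hypothetical helper: the threshold rule, pseudocount, and read-depth
    normalization are illustrative assumptions, not the pipeline's settings.
    """
    total_before = sum(counts_before.values())
    total_after = sum(counts_after.values())
    scores = {}
    for sgrna, before in counts_before.items():
        after = counts_after.get(sgrna, 0)
        # Assumed filter: exclude sgRNAs that are lowly represented in both samples.
        if before < min_count and after < min_count:
            continue
        # Normalize to read depth so the two samples are comparable.
        frac_before = (before + pseudocount) / total_before
        frac_after = (after + pseudocount) / total_after
        scores[sgrna] = math.log2(frac_after / frac_before)
    return scores


if __name__ == "__main__":
    before = {"sgA": 400, "sgB": 400, "sgC": 3}
    after = {"sgA": 800, "sgB": 100, "sgC": 1}
    for name, score in log2_enrichment(before, after).items():
        print(name, round(score, 2))
```

In this toy input, `sgC` is dropped by the threshold, while `sgA` comes out enriched and `sgB` depleted. There is no dispersion modeling here, which is the main contrast with the MAGeCK approach described in the first point.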

You can certainly try both and see what is easiest to implement (my pipeline may not be the most user-friendly) and what gives you results that are interpretable and can be functionally validated.
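For a feel of how the two gene-level metrics (Mann-Whitney p-value and top-3 average) behave, here is a rough standard-library sketch. It is an illustration, not the pipeline's implementation: `mann_whitney_p` and `top_n_average` are hypothetical names, the p-value uses a normal approximation without tie or continuity correction, and "top 3 by absolute value" is read as averaging the signed phenotypes of the three sgRNAs with the largest magnitude. In practice you would more likely call `scipy.stats.mannwhitneyu`.

```python
import math


def mann_whitney_p(sample, controls):
    """Two-sided Mann-Whitney p-value via the normal approximation.

    Simplified sketch: no tie correction and no continuity correction.
    """
    n1, n2 = len(sample), len(controls)
    # U counts (sample, control) pairs where the sample value is larger; ties count half.
    u = sum((s > c) + 0.5 * (s == c) for s in sample for c in controls)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def top_n_average(phenotypes, n=3):
    """Average the signed phenotypes of the n sgRNAs with the largest magnitude."""
    top = sorted(phenotypes, key=abs, reverse=True)[:n]
    return sum(top) / len(top)


if __name__ == "__main__":
    # Toy negative controls spread evenly around zero.
    controls = [i / 10 for i in range(-10, 11)]
    consistent_gene = [-3.0, -2.8, -2.5, -2.9, -3.1]  # all sgRNAs active
    outlier_gene = [-8.0, 0.05, -0.15, 0.25, 0.35]    # one strong sgRNA only
    for name, guides in [("consistent", consistent_gene), ("outlier", outlier_gene)]:
        print(name, mann_whitney_p(guides, controls), top_n_average(guides))
```

The gene driven by a single outlier sgRNA gets a large top-3 average but an unremarkable p-value, which is the "strong effect but marginally significant" corner of the volcano plot that is worth inspecting by hand.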