Normalization creates outliers

rgcgithub / clamms

CLAMMS is a scalable tool for detecting common and rare copy number variants from whole-exome sequencing data.

Normalization creates outliers #12

Closed haraldgrove closed 6 years ago

haraldgrove commented 6 years ago

Hi, I'm trying to analyze 16 exome samples. Before normalization, PCA on the .coverage.bed files shows all samples in one evenly distributed cluster. The same PCA after normalization (.norm.cov.bed) shows that two samples have become clear outliers. For example, in one region, one sample went from a coverage of 44 before normalization to -2723 after. Do you have any idea what might have happened, or what I could look for?

Best regards, Harald

rgcgithub commented 6 years ago

It's hard to say without seeing your data. I don't know exactly what you're comparing, but you shouldn't expect raw PCs to be comparable between PCA on raw coverage files and PCA on normalized coverage files. That said, PCA is not really a part of CLAMMS. You can use it to identify obvious batch effects in smaller cohorts, but yours is only 16 samples. I would just train one reference model with all samples and call CNVs against it. If those two outliers have inflated CNV call statistics, you could try retraining with those two excluded. But with only 16 samples, you don't have much flexibility.
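For readers who want to reproduce this kind of check, here is a minimal sketch of PCA on a samples-by-windows coverage matrix, with a simple robust outlier flag on the first component. It is not part of CLAMMS. The function names `pca_scores` and `flag_outliers`, the median/MAD rule, and the 3.5 threshold are all assumptions made for illustration, and building the matrix from the .coverage.bed or .norm.cov.bed files is left out because it depends on the column layout of your files.

```python
import numpy as np


def pca_scores(matrix, n_components=2):
    """Project samples (rows) onto the top principal components.

    `matrix` is assumed to be samples x windows, one coverage value per cell.
    """
    X = np.asarray(matrix, dtype=float)
    X = X - X.mean(axis=0)  # center each window across samples
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def flag_outliers(scores, threshold=3.5):
    """Flag samples whose PC1 score has a large robust z-score.

    Uses median and MAD rather than mean and SD, so that one extreme
    sample does not hide itself by inflating the spread. The threshold
    is a hypothetical default, not a CLAMMS recommendation.
    """
    pc1 = scores[:, 0]
    med = np.median(pc1)
    mad = np.median(np.abs(pc1 - med))
    if mad == 0:
        return np.zeros(len(pc1), dtype=bool)
    robust_z = 0.6745 * (pc1 - med) / mad
    return np.abs(robust_z) > threshold


# Synthetic demonstration: 16 samples, 200 windows, one shifted sample.
rng = np.random.default_rng(0)
demo = rng.normal(loc=100.0, scale=1.0, size=(16, 200))
demo[5] += 50.0  # simulate a sample with systematically higher coverage
demo_scores = pca_scores(demo)
demo_flags = flag_outliers(demo_scores)
```

Because the sign and scale of principal components depend on the input, scores from the raw and the normalized matrices should be inspected separately rather than compared number for number.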

haraldgrove commented 6 years ago

Thank you for the response. Unfortunately (for me), I just realised that only one of the outliers from the PCA plot actually matches one of the two samples with very high numbers of CNV calls. That probably means the normalization is fine, and it's just the lack of samples that makes those two appear to have more CNV regions. Sorry for jumping to conclusions.

Best, -Harald

rgcgithub commented 6 years ago

If you have 2 samples that are outliers in coverage space (i.e. as determined by your PCA), but you're getting different outlier samples in terms of having high CNV call rates, I would try building your test panel model with the coverage outliers excluded. I would guess that your cohort sample size is not large enough to mitigate the effect that those two outlier samples have on your coverage distributions. You might also consider building a custom model for each sample where you also leave the test sample out of the model training set because, again, your sample size is small enough that I might worry about the test sample biasing your reference distributions.
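The two suggestions above (dropping the coverage outliers from training, and leaving each test sample out of its own model) can be sketched as a small bookkeeping step that produces one training list per test sample. This is only an illustration. The function name `leave_one_out_panels` and the sample IDs are hypothetical, and the sketch does not run any CLAMMS command; it only decides which samples would go into each training set.

```python
def leave_one_out_panels(samples, exclude=()):
    """Return {test_sample: [training samples]} for per-sample models.

    Each training list omits the test sample itself and every sample
    listed in `exclude` (for example, the coverage outliers from PCA).
    """
    excluded = set(exclude)
    panels = {}
    for test in samples:
        panels[test] = [
            s for s in samples if s != test and s not in excluded
        ]
    return panels


# Hypothetical cohort of 16 samples with two coverage outliers.
samples = [f"S{i:02d}" for i in range(1, 17)]
coverage_outliers = ["S03", "S07"]
panels = leave_one_out_panels(samples, exclude=coverage_outliers)
```

With 16 samples and two excluded, each non-outlier sample is called against a model trained on 13 samples, so the cost of this scheme is an even smaller reference panel per call.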