Question regarding CCF output interpretation

maxanes commented 2 years ago

Hi, I am looking at the CCFs of mutations, and today I got somewhat strange output for the mutations in this sample, for which PureCN estimated a purity of 0.84 and a ploidy of 3.49. The mean coverage is 79 in the normal sample and 251 in the tumor sample. I have also attached screenshots from the pdf.result file. Does this mean that the sample is contaminated, given that PureCN also reports a contamination of 0.01? Thank you.

| Sampleid | chr | start | end | ID | REF | ALT | SOMATIC.M0 | SOMATIC.M1 | SOMATIC.M2 | SOMATIC.M3 | SOMATIC.M4 | SOMATIC.M5 | SOMATIC.M6 | SOMATIC.M7 | GERMLINE.M0 | GERMLINE.M1 | GERMLINE.M2 | GERMLINE.M3 | GERMLINE.M4 | GERMLINE.M5 | GERMLINE.M6 | GERMLINE.M7 | GERMLINE.CONTHIGH | GERMLINE.CONTLOW | GERMLINE.HOMOZYGOUS | ML.SOMATIC | POSTERIOR.SOMATIC | ML.M | ML.C | ML.M.SEGMENT | M.SEGMENT.POSTERIOR | M.SEGMENT.FLAGGED | ML.AR | AR | AR.ADJUSTED | MAPPING.BIAS | ML.LOH | CN.SUBCLONAL | CELLFRACTION | CELLFRACTION.95.LOWER | CELLFRACTION.95.UPPER | ALLELIC.IMBALANCE | FLAGGED | log.ratio | depth | prior.somatic | prior.contamination | on.target | seg.id | pon.count | gene.symbol |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| C32T09757D_ffpe | chr5 | 1,78E+08 | 1,78E+08 | chr5:177554468_C/T | C | T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | TRUE | 0 | 1 | 3 | 0 | 1 | FALSE | 0,295775 | 1 | 1 | 0,992125 | FALSE | FALSE | 1 | 0,97 | 1 | -35,5108 | TRUE | -0,06416 | 81 | 0,000099 | 0,01 | 2 | 159 | 0 | |

lima1 commented 2 years ago

No, this sample looks fine (apart from the obvious over-segmentation - did you ever try PureCN.R --fun-segmentation GATK4?) and both purity and ploidy are correct.

PureCN always tests for contamination and by default sets it to 0.01 in the likelihood model. That's why "prior.contamination" is 0.01. "prior" values are model parameters, while "posterior" values are the probability predictions of the model. Only known SNPs are tested for cross-sample contamination. There are always a few SNPs assigned to this contamination state, simply because it is the best-fitting state by chance or due to uncorrected bias, but that's not enough to call a sample contaminated.
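To illustrate the prior vs. posterior distinction, here is a minimal sketch of generic Bayes' rule over discrete states. It is not PureCN's actual likelihood model: the function name `posterior`, the two state names, and all likelihood values are hypothetical; only the 0.01 contamination prior comes from the explanation above.

```python
def posterior(priors, likelihoods):
    """Normalise prior * likelihood over a set of discrete states."""
    unnormalised = {state: priors[state] * likelihoods[state] for state in priors}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("no state has non-zero support")
    return {state: value / total for state, value in unnormalised.items()}


# Hypothetical SNP: the germline state fits the observed allelic
# fraction poorly, so the contamination state wins despite its low prior.
priors = {"germline": 0.99, "contamination": 0.01}
likelihoods = {"germline": 0.001, "contamination": 0.5}  # made-up values

post = posterior(priors, likelihoods)
for state, probability in post.items():
    print(f"{state}: prior={priors[state]:.2f} posterior={probability:.3f}")
```

The prior stays the same for every SNP, whereas the posterior depends on how well each state fits that particular SNP, which is how an occasional SNP can end up in the contamination state.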

A truly contaminated sample has bands of SNPs with allelic fractions close to 0 and close to 1 (SNPs only present in the contaminating sample, and homozygous SNPs with a few reference alleles from the contamination, respectively). PureCN should be pretty good at flagging contaminated samples in Sampleid.csv, unless the contamination is pretty bad, like > 7% or so; then PureCN gets thoroughly confused. If lots of samples are wrongly labelled as contaminated, then the VCF contains a lot of unlabelled artifacts.
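The two bands described above can be eyeballed with a rough sketch like the following. This is not how PureCN detects contamination; the function name `band_fractions` and the cutoffs 0.05 and 0.95 are assumptions chosen only for illustration, and the input is simply a list of SNP allelic fractions.

```python
def band_fractions(allelic_fractions, low=0.05, high=0.95):
    """Return the share of SNPs in the near-0 and near-1 bands.

    Values of exactly 0 or 1 are excluded, since a clean homozygous
    SNP has an allelic fraction of exactly 1. The cutoffs are
    hypothetical, not PureCN defaults.
    """
    n = len(allelic_fractions)
    if n == 0:
        return 0.0, 0.0
    near_zero = sum(1 for ar in allelic_fractions if 0 < ar <= low)
    near_one = sum(1 for ar in allelic_fractions if high <= ar < 1)
    return near_zero / n, near_one / n


# Made-up allelic fractions for four SNPs.
example = [0.02, 0.5, 0.97, 1.0]
print(band_fractions(example))
```

A handful of SNPs in either band is expected by chance; only a clear excess in both bands would hint at the pattern described above.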

maxanes commented 2 years ago

Thank you for your quick response. I haven't tried --fun-segmentation GATK4 yet; I am using --fun-segmentation PSCBS. What confuses me about the mutation above is that ML.SOMATIC is TRUE while POSTERIOR.SOMATIC is 0, although the prior probability of being somatic is also low. I am not sure why that is. Could it be due to the low coverage at that position (only about 80), or can something else cause such errors?

lima1 commented 2 years ago

This variant is flagged, meaning no state really fit well. This can happen for various reasons, potentially because the variant lies in a noisy region where the segmentation was wrong. I would ignore all "flagged" variants unless they are of interest; in that case, dig deeper into why they are flagged.
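Dropping flagged variants before looking at CCFs could be sketched as below. The column names FLAGGED and CELLFRACTION are taken from the table earlier in this thread; the helper name `unflagged` and the inline CSV content are hypothetical, and the sketch assumes the variants table has been exported as comma-separated text with TRUE/FALSE in the FLAGGED column.

```python
import csv
import io


def unflagged(rows):
    """Keep only rows whose FLAGGED column is not TRUE."""
    return [row for row in rows if row["FLAGGED"].strip().upper() != "TRUE"]


# Made-up two-variant table standing in for the real variants file.
csv_text = (
    "ID,FLAGGED,CELLFRACTION\n"
    "variant_a,TRUE,1\n"
    "variant_b,FALSE,0.8\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
kept = unflagged(rows)
for row in kept:
    print(row["ID"], row["CELLFRACTION"])
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle; flagged variants of interest can be kept aside and inspected separately.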

lima1 / PureCN

Question regarding CCF output interpretation #223