hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

PURPLE_AF - how is it calculated? #394

Closed maia-munteanu closed 1 year ago

maia-munteanu commented 1 year ago

Hello! I'm having trouble understanding what PURPLE_AF represents and how it is calculated from AF. Working on a PCAWG sample, I extracted both AF and PURPLE_AF from the VCF, along with the sample purity estimated by PURPLE. I assumed that PURPLE_AF is calculated as AF / purity (with purity as predicted by PURPLE), with no copy number (CN) information in the formula. However, when I divide each mutation's AF by its PURPLE_AF, I obtain variable purity values (see the plots below). Most values are close to the predicted purity of 0.64, but many are not.

[Image: AF / PURPLE_AF per mutation, ordered from lowest to highest purity value]

[Image: AF / PURPLE_AF per mutation, ordered by position along the genome]

Given this pattern of purity values, I suspect that some CN information is somehow incorporated in the formula used for PURPLE_AF, but I wasn't able to find any more details in the documentation. Would you be able to provide me with more details about this purity adjustment step?

Many thanks, Maia

p-priestley commented 1 year ago

Yes, it is adjusted based on the purity and the local copy number of the variant.

This is done in this step: https://github.com/hartwigmedical/hmftools/tree/master/purple#10-somatic-enrichment

maia-munteanu commented 1 year ago

Thanks a lot for getting back to me so quickly. Would you be able to tell me exactly how it is calculated from the purity and CN info? Is it something like: VAF / purity * (CN * purity + 2 * (1 - purity))? Are there any other variables used in this calculation, e.g. multiplicity?

Thanks, Maia

p-priestley commented 1 year ago

I believe the formula is: VAF * (CN * purity + 2 * (1 - purity)) / CN / purity
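
As a minimal sketch, the formula quoted in this reply can be written as a small Python function. The function name `purple_af`, the argument names, and the input checks are illustrative assumptions and are not taken from the PURPLE source code. The constant 2 is read here as the copy number of normal diploid cells, which is an interpretation rather than something stated in the thread.

```python
def purple_af(vaf: float, cn: float, purity: float) -> float:
    """Adjust a raw allele fraction for purity and local copy number.

    Implements the formula quoted in this thread:
        VAF * (CN * purity + 2 * (1 - purity)) / CN / purity
    The name and the validation below are hypothetical, not PURPLE's API.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if cn <= 0.0:
        raise ValueError("cn must be positive")
    # Average number of copies at the locus across the whole sample:
    # tumour cells contribute cn * purity; the remaining cells are
    # assumed to carry 2 copies (interpretation of the constant 2).
    total_copies = cn * purity + 2.0 * (1.0 - purity)
    return vaf * total_copies / cn / purity


# Illustrative values only: a raw VAF of 0.32 at CN = 2 in a sample
# with purity 0.64 adjusts to roughly 0.5.
example = purple_af(vaf=0.32, cn=2.0, purity=0.64)
print(example)
```

Under this formula, a variant at CN = 2 is simply divided by the purity, while variants at other copy numbers are scaled differently.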

maia-munteanu commented 1 year ago

Thanks a lot! Just to confirm, the formula is:

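$$\mathrm{PURPLE\_AF} = \mathrm{VAF} \times \frac{\mathrm{CN} \times \mathrm{purity} + 2 \times (1 - \mathrm{purity})}{\mathrm{CN} \times \mathrm{purity}}$$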

Here, purity is the single per-sample value that PURPLE writes to the .purple.purity.tsv file, VAF is the AF given in the VCF for each mutation, and CN is the PURPLE_CN value, also given for each mutation in the VCF. Did I get that right?

Thanks, Maia

p-priestley commented 1 year ago

That looks right, yes
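
Rearranging the formula quoted in this thread gives AF / PURPLE_AF = CN * purity / (CN * purity + 2 * (1 - purity)), which equals the purity only where CN = 2. This would account for the spread of values described in the original question. The sketch below evaluates that ratio for a few copy numbers at the purity of 0.64 reported there; the function name `implied_purity` and the chosen copy numbers are illustrative assumptions.

```python
def implied_purity(cn: float, purity: float) -> float:
    """Ratio AF / PURPLE_AF implied by the formula quoted in this thread.

    The name is hypothetical; this is algebra on the quoted formula,
    not a function taken from PURPLE.
    """
    return cn * purity / (cn * purity + 2.0 * (1.0 - purity))


purity = 0.64  # sample purity reported in the original question
for cn in (1.0, 2.0, 3.0, 4.0):  # illustrative copy numbers
    print(f"CN={cn:.0f}  AF/PURPLE_AF={implied_purity(cn, purity):.3f}")
```

For CN below 2 the ratio falls below the sample purity, and for CN above 2 it rises above it, so dividing AF by PURPLE_AF recovers the purity only in copy-number-2 regions.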