hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0
181 stars 56 forks

Purity Interpretation #396

Closed (DarioS closed 1 year ago)

DarioS commented 1 year ago

Recently, two algorithms for analysing single-cell RNA sequencing data, CopyKAT and SCEVAN, have identified that some tumours are composed of two or three main clones. For example, from the CopyKAT results:

In one TNBC sample (TNBC1), the clustering of 797 aneuploid copy number profiles identified two major subclones (A, B) that comprised 44% and 28% of the tumor mass and were separated by two distinct lineages in neighbor-joining (NJ) tree. Clustered heat maps identified clonal amplifications (1q, 6p, 8q, 10p, 16p and 18p) and clonal deletions (1p, 4q, 5q, 8p, 10q, 13 and 14) that were shared across all tumor cells. The clustered heat maps of the consensus copy number profiles revealed subclonal CNA events, including subclonal amplifications in clone A (4p, 7q, 9p13.2–q22.2 and 17q) and subclonal amplifications in clone B (3p26.3–p25.1, 6q, 7p, 11q, Xp11.23 and Xq) that varied in the tumor mass.

How does that correspond to PURPLE's estimate of tumour purity? Would it be 44% + 28% = 72% (i.e. cancer cell fraction)? Or would it be reported as 44% (i.e. largest pure tumour group)? It might be useful to provide some clarifying sentences in the user guide.

p-priestley commented 1 year ago

PURPLE tries to find the fit that places the highest proportion of the genome at integer copy numbers. This is constrained by penalties for less biologically plausible scenarios (i.e. higher copy number). So in your example it may fit 28%, 44% or 72%, depending on how much of the copy number change is shared between the clones and how much is private to each one. If each clone has a lot of private copy number activity, then the fit is likely to fail to find a good solution.
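
As a rough illustration of why the fitted purity could land on 28%, 44% or 72%, here is a minimal toy sketch. It is not PURPLE's fitting code: the segment lengths, the copy numbers, the tolerance, and all function names are made up for the example, and it ignores ploidy normalisation, allele frequencies, and the plausibility penalties. It only scores each candidate purity by the length-weighted share of the genome whose implied tumour copy number is close to an integer.

```python
NORMAL_CN = 2.0  # copy number assumed for normal (non-tumour) cells

def observed_cn(clone_cns, clone_fractions):
    """Bulk average copy number of one segment in a mixed sample."""
    normal_fraction = 1.0 - sum(clone_fractions)
    tumour_part = sum(f * cn for f, cn in zip(clone_fractions, clone_cns))
    return normal_fraction * NORMAL_CN + tumour_part

def implied_tumour_cn(observed, purity):
    """Tumour copy number implied by a bulk value at an assumed purity."""
    return (observed - (1.0 - purity) * NORMAL_CN) / purity

def integer_fraction(segments, clone_fractions, purity, tol=0.1):
    """Length-weighted share of segments that look integer at this purity.

    segments: list of (length, (cn_in_clone_A, cn_in_clone_B)).
    tol is an arbitrary illustrative tolerance.
    """
    total = sum(length for length, _ in segments)
    near_integer = 0
    for length, clone_cns in segments:
        cn = implied_tumour_cn(observed_cn(clone_cns, clone_fractions), purity)
        if abs(cn - round(cn)) <= tol:
            near_integer += length
    return near_integer / total

def best_purity(segments, clone_fractions, candidates):
    """Candidate purity with the highest integer share (no penalties applied)."""
    return max(candidates,
               key=lambda p: integer_fraction(segments, clone_fractions, p))

# Clone fractions taken from the question: A = 44%, B = 28%.
FRACTIONS = (0.44, 0.28)
CANDIDATES = (0.28, 0.44, 0.72)

# Hypothetical genomes: (length, (CN in clone A, CN in clone B)).
mostly_shared = [
    (50, (2, 2)),  # unaltered in both clones
    (30, (3, 3)),  # gain shared by both clones
    (10, (3, 2)),  # gain private to clone A
    (10, (2, 3)),  # gain private to clone B
]
mostly_private_to_a = [
    (50, (2, 2)),
    (5, (3, 3)),
    (40, (3, 2)),
    (5, (2, 3)),
]

if __name__ == "__main__":
    for name, segs in [("mostly shared", mostly_shared),
                       ("mostly private to A", mostly_private_to_a)]:
        print(name, "-> best candidate purity:",
              best_purity(segs, FRACTIONS, CANDIDATES))
```

In this toy setup, the genome with mostly shared events scores best at 72%, while the genome dominated by events private to clone A scores best at 44%. When neither shared nor private events dominate, no single purity makes most of the genome integer, which matches the case where the fit struggles.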