hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

Negative value of copyNumber in PURPLE #102

Closed Jelisaveta closed 4 years ago

Jelisaveta commented 4 years ago

Hello,

Could you please explain how PURPLE defines amplification and deletion when it comes to the copyNumber parameter? The documentation states that the copyNumber column is the "Fitted absolute copy number of segment adjusted for purity and ploidy", so what would a negative number in this column mean? (see attached)

[Attached screenshot: Screen Shot 2020-07-02 at 11 20 38 AM]

p-priestley commented 4 years ago

The definition is correct. It is the prediction of the absolute copy number, and I agree that a negative value is clearly nonsense.

In this case you have a short region of ~8kb. The likely explanation is that the purity is low and there is very high read depth noise in this region, which means that the GC-normalised coverage is even lower than would be expected if there were a homozygous deletion in this region. This should be rare, though, as PURPLE penalises heavily against negative copy number.
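To illustrate why noisy coverage at low purity can produce a negative fit, here is a minimal sketch using a generic tumor/normal mixture model. It is an assumption for illustration only and is not PURPLE's actual fitting code; the function names, the diploid normal, and the example numbers (purity 0.15, tumor ploidy 2, ratio 0.80) are all hypothetical.

```python
# Generic mixture model (an assumption, NOT PURPLE's implementation):
# the depth ratio of a segment relative to the sample-wide average is
#   ratio = (purity * CN + (1 - purity) * 2) / (purity * ploidy + (1 - purity) * 2)

def expected_ratio(copy_number, purity, tumor_ploidy, normal_ploidy=2.0):
    """Expected normalised depth ratio for a segment with the given tumor copy number."""
    segment = purity * copy_number + (1.0 - purity) * normal_ploidy
    average = purity * tumor_ploidy + (1.0 - purity) * normal_ploidy
    return segment / average


def fitted_copy_number(ratio, purity, tumor_ploidy, normal_ploidy=2.0):
    """Invert the mixture model: tumor copy number implied by an observed ratio."""
    average = purity * tumor_ploidy + (1.0 - purity) * normal_ploidy
    return (ratio * average - (1.0 - purity) * normal_ploidy) / purity


if __name__ == "__main__":
    purity, ploidy = 0.15, 2.0  # hypothetical low-purity sample

    # Lowest ratio the model can explain: a homozygous deletion (CN = 0).
    floor = expected_ratio(0.0, purity, ploidy)
    print("ratio expected for a homozygous deletion:", round(floor, 3))

    # A noisy short region whose ratio dips below that floor
    # can only be explained by a negative copy number.
    print("copy number implied by ratio 0.80:", round(fitted_copy_number(0.80, purity, ploidy), 3))
```

Because the result is divided by the purity, a small dip in coverage below the homozygous-deletion floor is magnified at low purity, which fits the observation that negative values mostly show up in low-purity samples.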

We don't round copy number in PURPLE output, though, and we intentionally don't bound it to 0 in the copy number output (as many other tools do) so that fitting issues such as this are not swept under the rug. For downstream reporting we do assume that a negative copy number region has copy number = 0.

I just checked our database of 5000 samples and ~6% of samples have one or more regions with copyNumber fitted to < -0.5. Nearly all of these have purity < 20% and/or fail our QC checks. The regions are generally clustered around a few problematic sections of the genome (your example is not one of them).
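For anyone post-processing the segment output themselves, the handling described above might be sketched as follows. This is an illustrative assumption rather than PURPLE's own code: the function names are hypothetical, and the -0.5 threshold is simply the one used in the database check above.

```python
# Threshold taken from the database check quoted above; treating it as a
# "suspect fit" flag is an assumption made for this sketch.
SUSPECT_FIT_THRESHOLD = -0.5


def reporting_copy_number(fitted_cn):
    """Copy number for downstream reporting: negative fits are treated as 0."""
    return max(fitted_cn, 0.0)


def has_suspect_fit(segment_copy_numbers):
    """True if any segment was fitted below the suspect threshold."""
    return any(cn < SUSPECT_FIT_THRESHOLD for cn in segment_copy_numbers)


if __name__ == "__main__":
    fitted = [2.01, 1.93, -0.67, 3.10]  # hypothetical per-segment values
    print([reporting_copy_number(cn) for cn in fitted])
    print("suspect fit:", has_suspect_fit(fitted))
```

Keeping the raw fitted value alongside the clamped one preserves the signal that something went wrong with the fit in that region.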

DarioS commented 4 years ago

In terms of defining amplification and deletion, every journal article I read uses a different definition. PURPLE's driver gene analysis also has an as-yet-undocumented way of defining AMP and DEL. There is no convention, and what matters largely depends on whether the gene is transcribed into RNA or not. Whether a gene has a copy number of 2 or 10 could make no difference if it is not expressed in the cell type you're experimenting with. After all, 2 x 0 is the same as 10 x 0. Also, a copy number of 1 might be important if the gene is haploinsufficient, whereas only a copy number of 0 might matter if it is not. A workable definition you could use is the one COSMIC applies to its CNVs. A rule like COSMIC's isn't biologically correct in various situations, but it gives a way to begin answering questions like "what are the most frequently amplified genes in breast cancer".

Often, when numbers are negative, they are small numbers which were log-scaled, but that's not the case with PURPLE's output.

p-priestley commented 4 years ago

For the record, PURPLE uses 3x ploidy as a cutoff for amplification and absolute copy number < 0.5 for homozygous deletion.
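As a rough sketch, the two cutoffs stated above could be applied to a segment like this. The function name and the labels are hypothetical, and treating the amplification cutoff as a strict "greater than" comparison is an assumption; the comment above only gives the thresholds themselves.

```python
def classify_segment(copy_number, sample_ploidy):
    """Label a segment using the cutoffs quoted above.

    copy_number   -- fitted absolute copy number of the segment
    sample_ploidy -- average ploidy of the tumor sample
    """
    # Homozygous deletion: absolute copy number below 0.5.
    # Negative fitted values also land here.
    if copy_number < 0.5:
        return "HOM_DEL"
    # Amplification: copy number above 3x the sample ploidy
    # (strict comparison is an assumption of this sketch).
    if copy_number > 3.0 * sample_ploidy:
        return "AMP"
    return "NONE"


if __name__ == "__main__":
    ploidy = 2.1  # hypothetical sample ploidy
    for cn in (-0.67, 0.3, 2.0, 6.0, 8.5):
        print(cn, classify_segment(cn, ploidy))
```

Note that the amplification threshold scales with the sample's ploidy, so the same absolute copy number can be an amplification in a near-diploid sample but not in a highly polyploid one.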

This is documented in our original article: https://www.nature.com/articles/s41586-019-1689-y#Sec10

We will add this to GitHub at some point.

Jelisaveta commented 4 years ago

Thank you both a lot!