Hello PhyloWGS Devs!
I'm currently trying to run PhyloWGS with copy number variation (CNV) data obtained from whole-exome sequencing. I called CNVs with a tool other than Battenberg or TITAN and am trying to convert those calls into a format similar to the provided `cnv_data.txt`.
However, I am having trouble understanding how to calculate `a`, the number of reference reads covering a given CNV.
Our copy number calls give us the integer copy number of each allele and the prevalence of the CNV, e.g. `(2,1)` with prevalence `0.25`. We also have reference and variant read counts for each SSM.
How would I calculate `a` from the above information? My default presumption is to multiply the CNV prevalence by the total read count, but I was wondering if you had a different recommendation.
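To make my presumption concrete, here is a minimal sketch of what I would do by default. The function name and the read count of 1000 are made up for illustration, and this is only my guess, not something I found in PhyloWGS.

```python
def presumed_reference_reads(cnv_prevalence, total_reads):
    """My guess only: scale the total read count by the CNV prevalence."""
    return int(round(cnv_prevalence * total_reads))


# Prevalence 0.25 is from the example above; 1000 total reads is a made-up value.
a_guess = presumed_reference_reads(0.25, 1000)
print(a_guess)
```

If this is not how `a` is meant to be derived, I would appreciate a pointer to the intended calculation.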
Additionally, I wanted to clarify my understanding of the example `cnv_data.txt` input file provided:
cnv a d ssms physical_cnvs
c0 66023,50883,62757,36056,58777 126755,100469,121941,71263,115417 s2,1,2;s4,0,1 chrom=1,start=1234,end=5678,major_cn=2,minor_cn=1,cell_prev=0.8;chrom=X,start=15,end=10000,major_cn=2,minor_cn=0,cell_prev=0.8;chrom=22,start=123,end=456,major_cn=1,minor_cn=0,cell_prev=0.8
This example shows that SSMs `s2` and `s4` overlap with CNV `c0`.
However, they appear to harbor different major/minor copy numbers: `s2` harbors `(1,2)` whereas `s4` harbors `(0,1)`.
How can the same CNV have two different copy number states in the input?
Furthermore, why does the same CNV have different states in the last column? I see copy number states `(2,1)`, `(2,0)`, and `(1,0)` there, but I thought a single CNV should have a single copy number state.
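For reference, this is roughly how I am reading the example row. It is a sketch under my own assumptions: the fields are whitespace-separated, `ssms` and `physical_cnvs` entries are separated by `;`, and `parse_cnv_row` is my own helper, not part of PhyloWGS.

```python
# The example row from cnv_data.txt, copied from above.
ROW = (
    "c0 66023,50883,62757,36056,58777 126755,100469,121941,71263,115417 "
    "s2,1,2;s4,0,1 "
    "chrom=1,start=1234,end=5678,major_cn=2,minor_cn=1,cell_prev=0.8;"
    "chrom=X,start=15,end=10000,major_cn=2,minor_cn=0,cell_prev=0.8;"
    "chrom=22,start=123,end=456,major_cn=1,minor_cn=0,cell_prev=0.8"
)


def parse_cnv_row(line):
    """Split one data row into its five columns (my reading of the format)."""
    cnv_id, a, d, ssms, physical = line.split()
    return {
        "cnv": cnv_id,
        # One read count per comma-separated entry.
        "a": [int(x) for x in a.split(",")],
        "d": [int(x) for x in d.split(",")],
        # Each entry is an SSM id followed by two integers.
        "ssms": [
            (ssm_id, int(n1), int(n2))
            for ssm_id, n1, n2 in (entry.split(",") for entry in ssms.split(";"))
        ],
        # Each region is a set of key=value pairs; values are kept as strings.
        "physical_cnvs": [
            dict(pair.split("=") for pair in region.split(","))
            for region in physical.split(";")
        ],
    }


parsed = parse_cnv_row(ROW)
states = [(r["major_cn"], r["minor_cn"]) for r in parsed["physical_cnvs"]]
print(parsed["ssms"])
print(states)
```

Parsed this way, `c0` carries two SSM entries with different integer pairs and three regions with three different major/minor states, which is the part I do not understand.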
Thanks!