lculibrk / Ploidetect

Tumour purity, ploidy, and copy number variation from whole-genome sequence data

SNP array #17

Open ywzhang071394 opened 2 months ago

ywzhang071394 commented 2 months ago

Hi,

I saw that Ploidetect requires a SNP array sites file to be defined. Is that used for allele-specific CNV calling? If so, can I replace the array sites with germline SNV sites or common SNPs from dbSNP?

Thanks

lculibrk commented 2 months ago

We include hg19 and hg38 sites files for convenience, and recommend using those. You can in principle use your own sites files, but using a large number of sites is computationally expensive, and we haven't tested larger sets of SNPs, so results may be affected.

SNP positions are used to compute beta allele fractions (BAFs), which are used both in modeling tumor purity/ploidy and in calling allele-specific CNVs.
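As a rough illustration of what a BAF at one SNP site is, the sketch below computes the fraction of reads supporting the non-reference allele. The function name `baf`, the `min_depth` parameter, and the read counts are hypothetical, and Ploidetect's own computation may differ (for example in filtering or in how allele fractions are folded).

```python
from typing import Optional


def baf(ref_count: int, alt_count: int, min_depth: int = 1) -> Optional[float]:
    """Fraction of reads supporting the non-reference allele.

    Returns None (i.e. NA) when the site has fewer than `min_depth` reads.
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return None
    return alt_count / depth


# Hypothetical (ref, alt) read counts at three SNP sites.
sites = [(30, 30), (45, 15), (0, 0)]
bafs = [baf(ref, alt) for ref, alt in sites]
print(bafs)  # [0.5, 0.25, None]
```

Sites with no coverage yield NA, which is one reason a bin can lack a BAF value even when the sites file lists a SNP there.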

ywzhang071394 commented 2 months ago

Thank you for the quick response. We are actually using the T2T reference and aim to profile a comprehensive allele-specific CNV landscape. The SNP array does not cover enough SNP sites and yields many NA BAF values, which is not what we expected.

lculibrk commented 2 months ago

T2T is exciting!

I could imagine issues with (peri)centromeric regions, which are excluded by default. It may require some changes to the program to support T2T CNV calling that does not filter out centromeric regions. Please let us know if that is the case and we can accommodate this.

Typically the SNP array files have been sufficient to cover the vast majority of the segmented genome with at least some BAF values. The only thing you really need is the segmented results: if a segment contains 200 bins, 5 of which have BAF values, those BAF values can be, and are, used to infer the allelic balance of the entire segment. The default cna.txt outputs one line per bin. There should be a cna_condensed.txt file (at least this is output by the Snakemake workflow) that has the aggregated CNV results, one line per segment, with the propagation of BAF values and allele-specific copy number computed.
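The idea of propagating a few bin-level BAFs to a whole segment can be sketched as follows. This is only an illustration under assumptions: it summarizes each segment by the median of its non-NA bin BAFs, which is a stand-in choice and not necessarily how Ploidetect aggregates them. The function name `segment_baf` and the bin data are hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple


def segment_baf(bins: List[Tuple[int, Optional[float]]]) -> Dict[int, Optional[float]]:
    """Map each segment id to the median of its non-missing bin BAFs.

    A segment with no BAF-bearing bins at all maps to None (NA).
    """
    by_segment: Dict[int, List[float]] = {}
    for seg, value in bins:
        values = by_segment.setdefault(seg, [])
        if value is not None:
            values.append(value)
    return {seg: (median(v) if v else None) for seg, v in by_segment.items()}


# Hypothetical bins as (segment id, BAF or None for NA).
bins = [(1, None), (1, 0.5), (1, None), (1, 0.75), (1, 0.25), (2, None), (2, None)]
print(segment_baf(bins))  # {1: 0.5, 2: None}
```

Here segment 1 gets a value even though most of its bins are NA, while segment 2 stays NA because none of its bins carry a BAF. Only the second situation would show up as NA at the segment level.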

When you report that there are many NA BAF values - is this the case in the aggregated segments? If it is, then there may be a bug.

ywzhang071394 commented 2 months ago

Thank you for the reminder. The many NA BAFs appear in the cna.txt file rather than in the condensed file. As for the centromere issue, I am confused, because I have not seen any centromeric exclusions in the "cna_condensed.txt" file. As shown below, chr1 is intact.

chr1 segments from cna_condensed.txt:

chr  segment  pos        end        CN  state  zygosity  segment_depth     A  B
1    1        15092      113288333  1   1      HOM       37.1608200935512  1  0
1    2        113288333  114425828  2   3      HOM       58.2461150838094  2  0
1    3        114425828  115694278  1   1      HOM       39.6268224227761  1  0
1    4        115694278  145394879  2   3      HOM       59.1876098919785  2  0
1    5        145394879  145570495  0   0      HOM       18.9024518788914  0  0
1    6        145570495  145578986  1   1      HOM       35.8922159651138  1  0
1    7        145578986  248373311  2   3      HOM       60.0469238319276  2  0

ywzhang071394 commented 2 months ago

Another question concerns chrX and chrY. Our sample is male, but two copies of chrX were reported. Are there any parameters for this?

lculibrk commented 2 months ago

I looked into why this might be, and it's a bug. It affects all of chrX equally in this case, and will not affect calling of CNV breakpoints or homozygous deletions/amplifications. A workaround for the time being would be to divide the copy number by 2 for chrX.
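A minimal sketch of that workaround, applied to rows already parsed from the per-segment output, might look like this. The column names `chr` and `CN` are taken from the table posted earlier in this thread. The function name `halve_chrx_cn`, the example rows, and the assumption that chrX is labelled either `X` or `chrX` are hypothetical. The sketch only touches `CN`; whether other columns also need adjusting is not stated here.

```python
from typing import Dict, List


def halve_chrx_cn(rows: List[Dict], chr_col: str = "chr", cn_col: str = "CN") -> List[Dict]:
    """Return copies of the rows with the copy number halved on chrX only."""
    fixed_rows = []
    for row in rows:
        fixed = dict(row)  # copy, so the input rows are left untouched
        if str(fixed[chr_col]) in ("X", "chrX"):
            fixed[cn_col] = float(fixed[cn_col]) / 2
        fixed_rows.append(fixed)
    return fixed_rows


# Hypothetical segments: one autosomal, one on chrX.
rows = [
    {"chr": "1", "segment": 1, "CN": 2},
    {"chr": "X", "segment": 1, "CN": 2},
]
corrected = halve_chrx_cn(rows)
print([r["CN"] for r in corrected])  # [2, 1.0]
```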