mskcc / facets

Algorithm to implement Fraction and Copy number Estimate from Tumor/normal Sequencing.

fewer heterozygous variants #183

Open jsha129 opened 1 year ago

jsha129 commented 1 year ago

Dear FACETS team, thank you for developing this tool. I have been getting errors when running emcncf() because of an insufficient number of heterozygous variants. The following command reported 931 'het' variants from a VCF containing 5 samples:

bcftools filter -i "FILTER = 'PASS' & FORMAT/GT = '0/1' & FORMAT/AF > 0.75 & FORMAT/AD > 25" 3_filtered.vcf.gz | grep -v "#" | wc -l

I tried segmentation values of 100, 1000 and 10000 when running the pileup (-g -q15 -Q20 -P100 -r25,0) and get roughly 45 'hets'. The median MQ in the INFO field is 60. Could you please help clarify this, and do you have any suggestions for increasing the number of hets? I tried reducing the values for '-Q' and '-r' and saw a modest improvement. Thanks

veseshan commented 1 year ago

Are you using a targeted panel? Typical whole exome sequencing data will have more than 20k het SNPs. The targeted panel we use has more than 1,000. FACETS uses loci that are sufficiently spaced to avoid serial correlation. I wonder if your panel covers such a limited portion of the genome that you only get 45 hets.
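To illustrate why spacing loci apart lowers the number of usable SNPs, here is a minimal Python sketch of greedy thinning on one chromosome. It is not FACETS's own procedure: the function name `thin_by_spacing`, the 250 bp gap, and the toy positions are all assumptions made up for this example.

```python
def thin_by_spacing(positions, min_gap):
    """Greedy thinning (illustrative only, not the FACETS implementation).

    Keeps a locus only if it lies at least `min_gap` bp after the
    previously kept locus. `positions` are coordinates on one chromosome.
    """
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_gap:
            kept.append(pos)
    return kept


# Toy coordinates, invented for illustration.
dense = [100, 150, 220, 400, 460, 900, 1000, 1300]
sparse = thin_by_spacing(dense, 250)  # 250 bp is a hypothetical gap
print(len(dense), "loci before thinning,", len(sparse), "after")
```

Densely clustered SNPs collapse to a few representatives, so a raw variant count from a VCF will generally be higher than the count of loci that survive spacing.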

jsha129 commented 1 year ago

Thank you for the response. This is WGS.


veseshan commented 1 year ago

Then it could be due to low depth of coverage or a mismatch between the genome build of the BAM and the SNP file.
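One rough way to look for a build mismatch is to compare contig names and lengths in the BAM header (the @SQ lines, e.g. from `samtools view -H`) with the `##contig` lines of the SNP VCF header. The Python sketch below assumes both headers have already been read into lists of strings and that the VCF contig lines carry a `length=` field after `ID=`, which is not always the case. The function names are hypothetical, and matching names alone do not prove the builds agree.

```python
import re


def bam_contigs(header_lines):
    """Map contig name -> length from SAM/BAM @SQ header lines."""
    out = {}
    for line in header_lines:
        if line.startswith("@SQ"):
            tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
            out[tags["SN"]] = int(tags["LN"])
    return out


def vcf_contigs(header_lines):
    """Map contig name -> length from VCF ##contig lines that have a length."""
    out = {}
    for line in header_lines:
        m = re.match(r"##contig=<.*?ID=([^,>]+).*?length=(\d+)", line)
        if m:
            out[m.group(1)] = int(m.group(2))
    return out


def compare_contigs(bam, vcf):
    """Return (shared contig names, shared names whose lengths differ)."""
    shared = sorted(set(bam) & set(vcf))
    mismatched = [c for c in shared if bam[c] != vcf[c]]
    return shared, mismatched


# Toy headers: chr1 is 248956422 bp in hg38 and 249250621 bp in hg19.
bam_hdr = ["@HD\tVN:1.6", "@SQ\tSN:chr1\tLN:248956422"]
vcf_hdr = ["##fileformat=VCFv4.2", "##contig=<ID=chr1,length=249250621>"]
shared, mismatched = compare_contigs(bam_contigs(bam_hdr), vcf_contigs(vcf_hdr))
print("shared:", shared, "length mismatch:", mismatched)
```

An empty `shared` list usually points to a naming difference ("chr1" vs "1"), while a non-empty `mismatched` list suggests the two files come from different builds.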

jsha129 commented 1 year ago

I see, thanks for pointing that out. I used hg38 and have the 1000G SNP file. Is there a way to supply a newer SNP file? I tried preProcSample(rcmat, gbuild = "hg38"), which made no difference. The median NOR.DP is ~100 for the example data vs ~20 for our data. Is that sufficient for CNV calling? Thanks
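To get a feel for how normal depth interacts with het calling, the following Python sketch counts loci that pass a minimum normal depth and fall inside a heterozygous allele-fraction window. Everything here is an assumption for illustration: the function `count_hets`, the 0.25 to 0.75 window, the depth cutoffs of 35 and 15, and the toy read counts. The thresholds FACETS actually applies are described in the preProcSample documentation.

```python
def count_hets(normal_counts, min_depth, vaf_lo=0.25, vaf_hi=0.75):
    """Count loci called heterozygous in the normal sample.

    normal_counts: iterable of (ref_reads, alt_reads) per locus.
    min_depth, vaf_lo, vaf_hi: hypothetical thresholds, not FACETS defaults.
    """
    n = 0
    for ref, alt in normal_counts:
        depth = ref + alt
        if depth == 0 or depth < min_depth:
            continue  # too shallow to call
        vaf = alt / depth
        if vaf_lo <= vaf <= vaf_hi:
            n += 1
    return n


# Toy normal-sample counts at roughly 20x, with two deeper loci.
toy = [(10, 9), (12, 11), (20, 0), (18, 19), (30, 28), (9, 10)]
print("hets at min depth 35:", count_hets(toy, 35))
print("hets at min depth 15:", count_hets(toy, 15))
```

With a median normal depth near 20, a depth cutoff above that value discards most loci before heterozygosity is even assessed, which is one plausible reason for a very low het count.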