millanek / Dsuite

Fast calculation of Patterson's D (ABBA-BABA) and the f4-ratio statistics across many populations/species

compatible VCF format with pool sequencing data option #93

Closed Spork0527 closed 5 months ago

Spork0527 commented 5 months ago

Hi, I have pool-sequencing VCF data generated with VarScan that does not seem to be compatible with the -p option in Dsuite Dtrios. I get an error message saying the AD field was not found, or something similar. Would you mind sending me the format of the pool-sequencing data that you built the -p option on, or explaining how it handles the AD field? I may be able to reformat my dataset a little to make it compatible.

Spork0527 commented 5 months ago

Figured out what happened here. The AD field has to be in the format Reference_Allele_Counts,Alternative_Allele_Counts, just as specified in GATK-format VCFs. For example:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 10001019 . T G 364.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.699;ClippingRankSum=0.00;DP=34;ExcessHet=3.0103;FS=3.064;MLEAC=1;MLEAF=0.500;MQ=42.48;MQRankSum=-3.219e+00;QD=11.05;ReadPosRankSum=-6.450e-01;SOR=0.537 GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480
20 10001298 . T A 884.77 . AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.49;SOR=1.765 GT:AD:DP:GQ:PL 1/1:0,30:30:89:913,89,0
20 10001436 . A AAGGCT 1222.73 . AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=0.836 GT:AD:DP:GQ:PL 1/1:0,28:28:84:1260,84,0
20 10001474 . C T 843.77 . AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=1.302 GT:AD:DP:GQ:PL 1/1:0,27:27:81:872,81,0

In VCFs produced by other programs, such as VarScan, the sample column has separate colon-delimited fields, including DP (read depth), RD (reference depth), and AD (alternative depth). In a GATK VCF, AD instead holds both counts: it is the VarScan RD and AD values combined as RD,AD.

millanek commented 5 months ago

Hi @Spork0527 ... thanks for following up with the solution. Indeed, I wrote this extension using GATK VCFs as test data ;).

Milan