samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
649 stars 240 forks source link

Filter vcf files by PL field #1860

Closed GuoanQi1996 closed 1 year ago

GuoanQi1996 commented 1 year ago

Greetings! I am working with vcf files without GQ field, and i want to set the genotype with low genotype quality (say <20) to missing. Since there is no GQ field, thing could be done by filtering the second largest PL, which to my understanding is exactly the GQ value (GQ = the second largest PL - the lowest PL (always be 0)). Could this be achieved by bcftools? There are MIN and MAX functions can be used in "bcftools filter", while in my case a "MIDDLE" function seems fit. Or is there any idea on how to add the GP fieds? Thanks!

pd3 commented 1 year ago

GQ, in general, is not just the PL difference. There is more math to it, depending which software you are using. For example bcftools call -m does it like this http://samtools.github.io/bcftools/call-m.pdf.

With the existing tools you could create an approximate genotype probability (GP) using the tag2tag plugin

bcftools +tag2tag in.bcf -- --PL-to-GP

and filter based on that.

Another option would be to add a function along the lines you are suggesting, but perhaps making it more general and adding a sort function which would be followed by selecting the element you want, e.g.

SORT(PL)[1] .. the second smallest value
SORT(PL)[0] .. minimum

Not sure how useful that would be and is it worth the effort.

GuoanQi1996 commented 1 year ago

@pd3 Thanks for your reply. Followed your advice firstly I tried, bcftools +tag2tag test.vcf.gz -- --PL-to-GP It returns unrecognized option '--PL-to-GP'. Then I modified the command following the plugin options, bcftools +tag2tag test.vcf.gz -- --pl-to-gp It still error out unrecognized option '--pl-to-gp'. I suspect the version of mine (bcftools 1.11, htslib 1.11) is outdated, but this might be the highest version I could installed on our server, as hardware/system limitation.

Then I use another command as your suggested, bcftools filter -i 'SORT(PL)[1]>20' -S . -Oz -o test.GQ20.vcf.gz test.vcf.gz It errors out Error: the tag "SORT" is not defined in the VCF header.

Fortunately I found another one function MEDIAN that is recongnized in my version, bcftools filter -i 'MEDIAN(PL)>20' -S . -Oz -o test.GQ20.vcf.gz test.vcf.gz

It works fine(the GQ calculation is not that simply as you mentioned, but let's put it aside for now), but when I look into the output file, all site without PASS mark were set to missing, and all sites with PASS mark but have the median of their sample-level PL lower than 20 is unchanged. The bolded records have unexpected behaviour, as shown below.

the record without PASS mark, the last one genotype should be kept but reset to missing.

A01 1581 . G A 116 . DP=1480;VDB=0.878682;SGB=27.2813;RPB=0.976268;MQB=0.234901;MQSB=0.136887;BQB=0.100593;MQ0F=0;ICB=0.999986;HOB=2.80548e-05;AC=0;AN=0;DP4=667,717,2,8;MQ=54 GT:PL ./.:0,12,143 ./.:0,9,119 ./.:0,30,255 ./.:0,9,110 ./.:0,18,202 ./.:0,6,72 ./.:0,27,255

the record with PASS mark, the last one genotype should be set to missing but unchanged.

A01 2062 . G T 421 PASS DP=2609;VDB=0.177598;SGB=95.7709;RPB=0.803956;MQB=4.72637e-07;MQSB=0.902234;BQB=0.410184;MQ0F=0;ICB=0.999986;HOB=2.80548e-05;AC=2;AN=534;DP4=1218,1232,9,16;MQ=50 GT:PL 0/0:0,24,243 0/0:0,33,255 0/0:0,24,212 0/0:0,30,255 0/0:0,48,255 0/0:0,18,202

Any suggestions?

Thanks!

pd3 commented 1 year ago

That's correct, 1.11 is old, we are at 1.16 and 1.17 is imminent. Best to use the latest version http://samtools.github.io/bcftools/howtos/install.html

The function SORT does not exist, that was me thinking how this could be supported.

Regarding the filter results with MEDIAN, it seems it's doing what it's told, isn't it? The median of 0,27,255 is 27 and is bigger than 20, therefore the genotype is set to missing. And the median of 0,18,202 is 18 which is smaller than 20 therefore the genotype is left unchanged.

By the way, there is also the +setGT plugin which gives slightly more flexibility, although that's not relevant in your case and especially since you are using an old version.

GuoanQi1996 commented 1 year ago

Finally I figure it out. What I really want is to filter by GQ/PL per sample, so the applied function should be sMEDIAN rather than MEDIAN. Just for reference my command is bcftools filter -i 'sMEDIAN(PL)>=20' -S . -Oz -o test.GQ20.vcf.gz test.vcf.gz

Thanks for your help! @pd3

GuoanQi1996 commented 1 year ago

@pd3 Hi. Sorry to bother again. I use the following command to mark "missing" for genotype and variants not pass the threshold bcftools filter -i 'QUAL>20 & sMEDIAN(PL)>=20' -S . -Oz -o test.QC20.vcf.gz test.vcf.gz test.vcf.gz then I remove the variants with missing rate over 20% by bcftools filter -i 'F_MISSING<0.2' -Oz -o test.core20.vcf.gz test.QC20.vcf.gz

It worked fine however it filtered most of the variants out. I think I have set the QC thresholds too strict, and simply considered the median of PL as GQ is a stupid idea. So back to the original question, are their any advise on how much QUAL or PL thresholds should be used if I have to filter based on the PL field?

Just for note if helpful, the filtered dataet is merged from multiple sequence panels, all variants were re-called from the raw sequence data using bcftools, and I have normlized the raw separate datasets using bcftools norm bcftools norm --fasta-ref Genome.fa --multiallelics -both --check-ref s -Oz -o data1.norm.vcf.gz data1.vcf.gz

Thank you!

pd3 commented 1 year ago

Unfortunately, I cannot give an advice like that. I don't filter based on FORMAT/PL or GQ values and restrain from giving advice on filtering in general, it's a bit of a black magic and in my experience each data set tends to be unique.