Open biojiangke opened 6 years ago
I believe the difference arises because "bgt import" ONLY imports entries whose FILTER field is "PASS".
I have seen that XXX.bgt.bcf lacks all the variants that have VQSRTrancheXXX in the FILTER field of the original VCF. When I count only PASS variants, the bgt data has slightly more entries (because multiallelic variants are split into atomic ones).
So there should be a HUGE WARNING in the import manual that ONLY PASS variants are imported. Remember that the FILTER, INFO (and maybe ID?) VCF fields are cleared, so afterwards there is no way to distinguish between valid and invalid variants; thus it seems logical to import only variants that are PASS.
I know that technically both the rs number (usually placed in ID) and the FILTER information (PASS, etc.) could be placed into the variant annotation fmf file as an extra tag; however, that wouldn't save space, and at the moment the included JavaScript does not carry it over. On the other hand, it would be nice if the VCF output were more standards-conformant and kept these meaningful fields (included in the bgt bcf and queryable, the way you can query by region, etc.).
Zoltan
When I ran BCFTOOLS and BGT on the same region, BCFTOOLS showed two SNPs but BGT (with the bgt built from the same vcf.gz) returned none. I am not sure how widespread this problem is. Could someone take a look at this?
For example:
bcftools view -r 05:67948215-67948219 xxx.vcf.gz | grep -v "#" | wc -l
2
bgt view -r 05:67948215-67948219 -s sample.list xxx.bgt | grep -v "#" | wc -l
0
sample.list includes all sample names from the VCF header.
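One way to check whether the two missing SNPs are simply non-PASS records is to tally the VCF records in that region by their FILTER value. Below is a minimal sketch, not part of bgt or bcftools: the helper name `count_by_filter` is hypothetical, and it assumes uncompressed, tab-separated VCF text on stdin with FILTER in column 7.

```shell
# Hypothetical helper: tally VCF records by FILTER value (column 7).
# Reads uncompressed VCF text on stdin; header lines starting with '#' are skipped.
count_by_filter() {
  awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' | sort
}

# Possible usage with the file from the report (assumes bcftools is installed):
# bcftools view -r 05:67948215-67948219 xxx.vcf.gz | count_by_filter
```

If neither of the two records in the region is listed as PASS, the empty bgt result would be consistent with the import-time filtering described above.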