samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

subtract common FP entries from hap.py evaluation #1880

Closed Overcraft90 closed 1 year ago

Overcraft90 commented 1 year ago

Hi there I have two VCF files for HG002 benchmarked with hap.py

How can I filter out/subtract from this second VCF file the false positives (FPs) not concerned with the sub-sampling process so that I'm sure in the resulting VCF I look only at errors introduced by sub-sampling? Currently, I'm running this command

bcftools isec -C -o snps_assess.vcf.gz -O z -w 1 subsampling.vcf.gz original.vcf.gz

But it seems to not be working correctly..., for one I cannot get the output VCF in a compressed format; secondly, I'm not sure the command is correct and is doing what is supposed to. Any help is appreciated, thanks in advance!

pd3 commented 1 year ago

This is too vague unfortunately, what makes you think the program is not doing what is supposed to do?

The command

bcftools isec -C A.vcf.gz  B.vcf.gz

will print sites in A not present in B, that's what it does and nothing else.

The compressed VCF should work in all recent versions of bcftools, maybe you are using an old version?

$ bcftools isec -Oz -C A.vcf.gz B.vcf.gz -w 1 -o rmme.vcf.gz
$ htsfile rmme.vcf.gz
rmme.vcf.gz:    VCF version 4.1 BGZF-compressed variant calling data
$ bcftools --version
bcftools 1.17-5-g36f1b0c1
Overcraft90 commented 1 year ago

Hi @pd3 thanks for the prompt reply,

Sorry for the vague description I'm just trying to wrap my head around what's going on. For instance, without the -w flag I get a BED file in a list format, so one question is what the -w does specifically.

Also, I see what you mean when you said that all recent versions of the tool (I'm currently running bcftools 1.16) compress the output file; however, I was getting confused as I've always been running zcat <filename> | less -S for visualising compressed files. But it seems, and I just verified myself today, that a simple less would do. Thanks again.

pd3 commented 1 year ago

The -w option may seem a bit confusing in this context as with -C there is no ambiguity in which of the files should be printed: the one listed on the command line as first, i.e. A.vcf.gz in my example. However, if your command had, say, two files and the -n =2 option, then the -w option can be used to choose which files should be created in the -p directory: the first (-w 1), the second (-w 2), or both (-w 1,2).

I don't understand the comment about zcat vs less, it seems to suggest that you are indeed getting an uncompressed file despite -Oz. I hope this is not an issue, but if yes, please provide a small test case to reproduce the problem.

Overcraft90 commented 1 year ago

Hi again @pd3,

Thanks for the clarification about -w 1. Concerning the (un)compressed output file, below the command I used

bcftools isec -Oz -C /media/scratch/9.sub-sampling_pangenome_graphs/happy_results_hap-samp/eval_snpsFP.vcf.gz /media/scratch/9.sub-sampling_pangenome_graphs/happy_results_filtered/eval_snpsFP.vcf.gz -w 1 -o igv_snpsFP.vcf.gz

Also, I attach the VCF files I'm trying to get the complement of as well as the output; let me know whether this help, thanks again. eval_snpsFP_filtered-graph.vcf.gz eval_snpsFP_hapsamp-graph.vcf.gz igv_snpsFP.vcf.gz

P.S. These are the complement files only for the FPs analysis for the hap.py evaluation, for the FNs I followed exactly the same procedure; also, I didn't include indexes which are quite fast to generate

pd3 commented 1 year ago

I am not finding a problem

$ bcftools isec -Oz -C eval_snpsFP_filtered-graph.vcf.gz eval_snpsFP_hapsamp-graph.vcf.gz -w 1 -o rmme.vcf.gz
$ htsfile rmme.vcf.gz
rmme.vcf.gz:    VCF version 4.1 BGZF-compressed variant calling data

In this specific scenario this works even if one leaves -Oz out as -o recognizes the desired type from the file name suffix

$ bcftools isec -C eval_snpsFP_filtered-graph.vcf.gz eval_snpsFP_hapsamp-graph.vcf.gz -w 1 -o rmme.vcf.gz
$ htsfile rmme.vcf.gz
rmme.vcf.gz:    VCF version 4.1 BGZF-compressed variant calling data