Closed thirtyeggs closed 9 years ago
Thanks for your question, @thirtyeggs!
I think bcftools isec
is intended as a lower-level command that can be used to perform various kinds of set operations on sets of VCF files (i.e. intersection, union, complement, etc). It does not have the same functionality as vcf-compare
, which is primarily to report on a comparison between two files.
In bcftools
, the command that would be most similar to vcf-compare
in this instance would be bcftools stats -c all a.vcf.gz b.vcf.gz
. That said, the output is not in exactly the same format, but requires some parsing to compare it.
For example, when I run vcf-compare isec.a.vcf.gz isec.b.vcf.gz
on some of the example VCF data from the bcftools tests, I get:
# This file was generated by vcf-compare.
# The command line was: vcf-compare(r840) isec.a.vcf.gz isec.b.vcf.gz
#
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are:
#VN 1 .. number of sites unique to this particular combination of files
#VN 2- .. combination of files and space-separated number, a fraction of sites in the file
VN 2 isec.b.vcf.gz (22.2%)
VN 3 isec.a.vcf.gz (30.0%)
VN 7 isec.a.vcf.gz (70.0%) isec.b.vcf.gz (77.8%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN Number of REF matches: 3
SN Number of ALT matches: 3
SN Number of REF mismatches: 4
SN Number of ALT mismatches: 0
SN Number of samples in GT comparison: 0
# Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file
SN Number of lost sites: 2 16.7% 12 10 isec.a.vcf.gz
SN Number of lost sites: 2 18.2% 11 9 isec.b.vcf.gz
Whereas with bcftools stats -c all isec.a.vcf.gz isec.b.vcf.gz
, the beginning of the output I get is:
# This file was produced by bcftools stats (1.2+htslib-1.2) and can be plotted using plot-vcfstats.
# The command line was: bcftools stats -c all isec.a.vcf.gz isec.b.vcf.gz
#
# Definition of sets:
# ID [2]id [3]tab-separated file names
ID 0 isec.a.vcf.gz
ID 1 isec.b.vcf.gz
ID 2 isec.a.vcf.gz isec.b.vcf.gz
# SN, Summary numbers:
# SN [2]id [3]key [4]value
SN 0 number of samples: 1
SN 1 number of samples: 1
SN 0 number of records: 3
SN 0 number of SNPs: 0
SN 0 number of MNPs: 0
SN 0 number of indels: 3
SN 0 number of others: 0
SN 0 number of multiallelic sites: 2
SN 0 number of multiallelic SNP sites: 0
SN 1 number of records: 2
SN 1 number of SNPs: 0
SN 1 number of MNPs: 0
SN 1 number of indels: 2
SN 1 number of others: 0
SN 1 number of multiallelic sites: 1
SN 1 number of multiallelic SNP sites: 0
SN 2 number of records: 7
SN 2 number of SNPs: 0
SN 2 number of MNPs: 0
SN 2 number of indels: 7
SN 2 number of others: 0
SN 2 number of multiallelic sites: 1
SN 2 number of multiallelic SNP sites: 0
...
In the vcf-compare
output, the numbers in the VN
lines correspond to the associated SN
lines in the bcftools stats
output. For example, vcf-compare
says I have 7 variants in common between the two files on line 10 and bcftools stats
agrees that there are 7 variants in common (ID 2; line 27).
I am honestly not quite sure about how to obtain the other vcf-compare
output in bcftools stats
- perhaps @pd3 would be able to provide additional documentation on how to achieve comparable results to vcf-compare
using bcftools stats
(perhaps add it to http://vcftools.sourceforge.net/perl_module.html#vcf-compare?)
In the meantime, I think this is "just" a documentation bug and not a functional problem with bcftools
. If you disagree and think that there is a bug, it would be very helpful if you could provide a minimal example (i.e. post links to some VCF files and command-lines that exhibit the behaviour you are seeing and how it differs from what you were expecting).
Cheers,
Josh.
Hi Josh,
Thank you for your clarification. You are correct. I could get the same numbers when I used "bcftools stats" instead of "bcftools isec". However, I need to generate three variant sets (unique to each set and intersection) and I don't know how to do it. I generated four variant sets 0000-0003.vcf using "bcftools isec -p . a.vcf.gz b.vcf.gz" and saw the statistics of each file using "bcftools stats -c all 000X.vcf". The reports from the commands were not matched with the numbers in "bcftools stats -c all a.vcf.gz b.vcf.gz". Therefore, I cannot use 000X.vcf files for my purpose.
You can download the vcf files that I used from https://drive.google.com/folderview?id=0B_r_jotTj8w9flJWLVQyMzNTSTNsZV9aM1ZLX1U2RUxuS2loMkMyVk4wYTUzUFlHcGNHUXc&usp=sharing
Thank you again for your support!
Yun
Hi Yun, thank you for the test files.
The -c
option controls how should be treated records with different variant types. For example, if the first file has a SNP and the second has an indel at the same position, should they be counted as matching records? The -c all
will match by position only, not taking the alleles into account; -c indels
will require exact match for SNPs but will collapse indels; -c both
will attempt to find SNP and indel records in both files and match them regardless of the ALT allele.
The default is -c none
, neither SNPs nor indels will be "collapsed", the same ALT allele must be at the same position in both files in order for the record to be flagged as shared by both files.
Therefore, to get the same numbers, you need to run with the same -c
option for both commands. Usually it makes sense to do the comparison for SNPs and indels separately, so one could run it as
$bt stats -c both a.vcf.gz b.vcf.gz
$bt isec -c both -p rmme/ a.vcf.gz b.vcf.gz
Hi @pd3,
I understand. When I used "-c both" both for stats and for isec, I could get the same numbers from them. It seems that vcf-compare compares vcf files as bcftools does with "-c all". I will use "bcftools stats -c both" instead of vcf-compare. Thank you very much your explanation. I appreciate that.
Thanks, Yun
OK, glad this is resolved.
Hi,
I compared two vcf files using "bcftools isec" and vcf-compare and I am seeing different results. Can you explain why? I am using bcftools 1.2.0 and vcftools 0.1.12b. The command line options that I used are:
vcf-comare a.vcf.gz b.vcf.gz bcftools isec -p . a.vcf.gz b.vcf.gz
Thank you for your help in advance!