samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
682 stars 240 forks source link

mismatch between "bcftools isec" and vcf-compare #267

Closed thirtyeggs closed 9 years ago

thirtyeggs commented 9 years ago

Hi,

I compared two vcf files using "bcftools isec" and vcf-compare and I am seeing different results. Can you explain why? I am using bcftools 1.2.0 and vcftools 0.1.12b. The command line options that I used are:

vcf-comare a.vcf.gz b.vcf.gz bcftools isec -p . a.vcf.gz b.vcf.gz

Thank you for your help in advance!

jrandall commented 9 years ago

Thanks for your question, @thirtyeggs!

I think bcftools isec is intended as a lower-level command that can be used to perform various kinds of set operations on sets of VCF files (i.e. intersection, union, complement, etc). It does not have the same functionality as vcf-compare, which is primarily to report on a comparison between two files.

In bcftools, the command that would be most similar to vcf-compare in this instance would be bcftools stats -c all a.vcf.gz b.vcf.gz. That said, the output is not in exactly the same format, but requires some parsing to compare it.

For example, when I run vcf-compare isec.a.vcf.gz isec.b.vcf.gz on some of the example VCF data from the bcftools tests, I get:

# This file was generated by vcf-compare.
# The command line was: vcf-compare(r840) isec.a.vcf.gz isec.b.vcf.gz
#
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are:
#VN        1  .. number of sites unique to this particular combination of files
#VN        2- .. combination of files and space-separated number, a fraction of sites in the file
VN      2       isec.b.vcf.gz (22.2%)
VN      3       isec.a.vcf.gz (30.0%)
VN      7       isec.a.vcf.gz (70.0%)   isec.b.vcf.gz (77.8%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN      Number of REF matches:  3
SN      Number of ALT matches:  3
SN      Number of REF mismatches:       4
SN      Number of ALT mismatches:       0
SN      Number of samples in GT comparison:     0
# Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file
SN      Number of lost sites:   2       16.7%   12      10      isec.a.vcf.gz
SN      Number of lost sites:   2       18.2%   11      9       isec.b.vcf.gz

Whereas with bcftools stats -c all isec.a.vcf.gz isec.b.vcf.gz, the beginning of the output I get is:

# This file was produced by bcftools stats (1.2+htslib-1.2) and can be plotted using plot-vcfstats.
# The command line was: bcftools stats  -c all isec.a.vcf.gz isec.b.vcf.gz
#
# Definition of sets:
# ID    [2]id   [3]tab-separated file names
ID      0       isec.a.vcf.gz
ID      1       isec.b.vcf.gz
ID      2       isec.a.vcf.gz   isec.b.vcf.gz
# SN, Summary numbers:
# SN    [2]id   [3]key  [4]value
SN      0       number of samples:      1
SN      1       number of samples:      1
SN      0       number of records:      3
SN      0       number of SNPs: 0
SN      0       number of MNPs: 0
SN      0       number of indels:       3
SN      0       number of others:       0
SN      0       number of multiallelic sites:   2
SN      0       number of multiallelic SNP sites:       0
SN      1       number of records:      2
SN      1       number of SNPs: 0
SN      1       number of MNPs: 0
SN      1       number of indels:       2
SN      1       number of others:       0
SN      1       number of multiallelic sites:   1
SN      1       number of multiallelic SNP sites:       0
SN      2       number of records:      7
SN      2       number of SNPs: 0
SN      2       number of MNPs: 0
SN      2       number of indels:       7
SN      2       number of others:       0
SN      2       number of multiallelic sites:   1
SN      2       number of multiallelic SNP sites:       0
...

In the vcf-compare output, the numbers in the VN lines correspond to the associated SN lines in the bcftools stats output. For example, vcf-compare says I have 7 variants in common between the two files on line 10 and bcftools stats agrees that there are 7 variants in common (ID 2; line 27).

I am honestly not quite sure about how to obtain the other vcf-compare output in bcftools stats - perhaps @pd3 would be able to provide additional documentation on how to achieve comparable results to vcf-compare using bcftools stats (perhaps add it to http://vcftools.sourceforge.net/perl_module.html#vcf-compare?)

In the meantime, I think this is "just" a documentation bug and not a functional problem with bcftools. If you disagree and think that there is a bug, it would be very helpful if you could provide a minimal example (i.e. post links to some VCF files and command-lines that exhibit the behaviour you are seeing and how it differs from what you were expecting).

Cheers,

Josh.

thirtyeggs commented 9 years ago

Hi Josh,

Thank you for your clarification. You are correct. I could get the same numbers when I used "bcftools stats" instead of "bcftools isec". However, I need to generate three variant sets (unique to each set and intersection) and I don't know how to do it. I generated four variant sets 0000-0003.vcf using "bcftools isec -p . a.vcf.gz b.vcf.gz" and saw the statistics of each file using "bcftools stats -c all 000X.vcf". The reports from the commands were not matched with the numbers in "bcftools stats -c all a.vcf.gz b.vcf.gz". Therefore, I cannot use 000X.vcf files for my purpose.

You can download the vcf files that I used from https://drive.google.com/folderview?id=0B_r_jotTj8w9flJWLVQyMzNTSTNsZV9aM1ZLX1U2RUxuS2loMkMyVk4wYTUzUFlHcGNHUXc&usp=sharing

Thank you again for your support!

Yun

pd3 commented 9 years ago

Hi Yun, thank you for the test files.

The -c option controls how should be treated records with different variant types. For example, if the first file has a SNP and the second has an indel at the same position, should they be counted as matching records? The -c all will match by position only, not taking the alleles into account; -c indels will require exact match for SNPs but will collapse indels; -c both will attempt to find SNP and indel records in both files and match them regardless of the ALT allele. The default is -c none, neither SNPs nor indels will be "collapsed", the same ALT allele must be at the same position in both files in order for the record to be flagged as shared by both files.

Therefore, to get the same numbers, you need to run with the same -c option for both commands. Usually it makes sense to do the comparison for SNPs and indels separately, so one could run it as

$bt stats -c both a.vcf.gz b.vcf.gz
$bt isec -c both -p rmme/ a.vcf.gz b.vcf.gz
thirtyeggs commented 9 years ago

Hi @pd3,

I understand. When I used "-c both" both for stats and for isec, I could get the same numbers from them. It seems that vcf-compare compares vcf files as bcftools does with "-c all". I will use "bcftools stats -c both" instead of vcf-compare. Thank you very much your explanation. I appreciate that.

Thanks, Yun

pd3 commented 9 years ago

OK, glad this is resolved.