samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
672 stars 240 forks source link

isec: miss-classification of duplicates #665

Open landesfeind opened 7 years ago

landesfeind commented 7 years ago

Hi, the isec sub tool does not place duplicates in the correct output. Please find below a detailed description to reproduce the issue.

I see that duplicates are generally a problematic issue and that my scenario might look somewhat arbitrary. But I believe this to happen more often if one integrates data from different sources that is putatively noise.

If the behavior of bcftools is on purpose or if you see this issue to be not bcftools' responsibility, please consider putting a warning about duplicates in the documentation.

Thank you for your hard work on the samtools/bcftools tools, Manuel

Input input1.vcf.gz

#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT     Sample1
chr1    1    .   A    T    .     .       .     SOME_DATA  1
chr1    10   .   C    G    .     .       .     SOME_DATA  2

input2.vcf

#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT     Sample2
chr1    1    .   A    C    .     .       .     SOME_DATA  3
chr1    10   .   C    G    .     .       .     SOME_DATA  4
chr1    10   .   C    G    .     .       .     SOME_DATA  5
chr1    10   .   C    T    .     .       .     SOME_DATA  6

Note: Sample2 contains two times that same variant with different data (4 and 5). These variants overlap with variant 2 in Sample1

Command line

bcftools isec -p test_out test.vcf.gz test2.vcf.gz

Note: the order of VCF files is irrelevant and the error is reproducible when changing the input order.

Result

#test_out/0000.vcf  for records private to  test.vcf.gz
test_out/0000.vcf:#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT     Sample1
test_out/0000.vcf:chr1    1    .   A    T    .     .       .     SOME_DATA  1

#test_out/0001.vcf  for records private to  test2.vcf.gz
test_out/0001.vcf:#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT     Sample2
test_out/0001.vcf:chr1    1    .   A    C    .     .       .     SOME_DATA  3
test_out/0001.vcf:chr1    10   .   C    G    .     .       .     SOME_DATA  5
test_out/0001.vcf:chr1    10   .   C    T    .     .       .     SOME_DATA  6

#test_out/0002.vcf  for records from test.vcf.gz shared by both
test_out/0002.vcf:#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT     Sample1
test_out/0002.vcf:chr1    10   .   C    G    .     .       .     SOME_DATA  2

#test_out/0003.vcf  for records from test2.vcf.gz shared by both
test_out/0003.vcf:#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT     Sample2
test_out/0003.vcf:chr1    10   .   C    G    .     .       .     SOME_DATA  4

Error description Sample2 variant 5 is called to be private for Sample2 even though it overlaps with Sample1 variant 2

The issue is reproducible with bcftools version 1.4 and 1.5

pd3 commented 7 years ago

The example suggests there are two separate issues. One is how to deal with multiallelic sites, i.e. there are two records with the same position in the two input files but with different alternate alleles. Such cases are supported, please check the -c option (documentation here http://samtools.github.io/bcftools/bcftools-man.html#common_options).

But then there is the other issue of having two records with the same position and the same alt allele in one file. Unfortunately, only one will be matched correctly. Such cases are not handled well. If you really need to keep the duplicates (I can't imagine why?), a workaround would be to create a copy with the duplicates removed (e.g. with the norm command), then generating a list of desired sites, and finally sub-setting the original VCF by position.

landesfeind commented 7 years ago

Thank you for your comment. I agree on the handling of multi-allelics and in my perspective isec handles them very well, i.e., chr1:10C>T is not overlapping.

In my particular example, I wanted to keep the duplicates because makes the downstream processing more easy and I can just use 'bcftools query' and then counting using uniq. Please see the following excerpt:

chr1    139038  .       T       C       .       .       CELLLINE=HMEL_BREAST;EFFECT=AL627309.1:c.272A>G:p.H91R
chr1    139038  .       T       C       .       .       CELLLINE=JVM3_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE;EFFECT=AL627309.1:c.272A>G:p.H91R
chr1    139038  .       T       C       .       .       CELLLINE=LN443_CENTRAL_NERVOUS_SYSTEM;EFFECT=AL627309.1:c.272A>G:p.H91R
chr1    139038  .       T       C       .       .       CELLLINE=MEG01_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE;EFFECT=AL627309.1:c.272A>G:p.H91R
chr1    139038  .       T       C       .       .       CELLLINE=NCIH209_LUNG;EFFECT=AL627309.1:c.272A>G:p.H91R

As duplicates are not reported correctly, I would need to combine the INFO tags first and later on split them again.

I agree that a workaround is possible but suppressing the variant in the overlap as currently done by isec just did not seem to be correct for me either. The latter was the reason for reporting here.

pd3 commented 7 years ago

OK, I understand. The behavior of the program is indeed incorrect and it would be nice to have it fixed. Thank you for reporting the problem.