Open landesfeind opened 7 years ago
The example suggests there are two separate issues. One is how to deal with multiallelic sites, i.e. there are two records with the same position in the two input files but with different alternate alleles. Such cases are supported, please check the -c
option (documentation here http://samtools.github.io/bcftools/bcftools-man.html#common_options).
But then there is the other issue of having two records with the same position and the same alt allele in one file. Unfortunately, only one will be matched correctly. Such cases are not handled well. If you really need to keep the duplicates (I can't imagine why?), a workaround would be to create a copy with the duplicates removed (e.g. with the norm
command), then generating a list of desired sites, and finally sub-setting the original VCF by position.
Thank you for your comment. I agree on the handling of multi-allelics and in my perspective isec handles them very well, i.e., chr1:10C>T is not overlapping.
In my particular example, I wanted to keep the duplicates because makes the downstream processing more easy and I can just use 'bcftools query' and then counting using uniq. Please see the following excerpt:
chr1 139038 . T C . . CELLLINE=HMEL_BREAST;EFFECT=AL627309.1:c.272A>G:p.H91R
chr1 139038 . T C . . CELLLINE=JVM3_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE;EFFECT=AL627309.1:c.272A>G:p.H91R
chr1 139038 . T C . . CELLLINE=LN443_CENTRAL_NERVOUS_SYSTEM;EFFECT=AL627309.1:c.272A>G:p.H91R
chr1 139038 . T C . . CELLLINE=MEG01_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE;EFFECT=AL627309.1:c.272A>G:p.H91R
chr1 139038 . T C . . CELLLINE=NCIH209_LUNG;EFFECT=AL627309.1:c.272A>G:p.H91R
As duplicates are not reported correctly, I would need to combine the INFO tags first and later on split them again.
I agree that a workaround is possible but suppressing the variant in the overlap as currently done by isec just did not seem to be correct for me either. The latter was the reason for reporting here.
OK, I understand. The behavior of the program is indeed incorrect and it would be nice to have it fixed. Thank you for reporting the problem.
Hi, the isec sub tool does not place duplicates in the correct output. Please find below a detailed description to reproduce the issue.
I see that duplicates are generally a problematic issue and that my scenario might look somewhat arbitrary. But I believe this to happen more often if one integrates data from different sources that is putatively noise.
If the behavior of bcftools is on purpose or if you see this issue to be not bcftools' responsibility, please consider putting a warning about duplicates in the documentation.
Thank you for your hard work on the samtools/bcftools tools, Manuel
Input input1.vcf.gz
input2.vcf
Note: Sample2 contains two times that same variant with different data (4 and 5). These variants overlap with variant 2 in Sample1
Command line
Note: the order of VCF files is irrelevant and the error is reproducible when changing the input order.
Result
Error description Sample2 variant 5 is called to be private for Sample2 even though it overlaps with Sample1 variant 2
The issue is reproducible with bcftools version 1.4 and 1.5