Closed BinglanLi closed 2 months ago
Merging of variants can be quite complex. There can be multiple lines with the same position and different combinations of alleles in each file. However, here it seems quite straightforward.
In the first case, we know that the genotype of the background sample at the indel position 94949281 is GT=0/0, i.e. no indel. However, the VCF carries no information about position 94949282, so bcftools inserts the unknown/missing value GT=./. there.
A cure for this would be to work with gVCFs. Then you have some information for every position of the genome, and cases like this could be resolved.
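To make the gVCF point concrete, here is a minimal Python sketch of the idea. It is only a conceptual model, not how bcftools processes gVCFs: the function name `genotype_at`, the tuple layout, and the block boundaries 94949200..94949300 are all made up for illustration. A gVCF reference block spans a range of positions, so a lookup at 94949282 finds a genotype, whereas a plain VCF with a single record at 94949281 has nothing to report there.

```python
def genotype_at(records, pos):
    """Return the genotype covering `pos`, or "./." if no record covers it.

    `records` is a list of (start, end, genotype) tuples with 1-based,
    inclusive coordinates. A plain VCF record covers one position;
    a gVCF reference block covers a whole range of positions.
    """
    for start, end, genotype in records:
        if start <= pos <= end:
            return genotype
    return "./."  # nothing is known about this position


# Plain VCF: one record at the indel position only.
plain_vcf = [(94949281, 94949281, "0/0")]

# gVCF: a hypothetical reference block (boundaries invented for this example).
gvcf = [(94949200, 94949300, "0/0")]

plain_result = genotype_at(plain_vcf, 94949282)
gvcf_result = genotype_at(gvcf, 94949282)
print(plain_result, gvcf_result)
```

With the plain VCF the lookup falls through to the missing value, while the gVCF block supplies a homozygous reference call for the same position.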
Also there is the option --missing-to-ref, which inserts the reference genotype 0/0 in such circumstances.
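The effect of that option can be illustrated with a toy merge in Python. This is a simplified sketch, not the actual bcftools implementation: `merge_sites` and its `missing_to_ref` parameter are hypothetical names, sites are keyed by position only (real merging also considers alleles), and the 0/0 genotypes given to Sample_1 and Sample_2 are placeholder values, since the real file contents are not reproduced here.

```python
def merge_sites(files, missing_to_ref=False):
    """Merge per-file genotype tables into one table covering all samples.

    `files` is a list of dicts mapping (chrom, pos) -> {sample: genotype}.
    A sample whose file has no record at a site gets "./." by default,
    or "0/0" when `missing_to_ref` is True.
    """
    fill = "0/0" if missing_to_ref else "./."
    samples = sorted({s for f in files for gts in f.values() for s in gts})
    sites = sorted({site for f in files for site in f})
    merged = {}
    for site in sites:
        row = {sample: fill for sample in samples}  # start from the fill value
        for f in files:
            row.update(f.get(site, {}))  # overwrite with genotypes actually present
        merged[site] = row
    return merged


# Toy inputs: the background file has a record at the indel position only,
# the other file has a record at the following position only.
background = {("chr10", 94949281): {"background_sample": "0/0"}}
other = {("chr10", 94949282): {"Sample_1": "0/0", "Sample_2": "0/0"}}

default_merge = merge_sites([background, other])
ref_merge = merge_sites([background, other], missing_to_ref=True)

print(default_merge[("chr10", 94949282)]["background_sample"])
print(ref_merge[("chr10", 94949282)]["background_sample"])
```

In this model the default merge leaves background_sample missing at chr10:94949282, and the `missing_to_ref=True` variant fills it with 0/0. Note that the same fill is applied to Sample_1 and Sample_2 at chr10:94949281, which may or may not be what you want.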
I hope this helps!
I noticed an edge case with bcftools merge. bcftools seems to have different behaviors when an INDEL is merged with a homozygous reference position (denoted by unspecified alleles: ., <*>, or <NON-REF>).
Test files are attached. Below are the file contents:
Case 1. Background + all_missing
When the background VCF is merged with all_missing.vcf.gz, it seems correct that Sample_1 and Sample_2 have missing genotypes for the INDEL variant. Nonetheless, it seems wrong to say that the background_sample is ./. at chr10:94949282.
Case 2. Background + partial_missing
When the background VCF is merged with partial_missing.vcf.gz, I expected the result to be the same as in Case 1 because the genotypes at chr10:94949282 are unclear for these samples. However, I got the following results:
Case 3. Background + partial_present
The file partial_present.vcf.gz only provides the genotypes at chr10:94949281.
My questions are:
Test.zip