samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
634 stars 241 forks source link

bcftools merge assumed reference for INDELs #2163

Closed BinglanLi closed 2 months ago

BinglanLi commented 2 months ago

I noticed an edge case with bcftools merge. bcftools seems to have different behaviors When an INDEL is merged with a homozygous reference position (denoted by unspecified alleles, ., <*>, or <NON-REF>).

Test files are attached. Below are the file contents:

# A file to be merge against that has an INDEL
$ bcftools view background.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  background_sample
chr10   94949281    rs9332131   GA  G   .   PASS    .   GT  0/0

# A file with missing genotypes at the INDEL positions
$ bcftools view all_missing.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample_1    Sample_2
chr10   94949281    .   G   .   .   PASS    .   GT  ./. ./.
chr10   94949282    .   A   .   .   PASS    .   GT  ./. ./. 

# A file with partial missing genotypes at the INDEL positions
$ bcftools view partial_missing.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample_1    Sample_2
chr10   94949281    .   G   .   .   PASS    .   GT  0/0 0/0
chr10   94949282    .   A   .   .   PASS    .   GT  ./. ./.

# A file with part of the genotypes at the INDEL positions
$ bcftools view partial_present.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample_1    Sample_2
chr10   94949281    .   G   .   .   PASS    .   GT  0/0 0/0

# test version
$ bcftools -v                         
bcftools 1.20
Using htslib 1.20
Copyright (C) 2024 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Case 1. Background + all_missing

When the background VCF is merged with all_missing.vcf.gz, it seems correct that Sample_1 and Sample_2 have missing genotypes for the INDEL variant. Nonetheless, it seems wrong to say that the background_sample is ./. at chr10:94949282.

$ bcftools merge background.vcf.gz all_missing.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  background_sample   Sample_1    Sample_2
chr10   94949281    rs9332131   GA  G   .   PASS    .   GT  0/0 ./. ./.
chr10   94949282    .   A   .   .   PASS    .   GT  ./. ./. ./.

Case 2. Background + partial_missing

When the background VCF is merged with partial_missing.vcf.gz, I expected the result to be the same as case 1 because the genotypes at chr10:94949282 are unclear for these samples. However, I got the following results:

$ bcftools merge background.vcf.gz partial_missing.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  background_sample   Sample_1    Sample_2
chr10   94949281    rs9332131   GA  G   .   PASS    .   GT  0/0 0/0 0/0
chr10   94949282    .   A   .   .   PASS    .   GT  ./. ./. ./.

Case 3. Background + partial_present

Partial_present.vcf.gz only tells you the genotypes at chr10:94949281.

$ bcftools merge background.vcf.gz partial_present.vcf.gz
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  background_sample   Sample_1    Sample_2
chr10   94949281    rs9332131   GA  G   .   PASS    .   GT  0/0 0/0 0/0

My questions are:

  1. Shouldn't cases 1 and 2 have the same merge results?
  2. What should people expect for case 3 when only part of an INDEL position is present in the other VCF?

Test.zip

pd3 commented 2 months ago

Merging of variants can be quite complex. There can be multiple lines with the same position and different combination of alleles in each file. However here it seems quite straightforward.

In the first case, we know that the genotype of the background sample at the indel position 94949281 is GT=0/0, i.e. no indel. However, the VCF carries no information regarding the position 94949282, therefore it inserts unknown/missing value GT=./..

A cure for this would be to work with gVCFs, then you have some information for every position of the genome and cases like this could be resolved.

Also there is the option --missing-to-ref which inserts reference genotype 0/0 in such circumstances.

I hope this helps!