samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
673 stars 240 forks source link

bcftools merge error: REF prefixes differ #988

Closed ksw9 closed 5 years ago

ksw9 commented 5 years ago

Hi, I am trying to merge VCFs across samples with bcftools merge and receiving the error: The REF prefixes differ: AGGGG vs ACGGCGGC (5,8). Indeed, in the single sample calls, the REF alleles do differ. Is there a best approach to dealing with this? I ran bcftools norm before attempting to merge the VCFs.

Thank you in advance!

Version information: bcftools 1.9-105-gaf6f0c9 Using htslib 1.9-118-g2da4c7d

pd3 commented 5 years ago

The basic assumption of the bcftools merge tool is that the input VCFs represent variants with respect to the same reference. In other words, the REF columns must match. Try bcftools norm --check-ref e --fasta-ref <ref.fa> to check that this is the case. With bcftools norm -c x it is possible to exclude mismatching records. However, you should make sure you understand the reasons of the discrepancies to prevent random nonsense results.

ksw9 commented 5 years ago

Fantastic, thank you for your help! It turns out I was using bcftools norm with an older version of bcftools. Problem resolved. Thank you for the quick help! Best,

ksw9 commented 5 years ago

Hi, I've just realized that an error persists for me when I use bcftools norm. When I use bcftools norm, I find strange behavior around an indel. This changes the positions in such a way that position 29483 comes before 29482. Any suggestions for how to deal with this would be great. Thank you!

Versions: bcftools 1.9-105-gaf6f0c9 Using htslib 1.9-118-g2da4c7d

# Original file
NC_000962.3     26960   .       C       .       .       .       END=29483;MinDP=2       GT:DP   0:2
NC_000962.3     29484   .       AA      A       228     .       AC=1;ADF=5,57;ADR=2,44;AN=1;DP=108;DP4=5,2,57,44;IDV=101;IMF=0.935185;INDEL;MQ=60;MQ0F=0;MQSB=1;QD=2.11;SGB=-0.693147;VDB=0.997587      GT:AD:DP:PL:SP  1:7,101:108:255,0:2

# After ${BCFTOOLS}  norm --fasta-ref ${ref} ${ann_vcf} -O z -o ${norm_vcf}
# Lines   total/split/realigned/skipped:    3169/0/150/0

NC_000962.3     26960   .       C       .       .       .       END=29483;MinDP=2       GT:DP   0:2
NC_000962.3     29482   .       CA      C       228     .       AC=1;ADF=5,57;ADR=2,44;AN=1;DP=108;DP4=5,2,57,44;IDV=101;IMF=0.935185;INDEL;MQ=60;MQ0F=0;MQSB=1;QD=2.11;SGB=-0.693147;VDB=0.997587      GT:AD:DP:PL:SP  1:7,101:108:255,0:2
pd3 commented 5 years ago

This is the intended behavior, the program trims and left-aligns indels.

ksw9 commented 5 years ago

Okay, thanks - is there then a best practice for how to resolve the issue that the homozygous - ref block ends after the next indel? Ie the block from 26960 ends at 29483 and the next variant is at 29482? Because using this VCF, I am unable to merge and get errors that the VCF is out of order.

Thank you!

pd3 commented 5 years ago

The result from bcftools norm should be sorted. If not, it should be considered a bug. However, I assume bcftools are not to be blamed for the unsorted records given the other discrepancies in the data you reported above. You can use bcftools sort to sort the records.

ksw9 commented 5 years ago

Excellent, thanks again for helping me sort out this problem.

srbehera commented 1 year ago

@ksw9 just checking if you remember how did you fix it? I tried with both --check-ref s --fasta-ref <ref.fa> and --check-ref x --fasta-ref <ref.fa>, but it gives me the following error, and the merged file contains only variants from Chr1.

Warning: The row size is too big for FORMAT/PL at chr1:23899810, requires 12204615120 bytes, skipping.
The REF prefixes differ: C vs N (1,1)
Failed to merge alleles at chr1:53128382 in HG01369.vcf.gz
Lines   total/split/realigned/skipped:  62437/4488/0/0
REF/ALT total/modified/added:   88924/0/688