samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

Issue with Reference Allele Mismatch in bcftools norm,Reference allele mismatch ... #2037

Closed karen916 closed 9 months ago

karen916 commented 10 months ago

Dear bcftools Developers,

I am encountering a reference allele mismatch while using bcftools norm to normalize a VCF file. I am working with structural variant (SV) calls detected by Lumpy, Manta, and Delly and merged with SURVIVOR. My command is as follows:

bcftools norm -f /home/chenzhaojin/sv_test/GCF_000003025.6_Sscrofa11.1_genomic.fna filter1_DEL.sorted.vcf.gz -Oz -o filter1_DEL.sorted.norm.vcf.gz

During the process, I encountered the following error:

Reference allele mismatch at NC_010443.5:429877 .. REF_SEQ:'G' vs VCF:'N'

I understand that I can ignore such warnings using the -c w option, but when I input the processed VCF file into the next software (Paragraph), I encounter a new error:

Exception: Different padding base for REF and ALT at NC_010444.4:148923907

Could this Reference allele mismatch issue potentially affect the processing in subsequent software? If so, is there a recommended method to address or circumvent this issue? For this scenario (post-merging structural variation data processing), are there any recommended strategies or best practices for using bcftools norm? Any additional advice or solutions would be greatly appreciated.

pd3 commented 9 months ago

The recommended strategy is to find out why the REF allele does not match the fasta reference. The program found that the fasta reference and the VCF have different bases at that position: G in the fasta versus N in the VCF.
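To see how widespread the mismatches are before deciding how to handle them, one option is to list every record whose REF disagrees with the fasta. The following is a minimal sketch using only the Python standard library, not part of bcftools. The function names read_fasta and ref_mismatches are hypothetical, the comparison is assumed to be case-insensitive, and the whole reference is loaded into memory, which may be impractical for large genomes.

```python
import gzip


def _open_text(path):
    """Open a plain or gzip/bgzip-compressed file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_fasta(path):
    """Load a FASTA file into {sequence_name: sequence}."""
    seqs, name, chunks = {}, None, []
    with _open_text(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(chunks)
                # Keep only the first word of the header, e.g. NC_010443.5
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def ref_mismatches(vcf_path, seqs):
    """Yield (chrom, pos, vcf_ref, fasta_ref) where REF differs from the FASTA."""
    with _open_text(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header lines
            chrom, pos, _id, ref = line.split("\t")[:4]
            start = int(pos) - 1  # VCF positions are 1-based
            fasta_ref = seqs.get(chrom, "")[start:start + len(ref)]
            # Assumption: compare case-insensitively (soft-masked FASTA)
            if fasta_ref.upper() != ref.upper():
                yield chrom, int(pos), ref, fasta_ref
```

Calling ref_mismatches("filter1_DEL.sorted.vcf.gz", read_fasta(reference_path)) and printing the tuples would show whether the mismatches are all cases of REF set to N, or whether some records carry a genuinely different base or coordinate.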

You can use the --check-ref s option to forcibly rewrite the VCF's REF allele to whatever is found in the fasta file. If you are confident that the coordinates are correct and understand why REF was set to N by the program that generated the VCF, then it should be safe.
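For concreteness, here is a small sketch that assembles the original command with the --check-ref option added. The helper norm_command and the path reference.fna are placeholders introduced for illustration; the mode letters e, w, x and s are the ones documented for bcftools norm (exit, warn, exclude, set/fix).

```python
import shlex

# Modes accepted by `bcftools norm --check-ref`
CHECK_REF_MODES = {
    "e": "exit on mismatch",
    "w": "warn and continue",
    "x": "exclude mismatching sites",
    "s": "set REF from the fasta",
}


def norm_command(fasta, vcf_in, vcf_out, check_ref="s"):
    """Build the argument list for a bcftools norm run (hypothetical helper)."""
    if check_ref not in CHECK_REF_MODES:
        raise ValueError(f"unknown --check-ref mode: {check_ref!r}")
    return ["bcftools", "norm", "--check-ref", check_ref,
            "-f", fasta, vcf_in, "-Oz", "-o", vcf_out]


cmd = norm_command("reference.fna",  # placeholder path to the fasta
                   "filter1_DEL.sorted.vcf.gz",
                   "filter1_DEL.sorted.norm.vcf.gz")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run(cmd, check=True) on a machine where bcftools is installed. Using x instead of s would drop the mismatching sites rather than rewrite them.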