samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
676 stars 240 forks source link

Sanity check failed, the reference sequence differs #691

Closed splaisan closed 7 years ago

splaisan commented 7 years ago

running bcftools stats -c both --fasta-ref $REF $VCF > $VCF.stats fails although the fasta reference is the same used for mapping and calling the vcf. looks like a IUPAC bug to me! Is there a way to bypass sanity and get results from this command (now dies empty)? Thanks

Sanity check failed, the reference sequence differs: NC_030973.1:129867+1 .. T vs W

samtools faidx REF.fasta NC_030973.1:129865-129870
>NC_030973.1:129865-129870
SYGWTS

bcftools 1.5 Using htslib 1.5

https://github.com/samtools/bcftools/blob/8198b0f599b04e793e58ae8d31fe0b8594f19fc5/vcfstats.c#L309

pd3 commented 7 years ago

I just added a support for IUPAC ambiguity codes in the reference for this particular case.

splaisan commented 7 years ago

Unless I missed something, installing a latest git bcftools over my build did not solve it for me

/opt/biotools/bin/bcftools --version
bcftools 1.6-14-geed5371
Using htslib 1.6-13-g7c5083b
Copyright (C) 2016 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ bcftools stats --fasta-ref   Ref.fasta variants.vcf.gz  > vcf.stats
Sanity check failed, the reference sequence differs: NC_030973.1:129900+2 .. G vs Y

Thanks for your feedback

pd3 commented 7 years ago

Does the command bcftools --version give the same answer as /opt/biotools/bin/bcftools --version?

If it does, could you please provide a small test case to reproduce the problem?

splaisan commented 7 years ago

yep ;-) could it be that the test you added do not include G vs Y?

$ bcftools --version bcftools 1.6-14-geed5371 Using htslib 1.6-13-g7c5083b Copyright (C) 2016 Genome Research Ltd. License Expat: The MIT/Expat license This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

$ which bcftools /opt/biotools/bin/bcftools

$ /opt/biotools/bin/bcftools --version bcftools 1.6-14-geed5371 Using htslib 1.6-13-g7c5083b Copyright (C) 2016 Genome Research Ltd. License Expat: The MIT/Expat license This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

pd3 commented 7 years ago

Sorry, I should have picked that up already from the error message. Of course it does not because Y={C,T}. The program does the right thing when it complains here.

splaisan commented 7 years ago

I agree but still find this error strange because the ref base at that position is a 'Y' so why calling this insanity! unless the first call ending with G is the cause and be more of a varscan error? the exact same fasta was used to make the BWA index and call variants I am lost here :-(

samtools faidx R64_SEUB3.0_merged.fasta NC_030973.1:129900-129902

NC_030973.1:129900-129902 CAY

$ tabix variants.vcf.gz NC_030973.1:129900-129902 NC_030973.1 129900 . CAG C . PASS ADP=22;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:49:33:22:10:13:48.15%:1.1242E-5:23:12:6:4:3:10 NC_030973.1 129902 . Y A . PASS ADP=23;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:41:33:23:0:8:47.06%:7.77E-5:0:17:0:0:6:2

pd3 commented 7 years ago

The VCF specification requires that the REF base is one of A,C,G,T,N and that it matches the fasta reference. Callers can use fasta files with IUPAC codes, but they have to output the proper REF base in the VCF. In case there is Y in the fasta, the REF base in the VCF could be in principle either C or T, but not G. In fact, the specification requires that the alphabetically first base must be used, so in your case the first record should have REF=CAC and the second C.

splaisan commented 7 years ago

which then means that varscan was guilty after all and that I will never be able to get stats for this vcf data except after correcting one by one each sanity error. Would there be a way to run bcftools sats in a lenient manner with an extra argument (-f !) to ignore sanity errors?

Thanks a lot Peter for your answers and help. Stephane