samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
640 stars 241 forks source link

bcftools-1.19 csq does not add FORMAT/BCSQ to all lines #2070

Closed mmokrejs closed 6 months ago

mmokrejs commented 6 months ago

Hi Peter, I am opening a separate issue for this. I have a region which is encoding a protein, have ref.fasta and ref.gff3 for this region. I prepended fake START codon and appended fake STOP codon to make bcftools csq happy and execute the haplotype-aware interpretation of the changes eventually undoing/altering each others consequence. Actually also have prepended leading 978 Ns as padding for easier interpretation.

But, only two lines (sites) out of some hundreds are annotated by FORMAT/BCSQ. Why?

bcftools csq -f ref.fasta -g ref.gff3 -p m -n 150 some.vcf

My gff3 contains

ref ignored_field   gene    979 1440    .   +   .   ID=gene:SOMEGENE;biotype=protein_coding;Name=GENE1
ref ignored_field   transcript  979 1440    .   +   .   ID=transcript:12345678;Parent=gene:GENE1;biotype=protein_coding
ref ignored_field   exon    979 1440    .   +   .   Parent=transcript:12345678
ref ignored_field   CDS 979 1440    .   +   0   Parent=transcript:12345678

I could provide the input VCF in an email off the web.

pd3 commented 6 months ago

Records for which no functional consequence could be determined are left without annotation. Perhaps this is what you are observing?

mmokrejs commented 6 months ago

I sent you off-the-list a testcase. It does not happen in local-mode, only in the haplotype-aware mode.

pd3 commented 6 months ago

In the test case provided offline it turned out that the sample had GT=0, therefore there were no variants to apply. If all variants should be applied, provide -s -

-s, --samples -|LIST              Samples to include or "-" to apply all variants and ignore samples
mmokrejs commented 6 months ago

Thank you! That solves the issue for me with haplotype-aware caller, indeed. I am missing the meaning of that the sample had GT=0, therefore there were no variants to apply but that is my issue. I wanted to use the haplotype-aware mode so that if e.g. adjacent mutations undo each others' effect that no functional change is reported.

$ bcftools csq -f ref.fasta -g ref.gff3 -p m -n 150 --local-csq test.vcf | grep -v '^#' | grep -c CSQ # local mode does not show the problem
...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
440
$ bcftools csq -f ref.fasta -g ref.gff3 -p m -n 150 test.vcf | grep -v '^#' | grep -c CSQ # haplotype-aware mode shows the problem
...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
1
$ bcftools csq -f ref.fasta -g ref.gff3 -p m -n 150 test.vcf -s - | grep -v '^#' | grep -c CSQ # indeed '-s -' is a workaround
...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
440
$