samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

bcftools-1.19 csq: generates BCSQ but some tools anticipate CSQ instead #2068

Closed mmokrejs closed 9 months ago

mmokrejs commented 9 months ago

Hi Peter, it seems to me some native bcftools tools or its plugins require CSQ while it seems you switched to BCSQ instead. Would you mind checking all the tools are in sync with each other? I used initially 1.17 but also 1.19 behaves same. Why is TBCSQ existing too? Where is the documentation?

$ bcftools csq -f ref.fasta -g ref.gff3 -p m -n 150 some.vcf | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1' | bcftools +split-vep -f '%CHROM\t%POS\t%Consequence\t%amino_acid_change\t%dna_change\t%DP\t%ADF\t%ADR\t%AD\t%CSQ\t[ %VAF]\t[ %VAF1]\n'`
Parsing ref.gff3 ...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
The tag INFO/CSQ not found in the header
$

I have to add sed -e 's/BCSQ/CSQ/g'in between the pipes.

Still, bcftools +split-vep discards most of the input lines, dunno why. Only two lines are output. But that is a different bug. Edit: That was actually bcftools csq not annotating most of the lines with BCSQ, fixed for me by bcftools csq -s -, thanks to https://github.com/samtools/bcftools/issues/2070#issuecomment-1881468481.

pd3 commented 9 months ago

I don't think there is a bug, probably just a different expectation / misunderstanding.

From the usage page:

bcftools csq 
-c, --custom-tag STRING           Use this tag instead of the default BCSQ

bcftools +split-vep
-a, --annotation STR            INFO annotation to parse [CSQ]

and from the documentation (https://samtools.github.io/bcftools/bcftools.html#csq):

The output VCF is annotated with INFO/BCSQ and FORMAT/BCSQ tag (configurable with the -c option). The latter is a bitmask of indexes to INFO/BCSQ, with interleaved haplotypes. See the usage examples below for using the %TBCSQ converter in query for extracting a more human readable form from this bitmask.

mmokrejs commented 9 months ago

Hmm, so the value under CSQ is definitely different than BCSQ, fine. Then the sed is definitely not fixing the issue, because the data are still bitmasks. But what should I do as a user? I am lost. I asked +split-vep to give me a TSV output albeit with some fields separated by commas and some by pipe symbols.

pd3 commented 9 months ago

Unfortunately, I don't understand what the problem is. Reading the documentation and a little bit of experimentation should hopefully do.

As shown above, both commands have an option to override the default tag name. The FORMAT/BCSQ is a bitmask and the query command can be used to convert to a human readable form, as described in the documentation. The +split-vep usage page shows quite a few examples and links to more, I believe that should be enough to get one started.

I attribute this to a late Friday tiredness :)

mmokrejs commented 9 months ago

So from our off-line discussion now it is clearer to me. https://samtools.github.io/bcftools/howtos/plugin.split-vep.html anticipates CSQ which is normally output by VEP. To directly consume bcftools csq output one needs to use bcftools +split-vep -a BCSQ. Alternatively bcftoos csq -c CSQ must be specified and that I discovered already before reporting. Still I did not understand why the tools do not work together. Either of the two extra cmdline switches is needed.

As you wrote in an email, if you you make split-vep able to search for BCSQ if CSQ is not present, it would have solved my issue. Although I understand how it evolved I think it would be great to fix that. And the docs say nothing about this "incompatibility": https://samtools.github.io/bcftools/howtos/plugin.split-vep.html