mmokrejs closed this issue 3 years ago
I think you simply need to add a whitespace or tab in there. So this should work as expected (note that I moved the whitespace from before the [] brackets to inside them):
bcftools +split-vep -f "%CHROM:%POS %ID %REF %ALT[ %AD][ %AF]\n" -d -s worst:missense+ -i 'IMPACT~"HIGH"' /tmp/gatk.Mutect2.FilterMutectCalls.vep101.bcftools_filter_by_depth.txt
Thank you, this [ %AF] [ %GT] helps a bit. But I get two spaces as a field separator in the output. This also seems to be a problem for the [%GT] field. Should a comma perhaps be used to separate the values for each sample?
Hmm, [,%AF] yields ,0.04,0.048 with a leading comma. Same for semicolon.
Yeah, for the FORMAT fields you always have to include the separator inside the [] brackets as a leading character, and then ensure that the same separator is not also included before the brackets. So for your example two comments above, this would be:
[ %AF][ %GT]
The expression simply does exactly what you specify, and the extra whitespace between the brackets (i.e. ] [) will generate the double whitespace in front of the first GT entry in the output.
And which separator you want to use depends on what you want to do downstream.
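To see why the leading separator shows up, here is a minimal shell sketch that only emulates the behaviour described above, without calling bcftools. The function name expand_block and the ALT allele T are made up for illustration; the AF values are the ones quoted in this thread.

```shell
# Hypothetical emulation (not bcftools code): the body of a [...] block is
# printed once per sample, so a separator placed inside the brackets comes
# before every value, including the first one.
expand_block() {
  sep=$1
  shift
  for value in "$@"; do
    printf '%s%s' "$sep" "$value"
  done
}

# Emulates "[,%AF]" for two samples with AF 0.04 and 0.048.
comma_block=$(expand_block ',' 0.04 0.048)

# Emulates "%ALT[ %AF]": the space inside the brackets is the only separator.
single_space="T$(expand_block ' ' 0.04 0.048)"

# Emulates "%ALT [ %AF]": a space before the bracket plus the one inside it.
double_space="T $(expand_block ' ' 0.04 0.048)"

printf '%s\n' "$comma_block" "$single_space" "$double_space"
```

The second and third lines differ only by the extra space before the bracket, which is where the double separator comes from.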
Could I prevent the leading separator instance from appearing? My aim is to get a tabular listing and then import it into, say, Excel: first split by a space and then sub-split by comma or semicolon. The leading comma, for example, could be removed by a '/ ,/ /' regexp replacement, but is that really necessary?
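For what it's worth, that replacement fits in a single sed call. This is only a sketch on a made-up line (the coordinates, alleles and genotypes are invented; the AF values are the ones from this thread), assuming space-separated fields with comma-separated per-sample values, as a format like "%ALT [,%AF] [,%GT]" would give.

```shell
# Invented example line with a leading comma in each per-sample field.
line='chr1:100 . A T ,0.04,0.048 ,0/0,0/1'

# Replace every " ," (field separator followed by the leading comma) with " ".
cleaned=$(printf '%s\n' "$line" | sed 's/ ,/ /g')

printf '%s\n' "$cleaned"
```

This relies on " ," never occurring inside a legitimate value, which should be checked against the real data before importing into a spreadsheet.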
Depending on what you want to do downstream, you might also consider having one line per sample and site, which would be a tidy data format -- this would circumvent the need to have several levels per line to parse.
In any case, I think the examples over at the bcftools query docs might help you further.
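A sketch of the one-line-per-sample layout: putting the whole record, including the newline, inside the brackets should make bcftools query print one row per sample and site. The file name input.vcf.gz, the VCF environment variable and the helper tidy_query are placeholders I made up, and %AF is just an example field; the guard skips the call when bcftools or the file is missing.

```shell
# Format string: everything, newline included, sits inside the brackets,
# so the whole row is repeated for each sample.
tidy_fmt='[%CHROM\t%POS\t%SAMPLE\t%AF\n]'

# Hypothetical helper; $1 is the path to a VCF/BCF file.
tidy_query() {
  bcftools query -f "$tidy_fmt" "$1"
}

# Assumed input path; override with VCF=/path/to/file.
vcf=${VCF:-input.vcf.gz}
if command -v bcftools >/dev/null 2>&1 && [ -f "$vcf" ]; then
  tidy_query "$vcf"
else
  echo "bcftools or $vcf not found; skipping" >&2
fi
```

With this layout there is no separator nesting to undo: every row has the same four tab-separated columns.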
And I think the behaviour you see is not actually a bug, but the intended behaviour. So unless I am missing something here or you indeed want to suggest some particular change in behaviour, I'd suggest closing this issue if you can solve your problem with the docs and the suggestions here.
So why does the leading separator character get printed at all?
Also, the [filter.c:2491 filters_init1] Error: the tag "IMPACT" is not defined in the VCF header message is bogus IMO.
There seem to be two issues here. Regarding the main error, please try with the latest version of bcftools; there have been fixes and improvements.
The other issue is about query formatting; admittedly this could have been done better, with the leading character not printed for the first sample. Unfortunately it is too late to change this now, as it would break backward compatibility. It is easy to work around by following @dlaehnemann's suggestion:
bcftools query -f '%CHROM\t%POS[\t%GT]\n'
yields columns separated by a single tab. If there is no other output before the square brackets, you can always stream through sed:
bcftools query -f '[\t%GT]\n' | sed 's,^\s\s*,,'
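One note on the sed step: \s is, as far as I know, a GNU sed extension. A sketch of a more portable variant using a POSIX character class, run here on a made-up line standing in for real bcftools output:

```shell
# Made-up stand-in for one line of: bcftools query -f '[\t%GT]\n'
raw=$(printf '\t0/0\t0/1')

# Strip leading whitespace only; the tab between the genotypes is kept.
stripped=$(printf '%s\n' "$raw" | sed 's/^[[:space:]]*//')

printf '%s\n' "$stripped"
```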
This seems to be resolved, closing now. Please open a new ticket if not.
Hi, I am maybe twisting the use of bcftools, as I fed it a VCF file output by GATK-4.1.0.9 Mutect2, containing a normal and a tumor sample pair. I guess bcftools is parsing INFO fields for both samples and somehow mixes the results together. My aim was to annotate the CSQ values, but admittedly I forgot that there are two samples in the input. Should bcftools +split-vep complain loudly if two samples are in the input? I do not see in the manpage or runtime help text how to parse only the values for a certain sample in the context of the +split-vep plugin; actually, would it make sense? If I am not mistaken, it would be helpful to add a check to the software and also improve the documentation.
The example below is a bit complicated, as bcftools-1.10.2 +split-vep requires me to enumerate all the fields appearing in the input to be parsed, while spitting out a confusing message IMO:
[filter.c:2491 filters_init1] Error: the tag "IMPACT" is not defined in the VCF header
The message should be rephrased; it complains about IMPACT missing from the -f "%CHROM:%POS %ID %REF %ALT [%AD] [%AF]\n" argument, I do not know why. The field IMPACT is present in the VCF input file: gatk.Mutect2.FilterMutectCalls.vep101.bcftools_filter_by_depth.txt
Due to that, I keep the long list of columns I wanted to have printed out initially, although it could be shortened for demonstration purposes, I assume.
In the output below, please note 0.040.048 in the 6th column, originating from [%AF]. It combines 0.04 from the first sample and 0.048 from the second sample.
As I said above, the error message below does not actually complain about the input VCF file but about the fields I have omitted from the -f argument, I don't know why.
gatk.Mutect2.FilterMutectCalls.vep101.bcftools_filter_by_depth.txt