Open mjafin opened 9 years ago
Can you give concrete examples? Also have a look at the source code. It is Ruby, and pretty easy to read.
Hi @pjotrp, thanks for looking into this, happy to provide an example. Say I wanted to get tab-delimited output from vcf in the following way:
zcat my.vcf.gz | bio-vcf --sfilter 'rec.filter=="PASS"' --eval '[r.chr,r.ref,r.alt,r.info.svtype,r.info.end,r.info.known,r.info.lof,r.info.simple_ann]' --seval '[s.pr]'
Then if all the fields contain one value, I'll have 9 columns. However, if e.g. the KNOWN
field contains more than one value, separated by commas, there'll be more than 9 columns on those lines. In our case KNOWN
is a list of genes affected by a structural variant. I'd like to have them comma separated in the tab-delimited output rather than tab-delimited within tab-delimited if that makes sense.
OK. Can you also paste a few relevant VCF lines?
Certainly, the two below variants will have different numbers of fields due to one having more stuff in KNOWN
chr11 69465986 76 N <DEL> . PASS AC=0;AN=0;ANN=<DEL>|frameshift_variant&stop_lost|HIGH|CCND1|CCND1|transcript|NM_053056.2|Coding|5/5|c.825_*415del|p.Glu275fs|1034/4289|825/888|275/295||,<DEL>|3_prime_UTR_variant|MODIFIER|CCND1|CCND1|transcript|NM_053056.2|Coding|5/5|c.825_*415del||||||;CIEND=-352,10;CIEND95=-58,0;CIPOS=-6,296;CIPOS95=0,58;END=69466465;IMPRECISE;KNOWN=CCND1;LOF=(CCND1|CCND1|1|1.00);PE=6;SIMPLE_ANN=DEL|FRAMESHIFT_VARIANT&STOP_LOST|CCND1|NM_053056.2|,DEL|3_PRIME_UTR_VARIANT|CCND1|NM_053056.2|;SR=0;STRANDS=+-:6;SU=6;SVLEN=-479;SVTYPE=DEL GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:RP:AP:AB 0/0:6:6:0:200.0:0.0:-3.0,-247.0,-825.0:830:828:1:828:1:0:0:828:1:0.0012
chr12 4382934 79 N <DEL> . PASS AC=0;AN=0;ANN=<DEL>|frameshift_variant&start_lost|HIGH|CCND2|CCND2|transcript|NM_001759.3|Coding|1/5|c.-272_180del|p.Met1fs|34/6522|1/870|1/289||,<DEL>|5_prime_UTR_variant|MODIFIER|CCND2|CCND2|transcript|NM_001759.3|Coding|1/5|c.-272_180del||||||;CIEND=-352,10;CIEND95=-73,0;CIPOS=-8,341;CIPOS95=0,73;END=4383386;IMPRECISE;KNOWN=CCND2,CCND2-AS1,CCND2-AS2;LOF=(CCND2|CCND2|1|1.00);PE=4;SIMPLE_ANN=DEL|FRAMESHIFT_VARIANT&START_LOST|CCND2|NM_001759.3|,DEL|5_PRIME_UTR_VARIANT|CCND2|NM_001759.3|;SR=0;STRANDS=+-:4;SU=4;SVLEN=-452;SVTYPE=DEL GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:RP:AP:AB 0/0:4:4:0:200.0:0.0:-1.0,-175.0,-592.0:599:597:2:597:2:0:0:597:2:0.0033
OK, I see what you mean now. The thing is that magic happens at two levels. bio-vcf automatically realizes it is a list on parsing and splits that into an Array. That is really nice to have because you can refer to an indexed r.info.known[2], for example, in filters etc. When outputting the array, again, it knows it is a list and prints it out with tabs. The workaround it join the list again on output. Do something like:
--eval '[r.chr,r.ref,r.alt,r.info.svtype,r.info.end,r.info.known.join(","),r.info.lof,r.info.simple_ann]'
And you are back to normal. Remember these are Ruby statements, so you can do anything. You could even create your own print statements. Above is the easy fix, for sure.
Hi @pjotrp Thanks for the feedback again. I've tried the above .join(",")
solution but it falls apart when there is only one entry (or no entries) as strings do not have a join-function.
I have not run this, but you could try a conditional on the type
--eval '[r.chr,r.ref,r.alt,r.info.svtype,r.info.end,(r.info.known.kind_of?(String) ? r.info.known : r.info.known.join(",")),r.info.lof,r.info.simple_ann]'
it is kinda ugly and will fail when known is nil. Nil would warrant an extra conditional.
A more solid solution would to introduce a transformer. I'll keep your issue open until I come up with the 'right' solution.
Thanks - I know ruby very little but could I write a little function before the brackets that would do the join (or not do the join) on multiple of the inputs, similarly checking if the input is an array, string or nil?
That can work. Ruby can be Lisp-like. Eval is a block, so you can define stuff. On a single line split commands with semi-colon. Does the following work?
--eval 'f=lambda{|in| ( in.kind_of?(String) ? in : in.join(",") } ; [f.call(r.info.known)]'
I just tried on one of the test files in the source tree the following:
bioruby-vcf$ cat test/data/input/somaticsniper.vcf |./bin/bio-vcf --eval 'f=lambda{|x| ( x.kind_of?(String) ? x : x.join(",")) } ; [f.call(r.tumor.dp4)]'
Resulting in
2,0,1,1
0,2,1,1
1,1,2,0
2,2,1,1
1,1,0,2
3,2,1,1
1,0,2,0
That worked.
A in-line transformer in fact. One thing I realise I have, but don't use here, is the VCF header information. So far, I don't use the header on purpose - not wanting to depend on faulty headers. But in this case the header information could give the necessary hint and return a list of one instead of a String. Hmm.
Nice, thanks, I'll give that a go
This seems to have done it for me:
f=lambda{|x| ( x.nil? ? "" : (x.kind_of?(String) ? x : x.join(","))) } ;
Pjotr; Thanks for the workaround and all the suggestions. Is there any chance bio-vcf could make this easier internally? Ideally it would always return an array even with empty or items of size 1. Basing it on the header makes good sense. I agree that headers are often wrong but in this case it would help a lot to avoid having to special case everything. Thanks again.
Yeah. I was just trying to show how powerful the command line can be, even when tweaking default behaviour. I'll add this example to the docs.
I think we have to use the header and/or provide a special transformer.
The issue is listed as an 'enhancement' to be.
Related to this, sometime fields can come out as a nil, String or Array. Not so nice:
--eval [r.pos,r.info.ann] --filter 'f=lambda{|x| ( x == nil ? "" : x.kind_of?(String) ? x : x.join(",")) } ; f.call(r.info.ann) =~ /PIK3CA|GATA3/'
Hi there, Thanks for bio-vcf, I've found it incredibly useful and fast. I may be missing something, but when I'm using tab-delimited output I sometimes have fields that can have zero, one or more comma delimited values, like for example under
ANN
. It looks like these are automatically tab-delimited in the output and therefore I can't predict the number of columns on each line ahead. Is there a way to change this behaviour so that within field delimiter would be, say, comma?