vcflib / bio-vcf

Smart VCF parser DSL
MIT License
82 stars 23 forks source link

Change delimiter for lists #31

Open mjafin opened 9 years ago

mjafin commented 9 years ago

Hi there, Thanks for bio-vcf, I've found it incredibly useful and fast. I may be missing something, but when I'm using tab-delimited output I sometimes have fields that can have zero, one or more comma delimited values, like for example under ANN. It looks like these are automatically tab-delimited in the output and therefore I can't predict the number of columns on each line ahead. Is there a way to change this behaviour so that within field delimiter would be, say, comma?

pjotrp commented 9 years ago

Can you give concrete examples? Also have a look at the source code. It is Ruby, and pretty easy to read.

mjafin commented 9 years ago

Hi @pjotrp, thanks for looking into this, happy to provide an example. Say I wanted to get tab-delimited output from vcf in the following way: zcat my.vcf.gz | bio-vcf --sfilter 'rec.filter=="PASS"' --eval '[r.chr,r.ref,r.alt,r.info.svtype,r.info.end,r.info.known,r.info.lof,r.info.simple_ann]' --seval '[s.pr]'

Then if all the fields contain one value, I'll have 9 columns. However, if e.g. the KNOWN field contains more than one value, separated by commas, there'll be more than 9 columns on those lines. In our case KNOWN is a list of genes affected by a structural variant. I'd like to have them comma separated in the tab-delimited output rather than tab-delimited within tab-delimited if that makes sense.

pjotrp commented 9 years ago

OK. Can you also paste a few relevant VCF lines?

mjafin commented 9 years ago

Certainly, the two below variants will have different numbers of fields due to one having more stuff in KNOWN

chr11   69465986        76      N       <DEL>   .       PASS    AC=0;AN=0;ANN=<DEL>|frameshift_variant&stop_lost|HIGH|CCND1|CCND1|transcript|NM_053056.2|Coding|5/5|c.825_*415del|p.Glu275fs|1034/4289|825/888|275/295||,<DEL>|3_prime_UTR_variant|MODIFIER|CCND1|CCND1|transcript|NM_053056.2|Coding|5/5|c.825_*415del||||||;CIEND=-352,10;CIEND95=-58,0;CIPOS=-6,296;CIPOS95=0,58;END=69466465;IMPRECISE;KNOWN=CCND1;LOF=(CCND1|CCND1|1|1.00);PE=6;SIMPLE_ANN=DEL|FRAMESHIFT_VARIANT&STOP_LOST|CCND1|NM_053056.2|,DEL|3_PRIME_UTR_VARIANT|CCND1|NM_053056.2|;SR=0;STRANDS=+-:6;SU=6;SVLEN=-479;SVTYPE=DEL     GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:RP:AP:AB      0/0:6:6:0:200.0:0.0:-3.0,-247.0,-825.0:830:828:1:828:1:0:0:828:1:0.0012
chr12   4382934 79      N       <DEL>   .       PASS    AC=0;AN=0;ANN=<DEL>|frameshift_variant&start_lost|HIGH|CCND2|CCND2|transcript|NM_001759.3|Coding|1/5|c.-272_180del|p.Met1fs|34/6522|1/870|1/289||,<DEL>|5_prime_UTR_variant|MODIFIER|CCND2|CCND2|transcript|NM_001759.3|Coding|1/5|c.-272_180del||||||;CIEND=-352,10;CIEND95=-73,0;CIPOS=-8,341;CIPOS95=0,73;END=4383386;IMPRECISE;KNOWN=CCND2,CCND2-AS1,CCND2-AS2;LOF=(CCND2|CCND2|1|1.00);PE=4;SIMPLE_ANN=DEL|FRAMESHIFT_VARIANT&START_LOST|CCND2|NM_001759.3|,DEL|5_PRIME_UTR_VARIANT|CCND2|NM_001759.3|;SR=0;STRANDS=+-:4;SU=4;SVLEN=-452;SVTYPE=DEL        GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:RP:AP:AB      0/0:4:4:0:200.0:0.0:-1.0,-175.0,-592.0:599:597:2:597:2:0:0:597:2:0.0033
pjotrp commented 9 years ago

OK, I see what you mean now. The thing is that magic happens at two levels. bio-vcf automatically realizes it is a list on parsing and splits that into an Array. That is really nice to have because you can refer to an indexed r.info.known[2], for example, in filters etc. When outputting the array, again, it knows it is a list and prints it out with tabs. The workaround it join the list again on output. Do something like:

--eval '[r.chr,r.ref,r.alt,r.info.svtype,r.info.end,r.info.known.join(","),r.info.lof,r.info.simple_ann]' 

And you are back to normal. Remember these are Ruby statements, so you can do anything. You could even create your own print statements. Above is the easy fix, for sure.

mjafin commented 9 years ago

Hi @pjotrp Thanks for the feedback again. I've tried the above .join(",") solution but it falls apart when there is only one entry (or no entries) as strings do not have a join-function.

pjotrp commented 9 years ago

I have not run this, but you could try a conditional on the type

--eval '[r.chr,r.ref,r.alt,r.info.svtype,r.info.end,(r.info.known.kind_of?(String) ? r.info.known : r.info.known.join(",")),r.info.lof,r.info.simple_ann]' 

it is kinda ugly and will fail when known is nil. Nil would warrant an extra conditional.

A more solid solution would to introduce a transformer. I'll keep your issue open until I come up with the 'right' solution.

mjafin commented 9 years ago

Thanks - I know ruby very little but could I write a little function before the brackets that would do the join (or not do the join) on multiple of the inputs, similarly checking if the input is an array, string or nil?

pjotrp commented 9 years ago

That can work. Ruby can be Lisp-like. Eval is a block, so you can define stuff. On a single line split commands with semi-colon. Does the following work?

--eval 'f=lambda{|in| ( in.kind_of?(String) ? in : in.join(",") } ; [f.call(r.info.known)]'

I just tried on one of the test files in the source tree the following:

bioruby-vcf$ cat test/data/input/somaticsniper.vcf |./bin/bio-vcf --eval 'f=lambda{|x| ( x.kind_of?(String) ? x : x.join(",")) } ; [f.call(r.tumor.dp4)]'

Resulting in

2,0,1,1
0,2,1,1
1,1,2,0
2,2,1,1
1,1,0,2
3,2,1,1
1,0,2,0

That worked.

A in-line transformer in fact. One thing I realise I have, but don't use here, is the VCF header information. So far, I don't use the header on purpose - not wanting to depend on faulty headers. But in this case the header information could give the necessary hint and return a list of one instead of a String. Hmm.

mjafin commented 9 years ago

Nice, thanks, I'll give that a go

mjafin commented 9 years ago

This seems to have done it for me: f=lambda{|x| ( x.nil? ? "" : (x.kind_of?(String) ? x : x.join(","))) } ;

chapmanb commented 9 years ago

Pjotr; Thanks for the workaround and all the suggestions. Is there any chance bio-vcf could make this easier internally? Ideally it would always return an array even with empty or items of size 1. Basing it on the header makes good sense. I agree that headers are often wrong but in this case it would help a lot to avoid having to special case everything. Thanks again.

pjotrp commented 9 years ago

Yeah. I was just trying to show how powerful the command line can be, even when tweaking default behaviour. I'll add this example to the docs.

I think we have to use the header and/or provide a special transformer.

The issue is listed as an 'enhancement' to be.

pjotrp commented 8 years ago

Related to this, sometime fields can come out as a nil, String or Array. Not so nice:

 --eval [r.pos,r.info.ann] --filter 'f=lambda{|x| ( x == nil ? "" : x.kind_of?(String) ? x : x.join(",")) } ; f.call(r.info.ann) =~ /PIK3CA|GATA3/'