counts variant consequences per gene

Hello,

I have to extract numbers of variant consequences per each gene (lets say desired output could be something like this: MTHFR: missense 50, intron 300, synonymous 200, upstream 40) Two types of input VEP annotated files: 1) regular VCF (gzipped) 2) gnomad genome v3.1 file (bgzipped, 10 - 100 GB in size) The problem with these VEP files is - numbers of VEP fields and pipes are not regular; also gnomad file contains many fields, separated with ";"

research file example: chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000335137 Transcript synonymous_variant 216 180 60 S tcA/tcG - IMPACT=LOW;STRAND=1

gnomad example: chrY 2893551 . TTTTA T . AC0 AC=0;AN=33443;AF=0.00000;AC_non_neuro_nfe=0;AN_non_neuro_nfe=12437;AF_non_neuro_nfe=0.00000;nhomalt_non_neuro_nfe=0;AC_non_neuro_afr_XY=0;AN_non_neuro_afr_XY=6191;AF_non_neuro_afr_XY=0.00000;nhomalt_non_neuro_afr_XY=0;AC_non_neuro_nfe_XY=0;AN_non_neuro_nfe_XY=12437;AF_non_neuro_nfe_XY=0.00000;nhomalt_non_neuro_nfe_XY=0;AC_controls_and_biobanks_eas_XY=0;gq_hist_alt_bin_freq=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0;gq_hist_all_bin_freq=0|0|0|0|23920|6612|2240|503|108|48|10|1|1|0|0|0|0|0|0|0;dp_hist_alt_bin_freq=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0;dp_hist_alt_n_larger=0;dp_hist_all_bin_freq=0|0|9449|16312|6639|911|120|10|2|0|0|0|0|0|0|0|0|0|0|0;dp_hist_all_n_smaller=0;dp_hist_all_n_larger=0;ab_hist_alt_bin_freq=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0;cadd_raw_score=-0.246556;cadd_phred=0.4050

Both files contain headers and column names.

I wrote Python code for the first type of file, but with this large and unconsistent gnomAD VCF I ran into issues.

Is it possible to get variant consequence counts per each gene with this Java kit? I looked into utilities list, but did not identify any straight-forward solution. However, it should somehow be possible with jvarkit.

Thank you!

lindenb / jvarkit

counts variant consequences per gene #191