lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
478 stars 132 forks source link

counts variant consequences per gene #191

Closed Kaskere closed 2 years ago

Kaskere commented 2 years ago

Hello,

I have to extract numbers of variant consequences per each gene (lets say desired output could be something like this: MTHFR: missense 50, intron 300, synonymous 200, upstream 40) Two types of input VEP annotated files: 1) regular VCF (gzipped) 2) gnomad genome v3.1 file (bgzipped, 10 - 100 GB in size) The problem with these VEP files is - numbers of VEP fields and pipes are not regular; also gnomad file contains many fields, separated with ";"

research file example: chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000335137 Transcript synonymous_variant 216 180 60 S tcA/tcG - IMPACT=LOW;STRAND=1

gnomad example: chrY 2893551 . TTTTA T . AC0 AC=0;AN=33443;AF=0.00000;AC_non_neuro_nfe=0;AN_non_neuro_nfe=12437;AF_non_neuro_nfe=0.00000;nhomalt_non_neuro_nfe=0;AC_non_neuro_afr_XY=0;AN_non_neuro_afr_XY=6191;AF_non_neuro_afr_XY=0.00000;nhomalt_non_neuro_afr_XY=0;AC_non_neuro_nfe_XY=0;AN_non_neuro_nfe_XY=12437;AF_non_neuro_nfe_XY=0.00000;nhomalt_non_neuro_nfe_XY=0;AC_controls_and_biobanks_eas_XY=0;gq_hist_alt_bin_freq=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0;gq_hist_all_bin_freq=0|0|0|0|23920|6612|2240|503|108|48|10|1|1|0|0|0|0|0|0|0;dp_hist_alt_bin_freq=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0;dp_hist_alt_n_larger=0;dp_hist_all_bin_freq=0|0|9449|16312|6639|911|120|10|2|0|0|0|0|0|0|0|0|0|0|0;dp_hist_all_n_smaller=0;dp_hist_all_n_larger=0;ab_hist_alt_bin_freq=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0;cadd_raw_score=-0.246556;cadd_phred=0.4050

Both files contain headers and column names.

I wrote Python code for the first type of file, but with this large and unconsistent gnomAD VCF I ran into issues.

Is it possible to get variant consequence counts per each gene with this Java kit? I looked into utilities list, but did not identify any straight-forward solution. However, it should somehow be possible with jvarkit.

Thank you!

lindenb commented 2 years ago

you could start with something like this: bcftools query -f '%INFO/vep\n genomes/gnomad.genomes.r2.1.sites.vcf.gz | tr "," "\n" | cut -d '|' -f2,4

and then group with something like datamash

I'm closing this as it's unrelated to jvarkit.