uab-cgds-worthey / DITTO

Variant Deleteriousness prediction tool using AI
GNU General Public License v3.0
VEP output parsing - [merged] #12

Closed by ManavalanG 1 year ago

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Jan 31, 2021, 15:38

_Merges vep_outputparsing -> master_

A simple, no-frills parser that takes VEP-annotated VCFs and converts them into TSV format for easier downstream use. This includes review of the following:

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 11:06

Commented on annotation_parsing/README.md line 26

Looks like we need to load the Anaconda module before running the script on Cheaha, or it doesn't work (at least for me :p). The command is: module load Anaconda3/2020.02

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 11:08

Commented on annotation_parsing/README.md line 28

Can we please have all the files in the same place instead of different directories? Sorry, I didn't think about this in the previous MR.

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 15:19

Commented on annotation_parsing/parse_annotated_vars.py line 113

Looks like there is an error when running ClinVar variants.

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Feb 1, 2021, 15:54

Commented on annotation_parsing/README.md line 28

We already have issue #3 open for this exact thing :grin: For now it's best to get things in and consolidate later, once we have a clear picture of everything that will be in the project.

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 15:55

Commented on annotation_parsing/README.md line 28

Gotcha!

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Feb 1, 2021, 16:09

Commented on annotation_parsing/README.md line 26

This is a Cheaha-specific issue. By default, the Python that gets loaded when you start an interactive shell on Cheaha is Python 2.7.5. When you load and init Anaconda3, its base environment uses Python 3.7.5, which is why the script only worked for you after loading the module.
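
One way to make this failure mode obvious is an explicit interpreter check at the top of a script. The snippet below is only a sketch: `require_python3` is a hypothetical helper, not something that exists in parse_annotated_vars.py, and the error text simply repeats the module name mentioned in this thread.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Exit with a readable message when run under Python 2 (hypothetical helper)."""
    if version_info[0] < 3:
        raise SystemExit(
            "This script needs Python 3. On Cheaha, run "
            "'module load Anaconda3/2020.02' first."
        )


# Call once at start-up, before any Python 3 only code runs.
require_python3()
```

Note that the check only helps if the file itself still parses under Python 2, so it would have to sit in a module free of Python 3 only syntax.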

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Feb 1, 2021, 16:24

Commented on annotation_parsing/parse_annotated_vars.py line 113

changed this line in version 2 of the diff

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Feb 1, 2021, 16:24

added 1 commit

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Feb 1, 2021, 16:24

Commented on annotation_parsing/parse_annotated_vars.py line 113

ok I've pushed up the fix, give it a go and let me know if it works out now.

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 17:02

Commented on annotation_parsing/README.md line 26

okay

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 17:03

Commented on annotation_parsing/parse_annotated_vars.py line 113

It's working now!

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 17:03

resolved all threads

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 17:05

marked the checklist item README provided with the parser as completed

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 17:05

marked the checklist item Review of the parser code as completed

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 1, 2021, 17:05

marked the checklist item Review of the test VEP annotated VCF and the corresponding output format as completed

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Feb 1, 2021, 20:42

As a note, @tkmamidi asked for some clarification on one of the output columns from the parsing. Question: how are Alternate Allele and VEP_Allele_Identifier different?

Brandon Wilk 10:47 AM :smile:

VEP's output format for multi-allelic lines in the case of insertions, deletions, and indels is quite dumb IMO.

For each set of annotations, VEP lists the allele the annotations are associated with, but it does not always use the same format as the ALT allele listed in the VCF. So to be transparent (and also to help check my work lol), I include the allele as listed by VEP as its own column, which allows back-tracking from the parsed TSV to the crap in the VEP-annotated VCF.

for example consider this variant annotated by VEP:

1       19631483        .       CTT     C       18.74   PASS    FS=0;MQ=238.5;QD=9.37;SOR=1.609;FractionInformativeReads=0.5;DP=2;AF=1;AN=2;AC=2;CSQ=-|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_001320979.1|protein_coding||5/5|NM_001320979.1:c.814-605_814-604del|||||||||-1||EntrezGene||rseq_mrna_match||TT|TT||||1.643|-0.076211||||||||||||||,-|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_003689.3|protein_coding||6/6|NM_003689.3:c.919-605_919-604del|||||||||-1||EntrezGene||rseq_mrna_match||TT|TT||||1.643|-0.076211||||||||||||||

the CSQ key in the INFO column holds the VEP annotation fields, separated by pipes

multiple transcripts' worth of information are separated by commas

in this example there are two transcripts

-|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_001320979.1|protein_coding||5/5|NM_001320979.1:c.814-605_814-604del|||||||||-1||EntrezGene||rseq_mrna_match||TT|TT||||1.643|-0.076211||||||||||||||

and

-|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_003689.3|protein_coding||6/6|NM_003689.3:c.919-605_919-604del|||||||||-1||EntrezGene||rseq_mrna_match||TT|TT||||1.643|-0.076211||||||||||||||

the first field of each of those specifies the alt allele that the annotation info belongs to

as you can see it's a - here because this variant is a deletion

no big deal since there's only one variant listed here, but still annoying
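
The splitting described above can be sketched in a few lines of Python. This is not the parser from this merge request: `split_csq` is a hypothetical helper, the field positions used (0 for the allele, 6 for the transcript) are read off this particular example rather than from the VCF header, and it assumes no field value contains a literal comma or pipe.

```python
def split_csq(info):
    """Return one list of pipe-separated fields per CSQ annotation block."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "CSQ":
            # Commas separate annotation blocks, pipes separate fields.
            return [block.split("|") for block in value.split(",")]
    return []


# INFO column of the deletion at 1:19631483 quoted above.
info = (
    "FS=0;MQ=238.5;QD=9.37;SOR=1.609;FractionInformativeReads=0.5;DP=2;AF=1;AN=2;AC=2;"
    "CSQ=-|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_001320979.1|protein_coding||5/5|"
    "NM_001320979.1:c.814-605_814-604del|||||||||-1||EntrezGene||rseq_mrna_match||TT|TT||||"
    "1.643|-0.076211||||||||||||||,"
    "-|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_003689.3|protein_coding||6/6|"
    "NM_003689.3:c.919-605_919-604del|||||||||-1||EntrezGene||rseq_mrna_match||TT|TT||||"
    "1.643|-0.076211||||||||||||||"
)

blocks = split_csq(info)
for fields in blocks:
    # Field 0 is the allele as VEP writes it, field 6 the transcript ID.
    print(fields[0], fields[6])
```

For this record the loop prints the allele `-` once per transcript, NM_001320979.1 and NM_003689.3.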

Tarun Mamidi 10:58 AM: Gotcha! Thanks for the explanation :slightly_smiling_face:

Brandon Wilk 10:59 AM: well, it gets worse :joy: when you get to lines like this:

1       19633106        rs72255348      AT      ATT,ATTT,A,ATTTT        83.31   DRAGENHardQUAL  FS=0;MQ=240.9;QD=3.23;SOR=2.303;FractionInformativeReads=0.667;DB;MQRankSum=-0.691;ReadPosRankSum=1.678;R2_5P_bias=0;DP=1317;AF=1,0.5,0.5,0.5;AN=384;AC=338,3,1,1;CSQ=TT|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_001320979.1|protein_coding||4/5|NM_001320979.1:c.683+389dup|||||||||-1||EntrezGene||rseq_mrna_match||T|T||||1.026|-0.146843|-0.95|rs3835240|25636|27700|9.25487e-01|||||||||,TTT|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_001320979.1|protein_coding||4/5|NM_001320979.1:c.683+389_683+390insAA|||||||||-1||EntrezGene||rseq_mrna_match||T|T||||0.999|-0.150858|-0.95|rs3835240|31|27700|1.11913e-03|||||||||,-|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_001320979.1|protein_coding||4/5|NM_001320979.1:c.683+389del|||||||||-1||EntrezGene||rseq_mrna_match||T|T||||0.967|-0.155699|-0.95|||||||||||||,TTTT|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_001320979.1|protein_coding||4/5|NM_001320979.1:c.683+389_683+390insAAA|||||||||-1||EntrezGene||rseq_mrna_match||T|T||||||-0.95|||||||||||||,TT|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_003689.3|protein_coding||5/6|NM_003689.3:c.788+389dup|||||||||-1||EntrezGene||rseq_mrna_match||T|T||||1.026|-0.146843|-0.95|rs3835240|25636|27700|9.25487e-01|||||||||,TTT|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_003689.3|protein_coding||5/6|NM_003689.3:c.788+389_788+390insAA|||||||||-1||EntrezGene||rseq_mrna_match||T|T||||0.999|-0.150858|-0.95|rs3835240|31|27700|1.11913e-03|||||||||,-|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_003689.3|protein_coding||5/6|NM_003689.3:c.788+389del|||||||||-1||EntrezGene||rseq_mrna_match||T|T||||0.967|-0.155699|-0.95|||||||||||||,TTTT|intron_variant|MODIFIER|AKR7A2|8574|Transcript|NM_003689.3|protein_coding||5/6|NM_003689.3:c.788+389_788+390insAAA|||||||||-1||EntrezGene||rseq_mrna_match||T|T||||||-0.95|||||||||||||

which is just no fun :sob:

so I left the VEP allele in as its own column to reduce ambiguity
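
To make the mismatch concrete, here is a rough sketch of the convention that both records quoted in this thread appear to follow: when REF and every ALT share the same first base and the alleles differ in length, the shared base is dropped and an empty result is written as `-`. The function name `vep_allele_ids` is hypothetical and the rule is inferred from these two examples only, so treat it as an assumption and not as VEP's documented behavior or as what the parser does.

```python
def vep_allele_ids(ref, alts):
    """Guess the allele strings VEP reports for one VCF record (assumed rule)."""
    alleles = [ref] + list(alts)
    same_first_base = len({allele[0] for allele in alleles}) == 1
    differing_lengths = len({len(allele) for allele in alleles}) > 1
    if same_first_base and differing_lengths:
        # Indel-style record: drop the shared leading base; nothing left means "-".
        return [alt[1:] or "-" for alt in alts]
    # Otherwise assume the ALT alleles are reported unchanged.
    return list(alts)


# The two records from this thread.
deletion = vep_allele_ids("CTT", ["C"])
alts = ["ATT", "ATTT", "A", "ATTTT"]
multi_allelic = dict(zip(alts, vep_allele_ids("AT", alts)))
print(deletion)
print(multi_allelic)
```

Under that assumption, the multi-allelic record maps ATT, ATTT, A, and ATTTT to TT, TTT, -, and TTTT, which matches the first field of each CSQ block in the line above. Keeping the VEP allele as a separate TSV column avoids having to rely on a reconstruction like this.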

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 2, 2021, 13:06

approved this merge request

ManavalanG commented 3 years ago

In GitLab by @tkmamidi on Feb 2, 2021, 13:06

marked this merge request as ready

ManavalanG commented 3 years ago

In GitLab by @wilkb777 on Feb 2, 2021, 13:22

mentioned in commit dd06512484968654d59c31ec3f83a44f29b9c43f