Documenting a parseVEP bug

lydiayliu commented 3 years ago

Following code for launching MoPepGen:

sudo docker run -it --rm -v /data/ref/GRCh38/ref_star/:/genome/ -v /data/resources/proteomics/hsapiens/GENCODE/GRCh38.13/release_34/:/proteome/ -v /data/resources/transcriptomics/hsapiens/GENCODE/GRCh38.p13/release_34/:/transcriptome/ -v /data/projects/cpcgene/noncanonical_peptides/Mutation/gencodev34_grch38/VEP/:/VEP/ -v /data/users/yiyangliu/MoPepGen/:/data/ continuumio/miniconda3 /bin/bash

conda create --name mopepgen --yes
conda activate mopepgen
pip install /data/private-moPepGen-main.zip

a=/VEP/germline/filtered_indel/CPCG0183.gencode.aa.tsv

moPepGen parseVEP \
    --vep-txt ${a} \
    --index-dir /data/GRCh38_Gencodev34/ \
    --output-prefix /data/Parser/VEP/gencode.aa/germline_indel/test
done

Gives bug:

/VEP/germline/filtered_indel/CPCG0183.gencode.aa.tsv
[ 2021-07-13 19:00:25 ] moPepGen parseVEP started.
[ 2021-07-13 19:01:25 ] Indexed genome and annotation loaded.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/parser/VEPParser.py", line 162, in convert_to_variant_record
    return seqvar.VariantRecord(
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/seqvar/VariantRecord.py", line 62, in __init__
    raise ValueError("Length of ref must match with location.")
ValueError: Length of ref must match with location.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/moPepGen", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 350, in main
    args.func(args)
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/cli/parse_vep.py", line 62, in parse_vep
    record = record.convert_to_variant_record(transcript_seq)
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/parser/VEPParser.py", line 175, in convert_to_variant_record
    raise ValueError(e.args[0] + f' [{self.feature}]') from e
ValueError: Length of ref must match with location. [ENST00000419434.5]

VEP entry producing the problem:

rs143784264 chr17:29969183-29969185 - ENSG00000176927.15 ENST00000419434.5 Transcript start_lost,inframe_deletion 1-3 1-3 1 K/- AAG/- - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34

I believe this is an edge case due to the the deletion being at the very start of the transcript.

In VEPParser.py, line 132, this results in the alt_start being -1.

lydiayliu commented 3 years ago

CPCG0346.gencode.aa.tsv
[ 2021-07-13 20:07:16 ] moPepGen parseVEP started.
[ 2021-07-13 20:08:18 ] Indexed genome and annotation loaded.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/parser/VEPParser.py", line 162, in convert_to_variant_record
    return seqvar.VariantRecord(
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/seqvar/VariantRecord.py", line 62, in __init__
    raise ValueError("Length of ref must match with location.")
ValueError: Length of ref must match with location.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/moPepGen", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 350, in main
    args.func(args)
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/cli/parse_vep.py", line 62, in parse_vep
    record = record.convert_to_variant_record(transcript_seq)
  File "/opt/conda/lib/python3.8/site-packages/moPepGen/parser/VEPParser.py", line 175, in convert_to_variant_record
    raise ValueError(e.args[0] + f' [{self.feature}]') from e
ValueError: Length of ref must match with location. [ENST00000536027.1]

VEP entry

chr16_71922627_ATGCCT/- chr16:71922627-71922632 - ENSG00000182149.21 ENST00000536027.1 Transcript start_lost,inframe_deletion,NMD_transcript_variant 1-6 1-6 1-2 MP/- ATGCCT/- - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34

zhuchcn commented 3 years ago

That is an edge case that I didn't consider. Thanks for reporting it. I'll fix it once the rMATS parser branch is merge. Because some code in that branch could be useful.

uclahs-cds / package-moPepGen

Documenting a parseVEP bug #21