konradjk / loftee

MIT License

KeyError: 'start_retained_variant' #28

Closed nli8888 closed 6 years ago

nli8888 commented 6 years ago

When running the following command:

python /data/Install/LOFTEE/loftee-master/src/tableize_vcf.py --vcf /data/Share/nick/Paralog_Anno/data_files/test.out_paraloc --out /data/Share/nick/Paralog_Anno/data_files/test.out_paraloc_tableized --vep_info Amino_acids,Codons,Paralogue_Vars

I get this error:

WARNING: Did not find minimal_representation. Outputting raw positions.
SUCCESS: Found bgzip! Will bgzip the table.
SUCCESS: Found Amino_acids
SUCCESS: Found Codons
SUCCESS: Found Paralogue_Vars
14. FAILED ON LINE: 14    65077986    404110    CATATACTGGAT    C    .    .    ALLELEID=399858;CLNDISDB=MedGen:C1708353,Orphanet:ORPHA29072;CLNDN=Hereditary_Paraganglioma-Pheochromocytoma_Syndromes;CLNHGVS=NC_000014.9:g.65077987_65077997delATATACTGGAT;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Pathogenic;CLNVC=Deletion;CLNVCSO=SO:0000159;GENEINFO=MAX:4149;MC=SO:0001589|frameshift_variant,SO:0001627|intron_variant;ORIGIN=1;RS=1060500101;CSQ=-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000284165|protein_coding|4/4||||360-370|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|intron_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000341653|protein_coding||3/3|||||||||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000358402|protein_coding|3/4||||349-359|184-194|62-65|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000358664|protein_coding|4/5||||342-352|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant&NMD_transcript_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000394606|nonsense_mediated_decay|4/6||||391-401|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant&NMD_transcript_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000553928|nonsense_mediated_decay|4/6||||232-242|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|non_coding_transcript_exon_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000553951|retained_intron|3/3||||288-298||||||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000555419|protein_coding|3/4||||103-113|103-113|35-38|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000555667|protein_coding|3/4||||362-372|184-194|62-65|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|intron_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST000
00555932|protein_coding||1/1|||||||||1||-1||HGNC|HGNC:6913|,-|upstream_gene_variant|MODIFIER|AL139022.1|ENSG00000259118|Transcript|ENST00000556127|antisense_RNA|||||||||||1|4037|1||Clone_based_ensembl_gene||,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000556443|protein_coding|3/3||||362-372|184-194|62-65|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|start_retained_variant&5_prime_UTR_variant|LOW|MAX|ENSG00000125952|Transcript|ENST00000556892|protein_coding|3/4||||335-345|?-2|?-1||||1||-1|cds_end_NF|HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000556979|protein_coding|4/5||||389-399|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|5_prime_UTR_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000557277|protein_coding|4/6||||357-367||||||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000557746|protein_coding|3/5||||362-372|184-194|62-65|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000618858|protein_coding|4/6||||416-426|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|
Traceback (most recent call last):
 File "/data/Install/LOFTEE/loftee-master/src/tableize_vcf.py", line 396, in <module>
   main(args)
 File "/data/Install/LOFTEE/loftee-master/src/tableize_vcf.py", line 346, in main
   raise e
KeyError: 'start_retained_variant'

The error crashes tableize_vcf.py, so none of the variants after the offending one are processed.

test.out_paraloc contains the following. The first variant is parsed fine; the second one causes the error:

##fileformat=VCFv4.1
##VEP="v90" time="2018-05-15 03:04:05" cache="/data/Share/nick/Paralog_Anno/homo_sapiens/90_GRCh38" ensembl-funcgen=90.743f32b ensembl-variation=90.58bf949 ensembl=90.4a44397 ensembl-io=90.9a148ea 1000genomes="phase3" COSMIC="81" ClinVar="201706" ESP="V2-SSA137" HGMD-PUBLIC="20164" assembly="GRCh38.p10" dbSNP="150" gencode="GENCODE 27" genebuild="2014-07" gnomAD="170228" polyphen="2.2.2" regbuild="16" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|Paralogue_Vars">
##Paralogue_Vars=Equivalant variants and locations in paralogous genes
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
14  65077985    29786   G   A   .   .   ALLELEID=38741;CLNDISDB=MedGen:C0027672,SNOMED_CT:699346009|MedGen:C3149711;CLNDN=Hereditary_cancer-predisposing_syndrome|Pheochromocytoma,_susceptibility_to;CLNHGVS=NC_000014.9:g.65077985G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Pathogenic,_risk_factor;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=OMIM_Allelic_Variant:154950.0002;GENEINFO=MAX:4149;MC=SO:0001587|nonsense,SO:0001623|5_prime_UTR_variant,SO:0001627|intron_variant;ORIGIN=1;RS=387906650;CSQ=A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000284165|protein_coding|4/4||||372|223|75|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|intron_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000341653|protein_coding||3/3|||||||||1||-1||HGNC|HGNC:6913|,A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000358402|protein_coding|3/4||||361|196|66|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000358664|protein_coding|4/5||||354|223|75|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|stop_gained&NMD_transcript_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000394606|nonsense_mediated_decay|4/6||||403|223|75|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|stop_gained&NMD_transcript_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000553928|nonsense_mediated_decay|4/6||||244|223|75|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|non_coding_transcript_exon_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000553951|retained_intron|3/3||||300||||||1||-1||HGNC|HGNC:6913|,A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000555419|protein_coding|3/4||||115|115|39|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000555667|protein_coding|3/4||||374|196|66|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|intron_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000555932|protein_coding||1/1|||||||||1||-1||HGNC|HGNC:6913|,A|upstream_gene_variant|MODIFIER|AL139022.1|ENSG00000259118|Transcript|
ENST00000556127|antisense_RNA|||||||||||1|4049|1||Clone_based_ensembl_gene||,A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000556443|protein_coding|3/3||||374|196|66|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000556892|protein_coding|3/4||||347|4|2|R/*|Cga/Tga||1||-1|cds_end_NF|HGNC|HGNC:6913|,A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000556979|protein_coding|4/5||||401|223|75|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|5_prime_UTR_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000557277|protein_coding|4/6||||369||||||1||-1||HGNC|HGNC:6913|,A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000557746|protein_coding|3/5||||374|196|66|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|,A|stop_gained|HIGH|MAX|ENSG00000125952|Transcript|ENST00000618858|protein_coding|4/6||||428|223|75|R/*|Cga/Tga||1||-1||HGNC|HGNC:6913|
14  65077986    404110  CATATACTGGAT    C   .   .   ALLELEID=399858;CLNDISDB=MedGen:C1708353,Orphanet:ORPHA29072;CLNDN=Hereditary_Paraganglioma-Pheochromocytoma_Syndromes;CLNHGVS=NC_000014.9:g.65077987_65077997delATATACTGGAT;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Pathogenic;CLNVC=Deletion;CLNVCSO=SO:0000159;GENEINFO=MAX:4149;MC=SO:0001589|frameshift_variant,SO:0001627|intron_variant;ORIGIN=1;RS=1060500101;CSQ=-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000284165|protein_coding|4/4||||360-370|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|intron_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000341653|protein_coding||3/3|||||||||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000358402|protein_coding|3/4||||349-359|184-194|62-65|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000358664|protein_coding|4/5||||342-352|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant&NMD_transcript_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000394606|nonsense_mediated_decay|4/6||||391-401|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant&NMD_transcript_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000553928|nonsense_mediated_decay|4/6||||232-242|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|non_coding_transcript_exon_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000553951|retained_intron|3/3||||288-298||||||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000555419|protein_coding|3/4||||103-113|103-113|35-38|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000555667|protein_coding|3/4||||362-372|184-194|62-65|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|intron_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000555932|protein_coding||1/
1|||||||||1||-1||HGNC|HGNC:6913|,-|upstream_gene_variant|MODIFIER|AL139022.1|ENSG00000259118|Transcript|ENST00000556127|antisense_RNA|||||||||||1|4037|1||Clone_based_ensembl_gene||,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000556443|protein_coding|3/3||||362-372|184-194|62-65|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|start_retained_variant&5_prime_UTR_variant|LOW|MAX|ENSG00000125952|Transcript|ENST00000556892|protein_coding|3/4||||335-345|?-2|?-1||||1||-1|cds_end_NF|HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000556979|protein_coding|4/5||||389-399|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|5_prime_UTR_variant|MODIFIER|MAX|ENSG00000125952|Transcript|ENST00000557277|protein_coding|4/6||||357-367||||||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000557746|protein_coding|3/5||||362-372|184-194|62-65|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|,-|frameshift_variant|HIGH|MAX|ENSG00000125952|Transcript|ENST00000618858|protein_coding|4/6||||416-426|211-221|71-74|IQYM/X|ATCCAGTATATg/g||1||-1||HGNC|HGNC:6913|

The Python version used was 2.7.6.
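To see which consequence terms an annotated file actually contains, the CSQ entries can be split using the Format string from the header. The following is a minimal sketch, not code from tableize_vcf.py: the function names `parse_csq_format` and `consequence_terms` are hypothetical, and it assumes that INFO entries are separated by `;`, CSQ annotations by `,`, and multiple terms for one transcript by `&`, as in the records above.

```python
import re


def parse_csq_format(header_line):
    """Return the CSQ sub-field names from a ##INFO=<ID=CSQ,...> header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError('no "Format: ..." section found in the CSQ header')
    return match.group(1).split('|')


def consequence_terms(info_field, csq_fields):
    """Collect every consequence term used in one record's INFO column."""
    terms = set()
    for entry in info_field.split(';'):
        if not entry.startswith('CSQ='):
            continue
        for annotation in entry[len('CSQ='):].split(','):
            values = dict(zip(csq_fields, annotation.split('|')))
            # Several terms for one transcript are joined with '&'.
            terms.update(values.get('Consequence', '').split('&'))
    terms.discard('')
    return terms


# Shortened header and record, for illustration only.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">')
info = ('RS=1060500101;CSQ=-|frameshift_variant|HIGH|MAX,'
        '-|start_retained_variant&5_prime_UTR_variant|LOW|MAX')

fields = parse_csq_format(header)
print(sorted(consequence_terms(info, fields)))
```

Running this over every record and comparing the result with the terms the script knows about would show in advance which annotations might trigger a KeyError.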

konradjk commented 6 years ago

VEP appears to have added a new start_retained_variant annotation. I've added that to the list of annotations - if you pull the latest version and try again, does that work?
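The thread does not show how tableize_vcf.py stores its list of annotations. Assuming it is something like a dictionary keyed by consequence term, a lookup that tolerates unknown terms might look like the sketch below. The names `CSQ_ORDER`, `CSQ_RANK`, and `worst_consequence` are hypothetical, and the list is a small illustrative subset, not LOFTEE's actual list.

```python
import sys

# Illustrative subset only, most severe first. Not LOFTEE's actual list.
CSQ_ORDER = [
    'stop_gained',
    'frameshift_variant',
    'start_lost',
    'missense_variant',
    '5_prime_UTR_variant',
    'intron_variant',
]
CSQ_RANK = {term: rank for rank, term in enumerate(CSQ_ORDER)}
UNKNOWN_RANK = len(CSQ_ORDER)  # unknown terms sort after every known term


def worst_consequence(consequence_field):
    """Return the most severe term in an '&'-joined consequence field.

    Terms missing from CSQ_RANK are reported on stderr and ranked last,
    instead of raising KeyError as a plain CSQ_RANK[term] lookup would.
    """
    terms = consequence_field.split('&')
    for term in terms:
        if term not in CSQ_RANK:
            sys.stderr.write('WARNING: unknown consequence term: %s\n' % term)
    return min(terms, key=lambda term: CSQ_RANK.get(term, UNKNOWN_RANK))


print(worst_consequence('start_retained_variant&5_prime_UTR_variant'))
```

Ranking unknown terms last is a design choice for this sketch. It keeps the run going when VEP introduces a new term, at the cost of possibly under-ranking that term until the list is updated.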

nli8888 commented 6 years ago

I get a new error now:

WARNING: Did not find minimal_representation. Outputting raw positions.
SUCCESS: Found bgzip! Will bgzip the table.
SUCCESS: Found Amino_acids
SUCCESS: Found Codons
SUCCESS: Found Paralogue_Vars
14.Traceback (most recent call last):
  File "/data/Share/nick/Paralog_Anno/loftee/src/tableize_vcf.py", line 200, in main
    info_field = dict([(x.split('=', 1)) if '=' in x else (x, x) for x in re.split(';(?=\w)', fields[header['INFO']].replace('"', ''))])
IndexError: list index out of range
FAILED ON LINE: 
Traceback (most recent call last):
  File "/data/Share/nick/Paralog_Anno/loftee/src/tableize_vcf.py", line 431, in <module>
    main(args)
  File "/data/Share/nick/Paralog_Anno/loftee/src/tableize_vcf.py", line 379, in main
    raise e
IndexError: list index out of range

konradjk commented 6 years ago

Hmm, that looks odd. Do you maybe have a blank line at the end of your file (or somewhere in it)?

nli8888 commented 6 years ago

Ah yes, I do: there was an empty line at the end of the file. So for the record, future input files should be checked for empty lines, and any that are found should be removed.

Many thanks, Konrad!
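The blank-line failure above can also be avoided on the reading side. The sketch below is not part of tableize_vcf.py, and `iter_vcf_records` is a hypothetical name. It assumes a tab-delimited VCF and skips header lines and empty lines before splitting, so an empty line never reaches the code that indexes the INFO column.

```python
def iter_vcf_records(lines):
    """Yield (line_number, fields) for data lines of a tab-delimited VCF.

    Header lines (starting with '#') and blank or whitespace-only lines
    are skipped, so a trailing empty line cannot cause an IndexError later.
    """
    for line_number, line in enumerate(lines, start=1):
        stripped = line.rstrip('\r\n')
        if not stripped.strip():
            continue  # blank line: nothing to parse
        if stripped.startswith('#'):
            continue  # meta-information or column header line
        yield line_number, stripped.split('\t')


# Shortened example input with a blank line at the end.
example = [
    '##fileformat=VCFv4.1\n',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n',
    '14\t65077985\t29786\tG\tA\t.\t.\tRS=387906650\n',
    '\n',
]

for line_number, fields in iter_vcf_records(example):
    print(line_number, fields[0], fields[1])
```

The same function accepts an open file handle, since iterating over a file yields its lines.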