openvax / varcode

Library for manipulating genomic variants and predicting their effects
Apache License 2.0

Double mutations in a MAF file cause error #105

Status: Closed (kippakers closed this issue 8 years ago)

kippakers commented 9 years ago

Scanning my logs, I see that a good chunk of my failed jobs hit an error like the one below, where a double-nucleotide mutation has an end position that is off by one:

ValueError: Expected variant 1:109461324 'GG' > 'TT' to end at 109461325 but got 109461326

This appears to be arising from varcode (I'm running topiary):

INFO:root:Building MHC binding prediction type for alleles ['HLA-A*30:01', 'HLA-A*02:01', 'HLA-B*38:01', 'HLA-B*48:01', 'HLA-C*08:03', 'HLA-C*12:03'] and epitope lengths [9]
INFO:root:netMHCcons finished with return code 0
INFO:root:netMHCcons took 0.0690 seconds
Traceback (most recent call last):
  File "/hpc/users/akersn01/.local/bin/topiary", line 5, in <module>
    pkg_resources.run_script('topiary==0.0.6', 'topiary')
  File "/hpc/packages/minerva-common/py_packages/2.7/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 461, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/hpc/packages/minerva-common/py_packages/2.7/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 1194, in run_script
    execfile(script_filename, namespace, namespace)
  File "/hpc/users/akersn01/.local/lib/python2.7/site-packages/topiary-0.0.6-py2.7.egg/EGG-INFO/scripts/topiary", line 99, in <module>
    main()
  File "/hpc/users/akersn01/.local/lib/python2.7/site-packages/topiary-0.0.6-py2.7.egg/EGG-INFO/scripts/topiary", line 59, in main
    variants = variant_collection_from_args(args)
  File "/hpc/users/akersn01/.local/lib/python2.7/site-packages/topiary-0.0.6-py2.7.egg/topiary/args.py", line 61, in variant_collection_from_args
    variant_collections.append(varcode.load_maf(maf_path))
  File "/hpc/users/akersn01/.local/lib/python2.7/site-packages/varcode-0.3.12-py2.7.egg/varcode/maf.py", line 163, in load_maf
    end_pos))

Here is the offending line of the MAF:

GPSM2   29899   broad.mit.edu   37  1   109461324   109461326   +   Missense_Mutation   DNP GG  TT  TT          TCGA-BF-A1Q0-01A-21D-A19A-08    TCGA-BF-A1Q0-10A-01D-A19A-08                                Untested    Somatic Phase_I WXS none            Illumina GAIIx  a8597b25-8541-43e0-b46c-e54e2eaca473    9ae35461-c8d7-4a2e-88b9-8bf14458a975    g.chr1:109461324_109461326GG>TT uc010ovc.2  +   11  1849_1851   c.1353_1355GG>TT    c.(1351-1356)aagggg>aaTTg   p.451_452KG>N   AKNAD1_uc010ovb.2_Intron|GPSM2_uc010ovd.2_Missense_Mutation_p.451_452KG>N|GPSM2_uc010ove.1_Missense_Mutation_p.451_452KG>N  NM_013296   NP_037428   P81274  GPSM2_HUMAN Homo sapiens G-protein signaling modulator 2 (GPSM2), mRNA. 451                 G-protein coupled receptor protein signaling pathway    cell cortex|nucleus GTPase activator activity|identical protein binding         breast(2)|central_nervous_system(1)|endometrium(1)|kidney(1)|large_intestine(4)|liver(2)|lung(3)    14      all_epithelial(167;7.64e-05)|all_lung(203;0.000321)|Lung NSC(277;0.000626)  Colorectal(144;0.0353)|Lung(183;0.0984)|COAD - Colon adenocarcinoma(174;0.129)|Epithelial(280;0.175)|all cancers(265;0.209)     ACAGACTGAAGGGGAAAAAATAC 0.374000    495         14      0   0   6.4e-05 0   0

iskandr commented 9 years ago

Thanks for the bug report! I'll add this to the Varcode unit tests and then try to figure out why it happens.

On Mon, Jul 20, 2015 at 10:18 AM, kippakers wrote: [original report, quoted in full above]

iskandr commented 8 years ago

Looking at this again, are you sure the MAF is right here? If the mutation starts at 109461324 and affects two nucleotides, shouldn't it end at 109461325? I thought the coordinates here were 1-based and inclusive, as the spec says:

End_Position: Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate (inclusive, 1-based coordinate system).

(https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification)
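
To make the arithmetic concrete, here is a minimal Python sketch of the 1-based inclusive convention quoted above. It is an illustration only, not varcode's actual check; the helper name `expected_end_position` is hypothetical, and it assumes the reference allele is a plain nucleotide string (so not an insertion, which MAF writes as '-').

```python
def expected_end_position(start, ref):
    """Last reference base covered by a variant, in 1-based inclusive coordinates.

    Hypothetical helper for illustration; assumes `ref` is a non-empty string
    of reference nucleotides (insertions, written as '-', are rejected).
    """
    if not ref or ref == "-":
        raise ValueError("ref must contain at least one nucleotide")
    # A variant covering len(ref) bases starting at `start` ends at start + len(ref) - 1.
    return start + len(ref) - 1


# Values taken from the MAF row in this issue: start 109461324, ref 'GG'.
start, ref, reported_end = 109461324, "GG", 109461326
expected = expected_end_position(start, ref)
print(expected)                 # 109461325
print(reported_end - expected)  # 1
```

With a two-base reference allele the variant covers positions 109461324 and 109461325, so the reported end of 109461326 is one base too far.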

kippakers commented 8 years ago

Yeah, that logic makes sense to me. #115 is a good idea though, because this seems to happen with some frequency. There must be a bug somewhere in the TCGA MAF-generating pipeline.
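
One way to gauge how often this happens in a given file is to scan it for rows whose end position disagrees with the start position and reference allele. The sketch below is a rough standalone check, not part of varcode and not necessarily what #115 proposes. It assumes a tab-separated MAF whose header uses the column names Hugo_Symbol, Start_Position, End_Position and Reference_Allele (some files may spell these differently), and the function name `find_end_mismatches` is hypothetical. The second demo row is made up purely for contrast.

```python
import csv
import io


def find_end_mismatches(maf_file):
    """Yield (gene, start, end, ref, expected_end) for rows with inconsistent ends.

    Assumes a tab-separated MAF with a header row; lines starting with '#'
    are skipped. Rows whose reference allele is empty or '-' (insertions)
    are skipped, since the start + len(ref) - 1 rule does not apply to them.
    """
    lines = (line for line in maf_file if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        ref = row["Reference_Allele"]
        if ref in ("", "-"):
            continue
        start = int(row["Start_Position"])
        end = int(row["End_Position"])
        expected = start + len(ref) - 1  # 1-based, inclusive end
        if end != expected:
            yield row.get("Hugo_Symbol"), start, end, ref, expected


# First data row uses the coordinates from this issue; the second is a
# hypothetical corrected row added only to show a passing case.
demo = io.StringIO(
    "Hugo_Symbol\tStart_Position\tEnd_Position\tReference_Allele\n"
    "GPSM2\t109461324\t109461326\tGG\n"
    "GPSM2\t109461324\t109461325\tGG\n"
)
for mismatch in find_end_mismatches(demo):
    print(mismatch)  # ('GPSM2', 109461324, 109461326, 'GG', 109461325)
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` demo object.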