Closed kippakers closed 8 years ago
Thanks for the bug report! I'll add this to the Varcode unit tests and then try to figure out why it happens.
On Mon, Jul 20, 2015 at 10:18 AM, kippakers notifications@github.com wrote:
Scanning my logs, a good chunk of my failed jobs have something along these lines with a double mutation having an off-by-one location end:
ValueError: Expected variant 1:109461324 'GG' > 'TT' to end at 109461325 but got 109461326
This appears to be arising from varcode (I'm running topiary):
INFO:root:Building MHC binding prediction type for alleles ['HLA-A_30:01', 'HLA-A_02:01', 'HLA-B_38:01', 'HLA-B_48:01', 'HLA-C_08:03', 'HLA-C_12:03'] and epitope lengths [9]
INFO:root:netMHCcons finished with return code 0
INFO:root:netMHCcons took 0.0690 seconds
Traceback (most recent call last):
  File "/hpc/users/akersn01/.local/bin/topiary", line 5, in
    pkg_resources.run_script('topiary==0.0.6', 'topiary')
  File "/hpc/packages/minerva-common/py_packages/2.7/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 461, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/hpc/packages/minerva-common/py_packages/2.7/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 1194, in run_script
    execfile(script_filename, namespace, namespace)
  File "/hpc/users/akersn01/.local/lib/python2.7/site-packages/topiary-0.0.6-py2.7.egg/EGG-INFO/scripts/topiary", line 99, in
    main()
  File "/hpc/users/akersn01/.local/lib/python2.7/site-packages/topiary-0.0.6-py2.7.egg/EGG-INFO/scripts/topiary", line 59, in main
    variants = variant_collection_from_args(args)
  File "/hpc/users/akersn01/.local/lib/python2.7/site-packages/topiary-0.0.6-py2.7.egg/topiary/args.py", line 61, in variant_collection_from_args
    variant_collections.append(varcode.load_maf(maf_path))
  File "/hpc/users/akersn01/.local/lib/python2.7/site-packages/varcode-0.3.12-py2.7.egg/varcode/maf.py", line 163, in load_maf
    end_pos))
Here is the offending line of the MAF:
GPSM2 29899 broad.mit.edu 37 1 109461324 109461326 + Missense_Mutation DNP GG TT TT TCGA-BF-A1Q0-01A-21D-A19A-08 TCGA-BF-A1Q0-10A-01D-A19A-08 Untested Somatic Phase_I WXS none Illumina GAIIx a8597b25-8541-43e0-b46c-e54e2eaca473 9ae35461-c8d7-4a2e-88b9-8bf14458a975 g.chr1:109461324_109461326GG>TT uc010ovc.2 + 11 1849_1851 c.1353_1355GG>TT c.(1351-1356)aagggg>aaTTg p.451_452KG>N AKNAD1_uc010ovb.2_Intron|GPSM2_uc010ovd.2_Missense_Mutation_p.451_452KG>N|GPSM2_uc010ove.1_Missense_Mutation_p.451_452KG>N NM_013296 NP_037428 P81274 GPSM2_HUMAN Homo sapiens G-protein signaling modulator 2 (GPSM2), mRNA. 451 G-protein coupled receptor protein signaling pathway cell cortex|nucleus GTPase activator activity|identical protein binding breast(2)|central_nervous_system(1)|endometrium(1)|kidney(1)|large_intestine(4)|liver(2)|lung(3) 14 all_epithelial(167;7.64e-05)|all_lung(203;0.000321)|Lung NSC(277;0.000626) Colorectal(144;0.0353)|Lung(183;0.0984)|COAD - Colon adenocarcinoma(174;0.129)|Epithelial(280;0.175)|all cancers(265;0.209) ACAGACTGAAGGGGAAAAAATAC 0.374000 495 14 0 0 6.4e-05 0 0
— Reply to this email directly or view it on GitHub https://github.com/hammerlab/varcode/issues/105.
Looking at this again, are you sure the MAF is right here? If the mutation starts at 109461324 and affects two nucleotides, shouldn't it end at 109461325? I thought the coordinates here were 1-based and inclusive.
End_Position: Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate (inclusive, 1-based coordinate system).
(https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification)
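To spell out the arithmetic: with 1-based, inclusive coordinates, a reference allele of length n starting at position s should end at s + n - 1. Here is a minimal sketch of that check using the values from the MAF row above. The helper names are made up for illustration and are not part of Varcode's API, and the sketch assumes the reference allele is a plain nucleotide string (so not an insertion, where the MAF reference field is "-").

```python
def expected_end(start, ref):
    """End position of a 1-based, inclusive interval covering `ref`.

    Hypothetical helper for illustration; assumes `ref` is a plain
    nucleotide string such as "GG", not the "-" used for insertions.
    """
    return start + len(ref) - 1


def end_is_consistent(start, end, ref):
    """True when the reported end matches the reference allele length."""
    return end == expected_end(start, ref)


# Start, end, and reference allele from the MAF row quoted in this issue.
start, end, ref = 109461324, 109461326, "GG"

print(expected_end(start, ref))            # 109461325
print(end_is_consistent(start, end, ref))  # False
```

So the reported end of 109461326 is one past what a two-base reference allele implies, which matches the ValueError in the original report.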
Yeah, that logic makes sense to me. #115 is a good idea though, because this seems to happen with some frequency. There must be a bug somewhere in the TCGA MAF generating pipeline.