matsengrp / cft

Clonal family tree

process_partis.py crashes while processing QB850.001-Vh/Hs-LN1-5RACE-IgG-new-cluster-annotations.csv #48

Closed cswarth closed 7 years ago

cswarth commented 7 years ago

Attempting to process partis output file in /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/

process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new.csv --cluster_base cluster --output_dir postpartis_out/QB850.001-Vh/Hs-LN1-5RACE-IgG-new --separate --chain h

Output, with some additional debugging statements added:

+ process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new.csv --cluster_base cluster --output_dir postpartis_out/QB850.001-Vh/Hs-LN1-5RACE-IgG-new --separate --chain h
glfo = <type 'dict'>
region = v
line[v_gene] = IGHV1-18*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = d
line[d_gene] = IGHD1-7*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = j
line[j_gene] = IGHJ6*03
glfo['seqs'].keys() = ['j', 'd', 'v']
region = v
line[v_gene] = IGHV1-18*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = d
line[d_gene] = IGHD5-12*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = j
line[j_gene] = IGHJ4*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = v
line[v_gene] = IGHV1-18*1m
glfo['seqs'].keys() = ['j', 'd', 'v']
Traceback (most recent call last):
  File "/shared/silo_researcher/Matsen_F/MatsenGrp/working/cwarth/cft/bin/process_partis.py", line 293, in <module>
    main()
  File "/shared/silo_researcher/Matsen_F/MatsenGrp/working/cwarth/cft/bin/process_partis.py", line 274, in main
    args.chain)
  File "/shared/silo_researcher/Matsen_F/MatsenGrp/working/cwarth/cft/bin/process_partis.py", line 145, in process_data
    calculate_bounds(cluster.to_dict(), glfo)
  File "/shared/silo_researcher/Matsen_F/MatsenGrp/working/cwarth/cft/bin/process_partis.py", line 101, in calculate_bounds
    utils.add_implicit_info(glfo, line)
  File "/home/cwarth/src/matsen/partis/python/utils.py", line 831, in add_implicit_info
    uneroded_gl_seq = glfo['seqs'][region][line[region + '_gene']]
KeyError: 'IGHV1-18*1m'

cswarth commented 7 years ago

process_partis.py also crashes on the following files, though for differing reasons.

process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new.csv --cluster_base cluster --output_dir postpartis_out/QB850.001-Vh/Hs-LN1-5RACE-IgG-new --separate --chain "h"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.016-VL/Hs-LN3-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.016-VL/Hs-LN3-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QA255.016-VL/Hs-LN3-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.016-VL/Hs-LN2-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.016-VL/Hs-LN2-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QA255.016-VL/Hs-LN2-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.026-VL/Hs-LN1-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.026-VL/Hs-LN1-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QB850.026-VL/Hs-LN1-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.026-VL/Hs-LN4-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.026-VL/Hs-LN4-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QB850.026-VL/Hs-LN4-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.417-VL/Hs-LN1-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.417-VL/Hs-LN1-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QB850.417-VL/Hs-LN1-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.417-VL/Hs-LN4-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.417-VL/Hs-LN4-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QB850.417-VL/Hs-LN4-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-VL/Hs-LN1-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-VL/Hs-LN1-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QB850.001-VL/Hs-LN1-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-VL/Hs-LN4-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-VL/Hs-LN4-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QB850.001-VL/Hs-LN4-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.067-VL/Hs-LN3-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.067-VL/Hs-LN3-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QA255.067-VL/Hs-LN3-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.067-VL/Hs-LN2-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.067-VL/Hs-LN2-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QA255.067-VL/Hs-LN2-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.006-VL/Hs-LN3-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.006-VL/Hs-LN3-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QA255.006-VL/Hs-LN3-5RACE-IgL-100k --separate --chain "L"
process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.006-VL/Hs-LN2-5RACE-IgL-100k-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.006-VL/Hs-LN2-5RACE-IgL-100k.csv --cluster_base cluster --output_dir postpartis_out/QA255.006-VL/Hs-LN2-5RACE-IgL-100k --separate --chain "L"

dawahs commented 7 years ago

For the first problem file, the partis utils only search through known V genes in the IMGT database, and IGHV1-18*1m is not one of them. We had this issue before with a V gene of the form IGHV1-18+C1M or some such, which the line

line['v_gene'] = line['v_gene'].split('+')[0]

is used to guard against. Perhaps we need a better way to handle the cases where partis infers a gene that has no exact match in the IMGT database. For example, I assume IGHV1-18*1m should be changed to IGHV1-18*01, but is there some way to enumerate all the cases in which partis will modify a gene's IMGT tag?

Posting for posterity, but I'll look into it.
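
For reference, a minimal sketch of why the '+' guard does not help here, and of a lookup that fails with a more descriptive message than the bare KeyError in the traceback. The helper names (strip_partis_suffix, lookup_germline_seq) and the toy germline dict are hypothetical; only the glfo['seqs'][region][gene] layout and the '+' split come from the code quoted in this thread, and the sequence is a placeholder, not real IMGT data.

```python
def strip_partis_suffix(gene):
    """Drop anything after '+', e.g. 'IGHV1-18+C1M' -> 'IGHV1-18'."""
    return gene.split('+')[0]


def lookup_germline_seq(glfo, line, region):
    """Return the germline sequence for line[region + '_gene'].

    Raises a KeyError that names the region and the gene when the gene
    is absent from the germline set.
    """
    gene = line[region + '_gene']
    seqs = glfo['seqs'][region]
    if gene not in seqs:
        raise KeyError(
            "%s gene %r not in germline set (%d %s genes known)"
            % (region, gene, len(seqs), region))
    return seqs[gene]


# Toy germline set; the sequence is a placeholder, not real IMGT data.
toy_glfo = {'seqs': {'v': {'IGHV1-18*01': 'ACGT'}, 'd': {}, 'j': {}}}

line = {'v_gene': 'IGHV1-18*1m'}
# The name contains no '+', so the guard leaves it unchanged.
line['v_gene'] = strip_partis_suffix(line['v_gene'])

try:
    lookup_germline_seq(toy_glfo, line, 'v')
except KeyError as err:
    print(err)
```

This only makes the failure easier to read. It does not decide which germline gene IGHV1-18*1m should map to.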

matsen commented 7 years ago

@dawahs this could be partis-inferred genes, which should be stored as part of the output.

cc @psathyrella

psathyrella commented 7 years ago

I can't see the whole context, but in general the problem is probably this: if you use --parameter-dir to cache parameters, you must use it during all subsequent steps, since partis gets the germline set from the parameter dir. In this particular case, the 1m is a caprisa germline, if that helps.
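
A pre-flight check along the following lines might make this kind of mismatch easier to spot before the crash. It is only a sketch: it assumes the annotations are a comma-separated file with v_gene, d_gene and j_gene columns and that the germline set has the glfo['seqs'][region] layout seen in the traceback. The function name find_missing_genes, the toy rows, and the placeholder sequences are all hypothetical, and loading the germline set from the parameter dir is deliberately left out.

```python
import csv
import io

REGIONS = ('v', 'd', 'j')


def find_missing_genes(rows, glfo):
    """Return {region: sorted genes} used in rows but absent from glfo."""
    missing = {}
    for region in REGIONS:
        used = {row[region + '_gene'] for row in rows}
        absent = sorted(used - set(glfo['seqs'][region]))
        if absent:
            missing[region] = absent
    return missing


# Toy annotations standing in for a *-cluster-annotations.csv file.
toy_annotations = io.StringIO(
    "v_gene,d_gene,j_gene\n"
    "IGHV1-18*01,IGHD1-7*01,IGHJ6*03\n"
    "IGHV1-18*1m,IGHD5-12*01,IGHJ4*01\n"
)

# Toy germline set; sequences are placeholders, not real germline data.
toy_glfo = {
    'seqs': {
        'v': {'IGHV1-18*01': 'ACGT'},
        'd': {'IGHD1-7*01': 'ACGT', 'IGHD5-12*01': 'ACGT'},
        'j': {'IGHJ6*03': 'ACGT', 'IGHJ4*01': 'ACGT'},
    }
}

rows = list(csv.DictReader(toy_annotations))
missing = find_missing_genes(rows, toy_glfo)
for region, genes in missing.items():
    print("%s genes missing from germline set: %s" % (region, ", ".join(genes)))
```

A non-empty result would suggest the germline set being read is not the one the annotations were produced with.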

dawahs commented 7 years ago

Yes, that is helpful. Thanks.