psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
54 stars 34 forks source link

Assertion error when partitioning on a single seed #245

Closed williamdlees closed 6 years ago

williamdlees commented 6 years ago

Hi Duncan, I am trying to partition clones using a single seed with the command

nohup partis partition --infname P2_atleast-2.fasta --outfname partition_BO2C11_clone_1500.csv --n-procs 15 --seed-unique-id 1GTTCAAACAAACTAACATAACT &

This leads to an assertion error implying (to me) that the seed sequence doesn't survive processing. I can, however, see it in _output/P2_atleast-2/sw-cache-819604701847066121.csv and it appears to be parsed as a functional sequence.

My Partis code is up to date.,

Thanks for your help.

Full output:

caching parameters vsearch: 302424 / 303739 v annotations (1315 failed) in 29.0 sec keeping 43 / 239 v genes smith-waterman step seqs procs ig-sw time processing time 1 303739 15 247.1 230.7 2 26734 15 22.9 warning for query 1GCCCCTACATACTTACACACTT: k_v_min + icheck - j_start = 293 + 2 - 294 = 1 [ < 0 or >= len(qrseq) = 1 or len(glseq) = 1] warning for query 1GATTCTACATATTTACACTTCT: k_v_min + icheck - j_start = 293 + 2 - 294 = 1 [ < 0 or >= len(qrseq) = 1 or len(glseq) = 1] 21.4 3 12552 15 12.1 warning for query 1GCCCCTACATACTTACACACTT: k_v_min + icheck - j_start = 293 + 2 - 294 = 1 [ < 0 or >= len(qrseq) = 1 or len(glseq) = 1] warning for query 1GATTCTACATATTTACACTTCT: k_v_min + icheck - j_start = 293 + 2 - 294 = 1 [ < 0 or >= len(qrseq) = 1 or len(glseq) = 1] 7.9 4 4408 15 5.4 warning for query 1GCCCCTACATACTTACACACTT: k_v_min + icheck - j_start = 293 + 2 - 294 = 1 [ < 0 or >= len(qrseq) = 1 or len(glseq) = 1] warning for query 1GATTCTACATATTTACACTTCT: k_v_min + icheck - j_start = 293 + 2 - 294 = 1 [ < 0 or >= len(qrseq) = 1 or len(glseq) = 1] 2.2 info for 299634 / 303739 = 0.986 (4105 failed) kept 17662 (0.058) unproductive removed 288815 / 299634 = 0.96 sequences with cdr3 length different from seed sequence (leaving 10819) removed 1556 / 10819 = 0.14 duplicate sequences after trimming framework insertions (leaving 9263) water time: 630.1 v mutations: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 81 49 11 4 10 10 5 4 7 8 10 4 5 4 5 7 2 1 1 1 2 0 0 0 hv1-1801 106 147 22 14 10 8 6 3 10 7 6 4 8 4 2 3 2 3 4 5 2 0 2 1 hv1-202 85 46 5 13 11 20 18 12 3 6 8 7 7 4 2 1 2 3 0 0 1 2 0 0 hv1-4601 140 71 11 6 11 3 2 7 2 6 1 5 2 4 5 6 1 2 2 0 0 1 0 0 hv1-6901 89 184 21 8 2 5 1 3 5 1 1 3 0 3 1 0 4 0 0 0 0 0 0 0 hv2-502 100 32 19 4 7 8 3 8 2 4 3 3 10 5 6 8 5 8 6 9 0 2 6 0 hv3-1501 93 20 14 8 19 13 9 12 1 7 8 6 4 3 2 0 1 0 0 0 0 0 0 0 hv3-2101 129 29 9 10 13 13 10 7 2 7 4 4 3 1 1 3 4 2 2 0 0 1 5 1 hv3-2304 118 36 18 15 224 23 13 8 81 11 13 17 7 5 5 10 6 2 2 11 6 1 2 1 hv3-3301 38 26 9 7 6 6 6 6 6 7 13 4 8 3 6 9 10 7 2 0 4 4 2 7 hv3-4801 37 8 31 4 4 12 9 18 13 23 25 22 15 14 24 13 11 14 15 5 10 4 1 8 hv3-701 61 19 9 10 18 5 8 12 21 21 8 17 8 8 6 7 16 10 22 10 9 4 11 4 hv3-7401 140 110 26 16 8 5 8 7 5 1 3 13 1 9 1 5 1 3 2 2 1 2 1 1 hv4-3402 167 49 11 7 13 5 1 5 3 5 1 2 4 1 1 4 1 3 1 0 3 0 3 0 hv5-5101 39 75 5 12 4 3 10 9 3 0 0 2 4 1 0 0 0 0 0 0 1 0 0 0 hv6-101 sequence counts: excluded excluded excluded included actually

23 mutations 5p del (>N bases) 3p del (>N bases) total seqs clones used 11 0 (23) 3 (3) 309 193 231 hv1-1801 1 2 (23) 1 (6) 494 319 379 hv1-202 2 1 (23) 0 (6) 326 200 256 hv1-4601 1 1 (24) 3 (6) 329 239 288 hv1-6901 1 2 (22) 0 (6) 504 130 331 hv2-502 9 0 (23) 3 (7) 317 222 258 hv3-1501 0 2 (23) 1 (7) 272 191 220 hv3-2101 2 1 (23) 0 (8) 285 227 260 hv3-2304 17 6 (24) 0 (22) 745 600 635 hv3-3301 24 0 (23) 0 (4) 241 185 196 hv3-4801 25 2 (23) 5 (5) 529 330 340 hv3-701 21 1 (23) 4 (3) 438 304 324 hv3-7401 104 1 (23) 6 (4) 664 248 371 hv4-3402 1 0 (23) 3 (5) 352 237 290 hv5-5101 0 0 (23) 1 (3) 268 80 168 hv6-101 try 0: looking for new alleles among 15 genes (28 genes didn't have enough counts) adding new allele to glfo: template CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCAGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA hv1-202 new CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGACGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCAGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA hv1-202+T147C adding new allele to glfo: template CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCGTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATGGTATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA hv3-3301 new CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATCATATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCTGAGGACACGGCTGTGTATTACTGTGCGAGAGA hv3-3003 removing template gene hv3-3301 smith-waterman (writing parameters) step seqs procs ig-sw time processing time 1 299634 15 264.8 259.7 2 22627 15 22.4 16.4 3 8430 15 10.1 6.5 4 301 15 2.6 0.3 info for 299633 / 303739 = 0.986 (4106 failed) kept 17659 (0.058) unproductive removed 288814 / 299633 = 0.96 sequences with cdr3 length different from seed sequence (leaving 10819) removed 1556 / 10819 = 0.14 duplicate sequences after trimming framework insertions (leaving 9263) writing sw results to _output/P2_atleast-2/sw-cache-819604701847066121 writing parameters (28.8 sec) water time: 1010.7 writing hmms (29.5 sec) hmm prepare_for_hmm: (1.6 sec) running 15 procs calcd: vtb 9277 fwd 0 min-max time: 18.0 - 23.1 sec read output Traceback (most recent call last): File "/d/user5/wlees01/my_as2/partis/bin/partis", line 627, in args.func(args) File "/d/user5/wlees01/my_as2/partis/bin/partis", line 189, in run_partitiondriver parter.cache_parameters() File "/d/as2/u/wlees01/partis/python/partitiondriver.py", line 272, in cache_parameters self.run_hmm('viterbi', parameter_in_dir=self.sw_param_dir, parameter_out_dir=self.hmm_param_dir, count_parameters=True) File "/d/as2/u/wlees01/partis/python/partitiondriver.py", line 946, in run_hmm new_cpath = self.read_hmm_output(algorithm, n_procs, count_parameters, parameter_out_dir, precache_all_naive_seqs) File "/d/as2/u/wlees01/partis/python/partitiondriver.py", line 1442, in read_hmm_output self.read_annotation_output(self.hmm_outfname, count_parameters=count_parameters, parameter_out_dir=parameter_out_dir, outfname=self.args.outfname) File "/d/as2/u/wlees01/partis/python/partitiondriver.py", line 1644, in read_annotation_output assert uidstr not in padded_annotations AssertionError Traceback (most recent call last): File "/d/user5/wlees01/my_as2/partis/bin/partis", line 627, in args.func(args) File "/d/user5/wlees01/my_as2/partis/bin/partis", line 202, in run_partitiondriver check_call(newargv) File "/d/user5/wlees01/.conda/envs/partis/lib/python2.7/subprocess.py", line 186, in check_call raise CalledProcessError(retcode, cmd) CalledProcessError: Command '['/d/user5/wlees01/my_as2/partis/bin/partis', 'cache-parameters', '--infname', 'P2_atleast-2.fasta', '--outfname', 'partition_BO2C11_clone_1500.csv', '--n-procs', '15', '--seed-unique-id', '1GTTCAAACAAACTAACATAACT']' returned non-zero exit status 1 parameter dir '_output/P2_atleast-2' does not exist, so caching a new set of parameters before running action 'partition'

psathyrella commented 6 years ago

Hi William,

hm... sorry about that. So that assertion is while it's looping over the hmm output file, and it's checking that the same group of sequence ids wasn't already read from the file. I.e. it's failing because it read the same group of queries more than once. I've never run into it, and I can't really say why it's happening in this case (if you can pass me the input file so I can replicate that'd be great), but that said it's not that big a deal. It's most likely the seed sequence somehow got annotated twice, so I've changed it to just print a warning for the time being here. Let me know if that helps.

psathyrella commented 6 years ago

ah, nice, I replicated it, so no need for your input file. I usually run parameter caching as a separate step before partitioning, so I'd probably never run with the combination of auto-parameter caching and partitioning with a seed sequence (on multiple processes, which caused the duplication for, uh, complicated reasons).

In any case, that ^ fix should be fine for you, and I'll push a more permanent fix later today.

psathyrella commented 6 years ago

ok I pushed a better fix here. I also noticed that besides the duplicated seed seq (which shouldn't have had an effect), I was removing non-clonal sequences before doing parameter caching. Which, well, was the idea with seed seqs -- ignore everybody that's non-clonal. But the parameters will be more accurate if we cache-parameters with the entire sample (particularly the germline, and particularly if the seed family is small), so I switched to caching parameters on the entire sample, and then removing the non-clonal sequences.

williamdlees commented 6 years ago

Thanks for the quick update, Duncan. Has fixed the problem: Partis has run through to completion now.