psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
55 stars 34 forks source link

Unequal sequence length exception #212

Closed Irrationone closed 7 years ago

Irrationone commented 8 years ago

Hello,

I'm getting an unequal sequence exception thrown during the hmm part of cache_parameters that supposedly due to naive_sequence length =/= mature seq length. (I pulled partis on Aug 24). I'm working with very large sequence files so it might take awhile to identify the sequence that causes this. Is there a workaround/fix for this? Traceback below.

Thanks!

Traceback (most recent call last):
  File "./bin/partis.py", line 398, in <module>
    args.func(args)
  File "./bin/partis.py", line 146, in run_partitiondriver
    auto_ran_cache_parameters = check_maybe_auto_cache_parameters(args, gldir)
  File "./bin/partis.py", line 35, in check_maybe_auto_cache_parameters
    parter.cache_parameters()
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 237, in cache_parameters
    self.run_hmm('viterbi', parameter_in_dir=self.sw_param_dir, parameter_out_dir=self.hmm_param_dir, count_parameters=True)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 672, in run_hmm
    new_cpath = self.read_hmm_output(algorithm, n_procs, count_parameters, parameter_out_dir, precache_all_naive_seqs)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 1129, in read_hmm_output
    self.read_annotation_output(self.hmm_outfname, count_parameters=count_parameters, parameter_out_dir=parameter_out_dir, outfname=self.args.outfname)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 1280, in read_annotation_output
    eroded_line = utils.reset_effective_erosions_and_effective_insertions(self.glfo, padded_line, aligned_gl_seqs=self.aligned_gl_seqs)  #, padfo=self.sw_info)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/utils.py", line 552, in reset_effective_erosions_and_effective_insertions
    add_implicit_info(glfo, line, aligned_gl_seqs=aligned_gl_seqs)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/utils.py", line 665, in add_implicit_info
    line['mut_freqs'] = [hamming_fraction(line['naive_seq'], mature_seq) for mature_seq in line['seqs']]
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/utils.py", line 1212, in hamming_fraction
    distance, len_excluding_ambig = hamming_distance(seq1, seq2, extra_bases=extra_bases, return_len_excluding_ambig=True)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/utils.py", line 1180, in hamming_distance
    raise Exception('unequal length sequences %d %d' % (len(seq1), len(seq2)))
Exception: unequal length sequences 275 272
psathyrella commented 8 years ago

Sorry, it's really hard to say just from the trace. I definitely in general try to not throw an exception unless something has happened which really absolutely shouldn't be allowed to happen (i.e. not just because a sequence has a weird annotation). And it looks like that's the case. Any chance you can pass me the data and the command line you ran? I'd be happy to run on it.

Irrationone commented 8 years ago

That would be really helpful, thanks! I'll ask for permission to share the sequences, and if I can get it, message you with the info.

psathyrella commented 8 years ago

oh, wait, no, this just happened to me. Looks like I screwed up something in the interaction between the new smith-waterman cache file reading and N-padding. I'll push a fix.

psathyrella commented 8 years ago

ok, should be fixed. Please reopen if not, though. Thanks for noticing it.

Irrationone commented 8 years ago

This is still happening -- though it's no longer an issue with small input files. I ran a test example of 10k sequences with no errors, but the ~30-40 cases with 300k+ sequences that I've run have all failed with the unequal sequence exception.

Unfortunately, I still haven't been able to get permission to share the files. However, I noticed that you updated the exception code to print the naive and mature sequences. I'm not sure how to map the mature sequences back to the input file to find problematic sequences (I couldn't find exact matches), and I'm not even sure if that would help if the hmm parameters depend on the entire input file. To test this, I mapped a mature_seq reported by an exception back to an input sequence w/ high similarity, and ran the input sequence again through partis, this time with 250 accompanying sequences. That run did not fail.

Here are some examples of unequal seq length exceptions, if they help:

Exception: unequal length sequences 392 390:
  GTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGACAGGTCACCTTGAGGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAATGTGTGTGAGCTGGATCCGTCAGCCCCCGGGGAAGGCCCTGGAGTGGCTTGCACTCATTGATTGGGATGATGATAAATACTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACGGCCGTGTATTACTGTAGTGTCTA
  GTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTCACAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTGGAGTGGCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTCTTCAAGATTGCTCGAGGCCTCCAAAATCCCGTAGACCCTTTTCTATTCAGCCTTTGATCAATGCAATGCGACAGCCTCATGCTGATGGGTC

Exception: unequal length sequences 293 292:
  GCAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCATTCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGATGTGTCC
  GCGTTGAGGCTTGCGTTTATGGTACGCTGGACTGTTGAAGCCTTGTAGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACCATCATGGAAGGCGCTGAAGTGGTTTACGGAAAACATTATTAATGGCGTCACCAACTACAGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACACGCGCAGGAAACAGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCAGAAGGAGTGATGTGTC

Exception: unequal length sequences 402 352:
  CTGGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGGTTCACCTTCAGTGGCTCTGCTATGCACTGGGTCCGCCAGGCTTCCGGGAAAGGGCTGGAGTGGGTTGGCCGTATTAGAAGCAAAGCTAACAGTTACGCGACAGCATATGCTGCGTCGGTGAAAGGCAGGTTCACCATCTCCAGAGATGATTCAAAGAACACGGCGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACGGCCGTGTATTACTGTACTAGACAGTGGTGTATGCTATACTATCTGGGGCCAAGGGACAATGGTCACCGTCTC
  CTGGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTATCTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGAGNC

Exception: unequal length sequences 278 277:
  GCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAATCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTACCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGATGTATTACG
  GTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGACTGGAGTGAATTGGCACAATGCTACAATGTGCTCCCCCAAACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTCGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGTGACCGCCGCGGAAACGCCCTCTTAAGGATATTCGCGATGAGTGNC

(PS I can't reopen this issue since I'm not a collaborator)

psathyrella commented 8 years ago

arg, sorry. This really, really shouldn't be happening. I will track it down right away.

psathyrella commented 8 years ago

could you give me the stack trace? It'd be nice, but I can't really just handle exceptions everywhere the hamming fraction function gets called, so I need to handle this a couple levels above that.

thanks

Irrationone commented 8 years ago

Here's the traceback:

Traceback (most recent call last):
  File "./bin/partis.py", line 504, in <module>
    args.func(args)
  File "./bin/partis.py", line 224, in run_partitiondriver
    _ = check_maybe_auto_cache_parameters(args, gldir)
  File "./bin/partis.py", line 53, in check_maybe_auto_cache_parameters
    parter.cache_parameters()
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 242, in cache_parameters
    self.run_hmm('viterbi', parameter_in_dir=self.sw_param_dir, parameter_out_dir=self.hmm_param_dir, count_parameters=True)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 667, in run_hmm
    new_cpath = self.read_hmm_output(algorithm, n_procs, count_parameters, parameter_out_dir, precache_all_naive_seqs)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 1119, in read_hmm_output
    self.read_annotation_output(self.hmm_outfname, count_parameters=count_parameters, parameter_out_dir=parameter_out_dir, outfname=self.args.outfname)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 1270, in read_annotation_output
    eroded_line = utils.reset_effective_erosions_and_effective_insertions(self.glfo, padded_line, aligned_gl_seqs=self.aligned_gl_seqs)  #, padfo=self.sw_info)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/utils.py", line 674, in reset_effective_erosions_and_effective_insertions
    add_implicit_info(glfo, line, aligned_gl_seqs=aligned_gl_seqs)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/utils.py", line 787, in add_implicit_info
    line['mut_freqs'] = [hamming_fraction(line['naive_seq'], mature_seq) for mature_seq in line['seqs']]
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/utils.py", line 1342, in hamming_fraction
    distance, len_excluding_ambig = hamming_distance(seq1, seq2, extra_bases=extra_bases, return_len_excluding_ambig=True)
  File "/shahlab/azhang/projects/ITH_Immune/software/partis/python/utils.py", line 1310, in hamming_distance
    raise Exception('unequal length sequences %d %d:\n  %s\n  %s' % (len(seq1), len(seq2), seq1, seq2))
Exception: unequal length sequences 369 368:
  CTGGCTTTCAAAACACCGCGCCTTACGGTTACAGTATGCCCATCACAGATCGCAACACCCAGGACGCAGGTCACCTTGAAGGAGTCTGGTCCTGCGCTGGTGAAACCCACAGAGACCCTCACGCTGACCTGCACTCTCTCTGGGTTCTCACTCAGCACTTCTGGAATGGGTATGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCTCACATTTTTTTGAATGACAAAAAATCCTACAGCACGTCTCTGAAGAACAGGCTCATCATCTCCAAGGACACCTCCAAAAGCCAGGTGGTCCTTACCATGACCAACATGGACCCTGTGGACACAGCCACGTATTACTGTGCATGCTCGTCC
  CTGGCTTTCAAAACACCGCGCCTTACGGTTACAGTATGCCCATCACAGATCGCAACACCCAGGACGCTTTTTCCCGCTCTTGGTGGCTGCTGTTGATTCTAAAAGCTACCCCCACTCCCCCCTATATGGCTTTTGGTTTCTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTGGAGTGGCTTGCAAAGAATGGAACAACTCACTAAAAACCAAGCTGGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCACAACTTACCAAGCTGGGTTACGACGCGTC
psathyrella commented 8 years ago

ok, thanks. That's actually good - it's crashing somewhere different to where it was before. Still bad, but at least it's a new problem, although it still shouldn't be able to happen.

As philosophical background: in general when something goes wrong, this is because we're processing a weird/messed up sequence, and the appropriate response is to mark the sequence as invalid, and keep on going. There's tons of places where this is how we handle things.

But, there's a smaller class of events that just, like, totally shouldn't be able to happen, and indicate something's wrong with the code that needs to be fixed. We're in this situation, which is why I can't just tell it to handle the exception by default, because the exception shouldn't be able to happen. But, I can't actually replicate it -- I tried with all the combinations of mature sequences in the error above, but it doesn't crash -- which means the problem likely stems from faulty SHM indel handling, and the mature sequences printed above would be with the indels "reversed", which would be why they don't match anything in your input file.

So, as such things go, we compromise. I'm just merging in a change that will print a ton of information about the sequence, mark it as invalid, and keep going. Hopefully I'll be able to replicate it once we get that information, but it also will at that point not crash on you, so you can get your results and not be waiting any longer for me. It should print something like this:

p

psathyrella commented 8 years ago

docker hub and debian.org seem to not be playing nicely, so if you're running of docker images you won't be able to pull yet

Irrationone commented 8 years ago

Thanks for implementing the exception handling -- I'll let you know how it goes. Just started the new jobs so ETA to error messages is likely 2-3 days from now.

By the way, why are the indels reversed? When I was digging around the code and trying to map mature seqs to similar seqs in the input I figured this might be the case, but didn't look into it further as it would have been tedious to annotate indel boundaries.

psathyrella commented 8 years ago

There's more details in the annotation paper, but the way we handle SHM indels is just within the initial smith-waterman step, which reverses them and passes the result to the hmm. It would not be difficult, in principle, to implement shm indels in the hmm, but it would be computationally infeasible, i.e. orders of magnitude slower.

yeah, definitely don't spend too much time on it, it so far looks like just a bug, I screwed up.

On Thu, Sep 15, 2016 at 10:22 PM, Irrationone notifications@github.com wrote:

Thanks for implementing the exception handling -- I'll let you know how it goes. Just started the new jobs so ETA to error messages is likely 2-3 days from now.

By the way, why are the indels reversed? When I was digging around the code and trying to map mature seqs to similar seqs in the input I figured this might be the case, but didn't look into it further as it would have been tedious to annotate indel boundaries.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/psathyrella/partis/issues/212#issuecomment-247520614, or mute the thread https://github.com/notifications/unsubscribe-auth/AC0rc6avuLfyqczZSsbCg691yc7HJYlAks5qqieAgaJpZM4J0psS .

psathyrella commented 7 years ago

oh, rad, one of my jobs from a really big sample managed to scrounge up three sequences that throw this exception. That should be enough for me to track it down. All three have multiple shm indels, which is both really hard to get right, and hasn't been cleaned up in a while.

Irrationone commented 7 years ago

Two of my jobs have reached the stage where they start to print out these errors -- it's quite an extensive list so it's a little cumbersome to post here. If they'd be helpful, how do you want me to send the error messages to you?

psathyrella commented 7 years ago

yeah, definitely. maybe dropbox public? i think I have some idea why it happens, but I just spent a whole day *#(%ing trying to get the three sequences I mentioned above to actually crash again, so it'd be really nice if I had some that actually crashed, rather than just being weird in intriguing ways.

Irrationone commented 7 years ago

Put them on google drive; ran out of space on my dropbox:

https://drive.google.com/open?id=0B5MDPqi4noNfNzhCaWNQcVVRN0k https://drive.google.com/open?id=0B5MDPqi4noNfWW1wVm4xQmRUSlU

psathyrella commented 7 years ago

great, thanks.

Is there a reason you're running with only one process? Annotation would go vastly faster if you used more than one.

psathyrella commented 7 years ago

ah, perfect. One of those sequences actually causes the crash. Easy to fix now...

Irrationone commented 7 years ago

Only because I don't want to overload the compute cluster I'm using -- I'm running ~90 samples concurrently on compute nodes. I made a fork that worked with SGE rather than slurm, but the last time I ran it some people got upset, haha.

psathyrella commented 7 years ago

ha, that makes sense.

ooh, you did? nice! I was getting close to actually implementing that, now I don't have to. I hadn't noticed the fork before...

Irrationone commented 7 years ago

It was more of a quick patch -- I'll get around soon to fixing it up so it properly accepts user-supplied qsub arguments.