rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

python error during merging MSA #4

Closed Koen-vdl closed 4 years ago

Koen-vdl commented 4 years ago

Hi Ryan,

First of all, kudos for developing Trycycler. I love how it allows me to assemble genomes without making arbitrary calls when comparing the output of different long read assembly pipelines.

I wanted to report a Python error I encountered while running trycycler msa. The error below is printed to the screen, and no 3_msa.fasta is written.

Merging MSA (2020-09-28 17:51:51)
    Each of the MSA pieces are now merged together and saved to file.

MSA length: 4,751,608 bp
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/bin/trycycler", line 33, in <module>
    sys.exit(load_entry_point('Trycycler==0.3.0', 'console_scripts', 'trycycler')())
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/trycycler/__main__.py", line 40, in main
    msa(args)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/trycycler/msa.py", line 35, in msa
    merge_pieces(temp_dir, args.cluster_dir, seqs)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/trycycler/msa.py", line 168, in merge_pieces
    assert seqs[n] == msa_minus_dashes
AssertionError

I have previously used Trycycler to successfully assemble two other isolates on this system. The error only occurs with this specific isolate.

I uploaded the contigs produced by trycycler reconcile here in case you would like to have a look: https://koenvdl.stackstorage.com/s/OUkaDUwU2xb7M7Q4

Cheers

rrwick commented 4 years ago

Thank you for the bug report! And thanks for providing the contig file - that made it a lot easier to debug.

The bug

The problem turned out to be that MUSCLE crashes on two of the sequence pieces, giving an error like this:

[1]    85970 bus error  muscle -diags -in 000000000358.fasta -out 000000000358_msa.fasta

Trycycler did not verify that MUSCLE ran to completion on every piece. So when it stitched the pieces together, it fell over on the assertion in merge_pieces: with the dashes removed, the merged alignment no longer matched the original, pre-alignment sequence.
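To make the missing check concrete, here is a minimal sketch of how a per-piece run could be verified. It is not Trycycler's actual code: the function names `piece_completed` and `matches_unaligned` are made up for illustration, and it assumes each piece is aligned by an external command that writes one output file.

```python
import subprocess
from pathlib import Path


def piece_completed(command, output_path):
    """Run one alignment command and report whether it clearly finished.

    Hypothetical helper, not part of Trycycler. On POSIX systems a process
    killed by a signal (such as a bus error) gets a negative return code,
    and it may leave its output file missing or empty.
    """
    result = subprocess.run(command, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    output = Path(output_path)
    return (result.returncode == 0
            and output.is_file()
            and output.stat().st_size > 0)


def matches_unaligned(aligned_seq, original_seq):
    """Check that removing gap characters recovers the original sequence."""
    return aligned_seq.replace('-', '') == original_seq
```

For a piece like the one above, the call might look like `piece_completed(['muscle', '-diags', '-in', '000000000358.fasta', '-out', '000000000358_msa.fasta'], '000000000358_msa.fasta')`. Checking both the return code and the output file catches a crash early, before the pieces are merged.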

The fix

Since the crash happens in MUSCLE, I can't really fix the true source of the problem. Googling 'MUSCLE' and 'bus error' revealed that I'm not the only one to see this, but no solution came to light. Even though I don't really get why it crashes, I can see that the sequences for this piece have very different lengths. Most are 30 kbp but one is 9 kbp and one is 3 kbp. So that's almost certainly part of the problem.
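A length mismatch like this can be spotted with a quick check before running the MSA. The following is only a sketch: `length_outliers` and its `min_fraction` threshold are hypothetical, the sequence names in the usage note are placeholders, and Trycycler itself does not necessarily do anything like this.

```python
from statistics import median


def length_outliers(lengths, min_fraction=0.5):
    """Return the names of sequences whose length is far from the median.

    `lengths` maps sequence name -> length in bp. A sequence is flagged if
    it is shorter than min_fraction * median or longer than
    median / min_fraction. The 0.5 default is an arbitrary assumption.
    """
    typical = median(lengths.values())
    return sorted(
        name for name, length in lengths.items()
        if length < min_fraction * typical or length > typical / min_fraction
    )
```

With made-up names and the lengths described above, `length_outliers({'seq_1': 30000, 'seq_2': 30000, 'seq_3': 30000, 'seq_4': 9000, 'seq_5': 3000})` flags `seq_4` and `seq_5`.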

So for your specific genome, I think the solution is to remove the divergent contigs from this cluster. Specifically, that's I_Utg2200, J_Utg2318 and K_Utg2286. Without them, the MSA will be easier and should run to completion. In fact, I'm surprised that these contigs made it through the final check at the end of trycycler reconcile, considering how big of an indel they have compared to the other contigs.

I also made a little change to Trycycler. Instead of crashing with an assertion error, it now displays this message:

Error: MUSCLE failed to complete on 2 of the 4683 pieces. Please remove the most divergent sequences from this cluster and then try again.
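The idea behind the new message can be sketched roughly as follows. This is an illustration under assumptions, not the code that went into Trycycler: `failure_message` is a hypothetical name, and it assumes the per-piece results have already been collected as a list of booleans.

```python
def failure_message(piece_results):
    """Build an error message if any piece failed, otherwise return None.

    `piece_results` is a list of booleans, one per MSA piece, where True
    means the aligner finished successfully on that piece.
    """
    total = len(piece_results)
    failed = sum(1 for ok in piece_results if not ok)
    if failed == 0:
        return None
    return (f'Error: MUSCLE failed to complete on {failed} of the {total} '
            f'pieces. Please remove the most divergent sequences from this '
            f'cluster and then try again.')
```

A caller would print the message and exit with a non-zero status instead of continuing on to the merge step, so the user sees an actionable error rather than a bare AssertionError.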

Not a perfect solution, but it's the best I've got!

Thanks again, and I'm going to close this issue now. But feel free to reopen it if you continue to struggle with this issue.