rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

KeyError thrown in 'Merging MSA' step of trycycler msa #35

Closed AshSies closed 2 years ago

AshSies commented 2 years ago

Hi there,

I am running Trycycler 0.5.3, and near the end of the msa step in the pipeline, I am getting a KeyError thrown with certain clusters. In the environment I have set up, the msa makes use of MUSCLE v5.1. The traceback directs me to line 175 in msa.py:

line 175, in merge_pieces
    aligned_seq_parts[n].append(parts[n].upper())
KeyError: 'A_contig1'

This has only been an issue with clusters that include larger (> 2 Mb) contigs. Other clusters with smaller contigs, produced from the same 'trycycler cluster' step and assemblers, have worked just fine.

Any input would be appreciated - thank you for your time!

kclambi1 commented 2 years ago

I have the same issue with the same error. My clusters contain large bacterial contigs, about 5.4 Mb each. Everything runs fine up until the MSA merging step.

sgblanch commented 2 years ago

@kclambi1 asked me to look at this. Root cause appears to be muscle crashing on some chunks. Is muscle running out of memory?

$ muscle -align 000000000000.fasta -output test_msa.fasta

muscle 5.1.linux64 []  8.2Gb RAM, 6 cores
Built Feb 24 2022 03:16:15
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 3 seqs, avg length 151666, max 231004

00:01 8.1Mb  CPU has 6 cores, running 6 threads
00:01 20.8Gb  100.0% Calc posteriors
Segmentation fault

$ echo $?
139
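
For context on the exit status above: shells such as bash report a process killed by signal N as 128 + N, so 139 corresponds to signal 11 (SIGSEGV), which matches the "Segmentation fault" message. Python's subprocess module reports the same event as a negative return code (-11). The following is a minimal sketch of decoding either form. `describe_exit_status` is a hypothetical helper written for illustration; it is not part of Trycycler or MUSCLE, and the "greater than 128" rule is a shell convention rather than a guarantee.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style ($?) or subprocess-style return code.

    Hypothetical helper for illustration only.
    """
    if status == 0:
        return "success"
    if status < 0:
        # subprocess.run(...).returncode is -N when the child died from signal N
        signum = -status
    elif status > 128:
        # bash convention: exit status 128 + N means "killed by signal N"
        signum = status - 128
    else:
        return f"exited with error code {status}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a known signal number on this platform
        name = f"signal {signum}"
    return f"killed by {name}"


print(describe_exit_status(139))
print(describe_exit_status(-11))
```

A caller that launches MUSCLE through subprocess could use a check like this to report a crash directly instead of failing later on a missing key.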

VMJarocki commented 2 years ago

I believe I am also getting this error:

File "/home/user/Data/miniconda3/envs/trycycler/bin/trycycler", line 10, in <module>
    sys.exit(main())
File "/home/user/Data/miniconda3/envs/trycycler/lib/python3.9/site-packages/trycycler/main.py", line 51, in main
    msa(args)
File "/home/user/Data/miniconda3/envs/trycycler/lib/python3.9/site-packages/trycycler/msa.py", line 36, in msa
    merge_pieces(temp_dir, args.cluster_dir, seqs)
File "/home/user/Data/miniconda3/envs/trycycler/lib/python3.9/site-packages/trycycler/msa.py", line 175, in merge_pieces
    aligned_seq_parts[n].append(parts[n].upper())
KeyError: 'C_Utg936'

VMJarocki commented 2 years ago

@kclambi1 @AshSies @sgblanch were any of you able to resolve this issue?

VMJarocki commented 2 years ago

FYI Downgrading Muscle from 5.1 to 3.8 resolved this issue 👍

AshSies commented 2 years ago

Using Muscle 3.8 also resolved the issue on my end.

kclambi1 commented 2 years ago

This is strange, as version 0.5.2 added support for muscle versions 3.x and 5.x.

abomba commented 2 years ago

Muscle crashes when the sequences to align are too long and it runs out of memory. The difference between versions 3 and 5 is that Muscle 5 creates an empty MSA file even when it crashes. Trycycler checks this step only by looking for missing MSA files, so with version 5 everything appears fine. It should also check whether the files are empty. If you want to avoid running out of memory, try setting --lookahead or --step to a smaller value. This should reduce the maximum piece size in the partitioning step and therefore Muscle's memory usage. https://github.com/rrwick/Trycycler/blob/9cc62a521e14a264ae8397277e2f8b09c2988c66/trycycler/msa.py#L121-L131
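
The check described above (treat an empty output file the same as a missing one) can be sketched as follows. This is only an illustration, not Trycycler's actual code: the function name `find_failed_msa_outputs` and the `*_msa.fasta` file names are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def find_failed_msa_outputs(expected_files):
    """Return the expected MSA output files that are missing or empty.

    Hypothetical helper: an empty file is treated as a failure because an
    aligner that crashes may still have created its output file.
    """
    failed = []
    for path in map(Path, expected_files):
        if not path.is_file() or path.stat().st_size == 0:
            failed.append(path)
    return failed


# Small demonstration with one good, one empty and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "000000000000_msa.fasta"      # hypothetical file name
    empty = Path(tmp) / "000000000001_msa.fasta"
    missing = Path(tmp) / "000000000002_msa.fasta"
    good.write_text(">A_contig1\nACGT\n")
    empty.touch()                                    # simulates a crashed run
    failed = find_failed_msa_outputs([good, empty, missing])
    failed_names = [p.name for p in failed]

print(failed_names)
```

Failing early with the list of bad pieces gives a clearer message than the KeyError that otherwise surfaces later in the merging step.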

rrwick commented 2 years ago

Thanks to everyone here for the investigation!

I've just pushed (42be6d7) a small fix to the problem that @abomba pointed out, so Trycycler will now recognise empty MUSCLE files as being problematic. This should at least result in better error messages.

I've also made a note on the Software Requirements page of the wiki to say that MUSCLE v3 is preferred.

rrwick commented 2 years ago

@marade in #42 pointed out that Muscle v5 defaults to lots of threads, and Trycycler runs many instances of Muscle, so this may explain the out-of-memory issues. I've just pushed a fix (886fdd5) which limits Muscle v5 to one thread per instance to help with this, but I still recommend Muscle v3 for speed.
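
To illustrate the one-thread-per-instance idea, here is a minimal sketch of building the MUSCLE command line for either major version. It assumes the v5 options `-align`, `-output` and `-threads` and the v3 options `-in` and `-out`; the function name `muscle_command` is hypothetical, and the code in commit 886fdd5 may differ.

```python
def muscle_command(input_fasta, output_fasta, major_version):
    """Build a MUSCLE command as an argument list for subprocess.

    Hypothetical helper for illustration; not taken from Trycycler.
    """
    if major_version >= 5:
        # v5 uses many threads by default, so cap each instance at one thread
        # when several instances run in parallel.
        return ["muscle", "-threads", "1",
                "-align", str(input_fasta), "-output", str(output_fasta)]
    # v3 syntax
    return ["muscle", "-in", str(input_fasta), "-out", str(output_fasta)]


print(muscle_command("000000000000.fasta", "test_msa.fasta", 5))
print(muscle_command("000000000000.fasta", "test_msa.fasta", 3))
```

With many pieces aligned concurrently, capping threads per instance keeps the total thread count (and the memory each thread allocates) tied to the number of parallel instances rather than multiplied by the core count.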