rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

trycycler reconcile could continue evaluating contigs after one fails #47

Open d4straub opened 1 year ago

d4straub commented 1 year ago

Hi there,

First of all, Trycycler seems like a great tool, thanks for this! Disclaimer: I am using it for the first time.

What I found extremely annoying is that trycycler reconcile stops as soon as the first contig doesn't meet the requirements.

For example, I run trycycler reconcile and it stops, complaining:

#Error: failed to circularise sequence A_tig00000003 for multiple reasons. You
#must either repair this sequence or exclude it and then try running trycycler
#reconcile again.

That is alright, so I remove it and try again; however, I get the same problem with the second contig. I get skeptical, but try a third round after removing the second contig. The third contig also cannot be circularized, and I get the error again. I realize that this part of the genome might be linear (a quick literature search confirms that this might be true), re-add all contigs, add the appropriate option (--linear), run it a fourth time and it gets past this step. However:

#Error: some pairwise identities are below the minimum allowed value of 98.0%.
#Please remove offending sequences or lower the --min_identity threshold and try
#again.

Alright, so I remove those bad contigs and restart it a 5th time, but:

#Error: some pairwise indels are greater than the maximum allowed value of 1000.
#Please remove offending sequences or raise the --max_indel_size threshold and
#try again.

Again, I remove those contigs and restart it a 6th time.

Essentially, I am just wondering whether it wouldn't be more effective to have trycycler reconcile continue after it encounters the first "error" and only stop once it has reported all errors of that kind in one run. I could have seen immediately that none of the contigs are circular and used --linear instead of running it 3 times. I could have immediately removed the contigs with bad pairwise identities and pairwise indels.

Maybe circularisation is required to calculate identities & indels, so it has to stop when circularisation fails, but at least it could report all contigs that fail to circularise? And maybe pairwise identities and pairwise indels could be another block that fails in one go? That would have left me with 3 instead of 6 runs, much better imho.
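
A minimal sketch of what such a combined "pairwise block" could look like. This is not Trycycler's code: the function name, the input layout (a dict of contig pairs mapped to identity and largest indel) and the contig names are all assumptions made for illustration. Only the two thresholds are taken from the error messages quoted above.

```python
from typing import Dict, List, Tuple

# Thresholds as quoted in the error messages above.
MIN_IDENTITY = 98.0      # percent
MAX_INDEL_SIZE = 1000    # bp

# Hypothetical input layout: (contig_a, contig_b) -> (percent identity, largest indel in bp)
PairMetrics = Dict[Tuple[str, str], Tuple[float, int]]


def check_pairwise(metrics: PairMetrics,
                   min_identity: float = MIN_IDENTITY,
                   max_indel_size: int = MAX_INDEL_SIZE) -> List[str]:
    """Return every threshold violation instead of stopping at the first one."""
    problems = []
    for (a, b), (identity, max_indel) in sorted(metrics.items()):
        if identity < min_identity:
            problems.append(
                f"{a} vs {b}: identity {identity:.1f}% is below {min_identity:.1f}%")
        if max_indel > max_indel_size:
            problems.append(
                f"{a} vs {b}: indel of {max_indel} bp exceeds {max_indel_size} bp")
    return problems


# Made-up numbers, purely to show the reporting behaviour.
example = {
    ("A_contig", "B_contig"): (99.9, 12),
    ("A_contig", "C_contig"): (97.2, 1500),
    ("B_contig", "C_contig"): (97.5, 40),
}
problems = check_pairwise(example)
for line in problems:
    print(line)
```

With a report like this, both the identity and the indel offenders would be visible after a single run, so they could be removed together.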

Best, Daniel

rrwick commented 1 year ago

I agree! Re-running trycycler reconcile over and over can be time-consuming, especially for the cluster of chromosomal contigs (larger sequences take longer to run).

I hope to re-engineer Trycycler's high-friction parts (like this) at some point. So I'll keep this issue open as an enhancement request until then.

Ryan

rrwick commented 1 year ago

I've just pushed (52b8c1f) a small improvement to this issue: trycycler reconcile now attempts to circularise all the contigs before it quits with an error message. So if 3 contigs can't be circularised, it will tell you about all three at once.
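
For readers curious about the general pattern, here is a rough sketch of "try everything, then fail once". It is not the code from the commit above: `circularise_all`, `toy_circularise` and the overlap rule are hypothetical stand-ins, and the real circularisation logic is far more involved.

```python
from typing import Callable, Dict


def circularise_all(contigs: Dict[str, str],
                    circularise: Callable[[str], str]) -> Dict[str, str]:
    """Attempt every contig, then raise a single error naming all failures."""
    fixed, failed = {}, []
    for name, seq in contigs.items():
        try:
            fixed[name] = circularise(seq)
        except ValueError as e:
            # Remember the failure and keep going instead of quitting here.
            failed.append(f"{name}: {e}")
    if failed:
        raise RuntimeError(
            f"failed to circularise {len(failed)} sequence(s):\n" + "\n".join(failed))
    return fixed


def toy_circularise(seq: str) -> str:
    """Toy rule for illustration only: require the first and last 3 bases to match."""
    if seq[:3] != seq[-3:]:
        raise ValueError("no overlap between start and end")
    return seq[:-3]  # trim the duplicated end


# Made-up sequences: A can be "circularised" by the toy rule, B and C cannot.
contigs = {"A": "ACGTTTACG", "B": "ACGTTTGGG", "C": "TTTACGAAA"}
try:
    circularise_all(contigs, toy_circularise)
    message = ""
except RuntimeError as e:
    message = str(e)
print(message)
```

The key design choice is that the loop only records failures, and the decision to quit is made once after all contigs have been tried.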

I'll leave this issue open because trycycler reconcile is still not as efficient as it could be.

Ryan