rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Tell users about config parameters available to bypass failed end repair. #80

Closed LeeBergstrand closed 4 months ago

LeeBergstrand commented 10 months ago

Problem

My current genome fails end repair because one of its contigs cannot be circularized:

“ERROR: run_end_repair: 1 contigs could not be circularized. A partial output file including successfully circularized contigs (and no linear contigs) is available at cram/assembly/end_repair/repaired.fasta for debugging. Exiting with error status. See temporary files and verbose logs for more details.”

In the config, there is a keep_unrepaired_contigs parameter which allows the pipeline to proceed:

# By default, end repair fails if a circular contig can't be end-repaired (i.e., if a contig can't be built that spans
#   the ends of the contig). Failing to build the end-spanning contig can happen for several reasons (e.g., if that
#   region is a repeat-rich region), even if the contig is really circular. Set keep_unrepaired_contigs to 'True' to keep
#   the pipeline going even if end repair fails on a contig. If set to 'True', then the original assembled versions of
#   any contigs that fail end repair will be used for downstream steps and will still be treated as circular.
keep_unrepaired_contigs: 'False'

However, this parameter is not mentioned in the error message, which might make the user think the pipeline fails permanently.

Proposed Solution

Mention the flag in the error message.
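A minimal sketch of what the proposed change could look like, assuming the error text is built in Python. The function name format_end_repair_error and the wording of the hint are hypothetical and are not taken from the rotary code base; only the existing error text and the keep_unrepaired_contigs parameter come from this issue.

```python
def format_end_repair_error(n_failed: int, partial_output: str) -> str:
    """Build the end repair failure message and append a hint about the config option.

    Hypothetical helper for illustration only; not part of rotary.
    """
    # Existing message text, as quoted in this issue.
    message = (
        f"ERROR: run_end_repair: {n_failed} contigs could not be circularized. "
        "A partial output file including successfully circularized contigs "
        f"(and no linear contigs) is available at {partial_output} for debugging. "
        "Exiting with error status. "
        "See temporary files and verbose logs for more details."
    )
    # Assumed wording for the new hint that points users at the config parameter.
    hint = (
        "To let the pipeline continue when end repair fails, set "
        "keep_unrepaired_contigs to 'True' in the config."
    )
    return f"{message} {hint}"


if __name__ == "__main__":
    print(format_end_repair_error(1, "cram/assembly/end_repair/repaired.fasta"))
```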

LeeBergstrand commented 10 months ago

@jmtsuji In what situation would end repair fail? Would it fail if it's linear? Can a contig be circularized without end repair?

jmtsuji commented 9 months ago

Good point -- the user should be notified of the option to change the flag. Will work on this when I get the chance.

“In what situation would end repair fail? Would it fail if it's linear? Can a contig be circularized without end repair?”

Flye already reports whether its assembled contigs are circular or linear. However, Flye can make indel errors around the ends of circular contigs (e.g., the two ends might have a gap of around 50 bp between them). To try to fix those errors, the end repair script tries to assemble a short "stitch contig" that spans the two ends of each supposedly circular input contig. It then uses a module of circlator to try to match the "stitch contig" onto the two ends of the input contig, and if a match is found, the circlator module replaces the ends of the input contig with the "stitch contig".
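To make "spans the two ends" concrete: for a contig that is truly circular and has no errors at its ends, the sequence across the junction is simply the tail of the contig followed by its head. The toy function below (end_spanning_region, a hypothetical name) only illustrates that idea; it is not how the end repair script works, since the script assembles the stitch contig rather than slicing it out of the input contig.

```python
def end_spanning_region(sequence: str, flank: int) -> str:
    """Return the last `flank` bases followed by the first `flank` bases.

    For an error-free circular contig, this is the sequence a stitch contig
    would be expected to cover. Toy illustration only.
    """
    if flank <= 0:
        raise ValueError("flank must be positive")
    if 2 * flank > len(sequence):
        raise ValueError("flank is too large for this sequence")
    # Tail of the contig, then head: the junction where the two ends meet.
    return sequence[-flank:] + sequence[:flank]


if __name__ == "__main__":
    print(end_spanning_region("AACCGGTT", 2))
```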

If Flye says the contig is circular but end repair cannot build a stitch contig that spans the two ends of the contig, then the end repair script fails by default. (If Flye says the contig is linear, then the end repair script skips that contig, i.e., the contig just passes through without any error message.) Setting keep_unrepaired_contigs to True means that a contig Flye says is circular will be passed through the end repair script as-is (rather than the script failing) even if end repair can't manage to stitch the contig ends. That contig might be in good enough shape to get circularized properly during polishing downstream, but if it has a large indel, in my experience this cannot get fixed by polishing.
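The behaviour described above can be sketched as a small decision function. This is a simplified model under stated assumptions, not the actual end repair script: the Contig class, EndRepairError, and resolve_contigs are hypothetical names, and the repair step itself is represented only by whether a repaired sequence is available.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Contig:
    """Hypothetical record for one assembled contig."""
    name: str
    circular: bool           # circular/linear status as reported by the assembler
    original: str            # sequence as originally assembled
    repaired: Optional[str]  # end-repaired sequence, or None if stitching failed


class EndRepairError(RuntimeError):
    """Raised when a circular contig cannot be repaired and failures are not tolerated."""


def resolve_contigs(contigs: List[Contig], keep_unrepaired_contigs: bool) -> List[str]:
    """Choose which sequence of each contig is passed to downstream steps."""
    output = []
    failed = []
    for contig in contigs:
        if not contig.circular:
            # Linear contigs skip end repair and pass through unchanged.
            output.append(contig.original)
        elif contig.repaired is not None:
            # Circular contig whose ends were stitched successfully.
            output.append(contig.repaired)
        elif keep_unrepaired_contigs:
            # Stitching failed, but the original assembly is kept as-is.
            output.append(contig.original)
        else:
            failed.append(contig.name)
    if failed:
        raise EndRepairError(
            f"{len(failed)} contigs could not be circularized: {', '.join(failed)}"
        )
    return output
```

With keep_unrepaired_contigs set to False, a single unrepaired circular contig stops the whole run; with it set to True, that contig is passed downstream in its original form.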

I think the current version of the end repair script is too picky when matching the contig ends to the stitch contig, and I want to address this in the custom code I am writing (stitch.py), which will eventually replace the circlator code as per rotary-genomics/rotary-utils#8 and rotary-genomics/rotary-utils#10. For now, tuning the end repair params in the config might also help.

LeeBergstrand commented 4 months ago

keep_unrepaired_contigs is now set to 'True' by default (previously 'False').