rotary-genomics / spokewrench

Toolkit for manipulating circular DNA sequence elements
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

Ensure proper polishing of end-repaired circular contigs #9

Open jmtsuji opened 1 year ago

jmtsuji commented 1 year ago

Problem description

Currently (version 0.2.0-beta2), flye_end_repair.sh is run after the assembly step to fix any potential issues with the region around the ends of circular contigs. At the end of that script, the contig is rotated to the opposite end of the post-stitch end to try to ensure good quality read mapping over the stitched region (especially the ends of the 50-100 kb region that were attached) during subsequent long read polishing. However, a possible use case exists, for short circular contigs (e.g., 150 kb), where the rotation at the end of flye_end_repair.sh will still keep the 2 stitched regions near the contig ends in the FastA file. In this case, they still won't be polished properly during long read polishing. They will receive at least one additional round of long or short read polishing after further rotation in the downstream steps of the pipeline, but if only short read polishing is done, I'm not confident it will be able to fix any potential insertion/deletion errors that could be introduced during the merge.

Possible solutions

A variety of possible methods could be used to ensure correct polishing of the end-repaired contigs:

Precise method: develop a method to exactly track where the stitched region is within the repaired contig (it is not reported by the circlator merge module). For example, the ends of the 50-100 kb contig made by flye re-assembly could be saved and used via BLAST (or something) to identify the region. Then, it could be ensured that those end regions are in a good place for read mapping via search after the end of end repair. It might also be possible to compare the pre and post repair regions to see what actually changed (this could also be helpful), like by variant calling.

Brute force method: just rotate and polish the end repaired contig more than once (e.g., rotating 1/3 at a time, polishing a total of 3 times) to make sure you polish over the regions. But I wonder if over-polishing could introduce errors in the genome.

Warn method: Because the cases where these errors could occur are rare, the end repair module could just throw a warning when the region assembled by Flye is over half the length of the contig. It could then save a copy of the region assembled by Flye and the original contig ends for the user to manually inspect.

Potential complications

It's not clear to me whether the circlator merge module always stitches in the entire provided contig (re-assembled by Flye) or if it might just add in a subset in some cases. If it might add a sub-region, this will complicate the "precise method" above.

Also, I am not sure what kind of internal checks are done by the circlator merge module to ensure the merge worked. If it carefully checks the way that the stitch was performed (to avoid indel errors at the insertion site), then all this concern about proper polishing might not be needed.

jmtsuji commented 1 year ago

Some personal notes:

Based on looking a bit at the circulator merge module's source code and output logging files, I suspect that it's pretty careful in the final stitch.

The key function looks to be make_circularised_contig. I assume (but haven't yet been able to confirm) that this function takes the start coords of the nucmer alignment to the original contig and then puts the start coords of the nucmer alignment of the re-assembled contig there. Same for the ends. This would mean that there might be max 1-2 indels that could be inserted due to an odd alignment, but I wouldn't normally expect much more. I don't think the module would just stitch the reassembled sequence in as-is without adjusting to the start/end points of the nucmer alignments (especially because circlator gives by default a 1000 bp tolerance for the nucmer alignment start/end relative to the reassembled contig).

Based on looking at that make_circularised_contig function, it also appears as though the finished contig (post repair) might just be the original contig first (in the FastA sequence) followed by the stitch region, based on the bases section of the function. I should check if this is really the order of the output, using real data, at some point.

So in other words, circulator merge might be careful enough that I don't really need to worry about rotating things just right to ensure proper polishing. Given that two rounds of polishing are done, both stitch regions should get polished at least once, even in a "worst case scenario" where they are on exact opposite ends of a contig.

BUT I still need to more carefully check my above assumptions about the polish code (e.g., using a real example) to verify.

LeeBergstrand commented 5 months ago

@jmtsuji Feel free to close if most of this issue has been addressed by your new code.

jmtsuji commented 5 months ago

Closely related to #10 . It should be possible to close this I finish development of a module analogous to circulator merge. In the module I am working on, I hope to track the stitch region carefully and so make sure that it ends up in a region where it can be polished well downstream.