isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. Paper: http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116

Note: this was the original repository and it is no longer officially maintained. Please use the new official repository at https://github.com/lbcb-sci/racon.

racon introduces large (50-250 base) deletions into polished canu assembly consensus #152

Open · tpshea2 opened this issue 4 years ago

tpshea2 commented 4 years ago

Is there any previously described racon polishing behavior where racon, using only ONT data, introduces deletions into the polished consensus? If so, is there a suggested racon parameter (or minimap2 alignment parameter, prior to racon) to help avoid such issues?

The input PAF file was created with minimap2 -x map-ont.

We have a completed reference (the same strain of Klebsiella) to compare against.

The original Canu assembly has a ~5.3 Mb contig which aligns to the reference chromosome end to end at 99.54% identity (nucmer output shown below):

[image: nucmer alignment of the original Canu contig to the reference chromosome]

After a round of racon polishing there is an overall increase in base accuracy, but also a number of introduced deletions (deletion sizes in the far-right column):

[image: nucmer alignment after racon polishing, with deletion sizes in the far-right column]

Inspecting one of these regions, we saw no obvious problem in the original Canu contig, so it is not clear why racon would have introduced such deletions.

We find that this occurs much more often (~15 times in the chromosome contig) when we provide an estimated ~25x genome coverage of ONT data. When we polish with 50x-100x of ONT data, the incidence of introduced deletions falls to around 2-5. We also see this when we polish with R10 as well as R9 data.

Any insight is much appreciated. Thank you.

rvaser commented 4 years ago

Hi Terrance, quite interesting. Can you please run again with the --no-trimming parameter? You will need the latest Racon version from https://github.com/lbcb-sci/racon.

I would appreciate it if you could share your data so we can investigate in depth what is going on.

Best regards, Robert

tpshea2 commented 4 years ago

Thank you so much for the very quick reply.

I ran the latest version, 1.4.11, both with and without the --no-trimming option. The 1.4.11 default (without --no-trimming) largely matched my initial, previously reported run (which used an older version, 1.3.1). But with 1.4.11 and --no-trimming, no such large deletions were introduced. Here is how the racon-polished consensus (with --no-trimming) aligns to the reference chromosome:

[image: nucmer alignment of the --no-trimming racon consensus to the reference chromosome]

In all racon runs I was using 6 threads.

So it seems the --no-trimming option has solved our specific issue, and we will make sure to use this version and this option. Do you still need the data to check anything on your end?

Thank you again for your help.

rvaser commented 4 years ago

I suspect that the 25x data is somehow badly distributed across the genome, or that something else is wrong. You can still send me the data, if it is not a problem :)

rvaser commented 4 years ago

I looked a bit into what is happening with the underlying POA graphs. The breaks usually occur in windows that (I suspect) overlap repetitive regions, where part of the window has high coverage while the rest has coverage equal to the median of the chromosome. The heuristic then trims out the lower-coverage part, which results in some windows losing a lot of bases from one side (up to 200 bp). The best results I got on the data you sent me came from replacing https://github.com/lbcb-sci/racon/blob/master/src/window.cpp#L136 with the following:

} else if (consensus_.size() - (end - begin) < 10) {

This prohibits big deletions while trimming, and results in one alignment with 0.9973 accuracy (default Racon yields 28 break points and 0.997 accuracy, while the --no-trimming option gives one alignment with 0.9964 accuracy).
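For context, here is a simplified standalone sketch of where that condition sits in the trimming step (a paraphrase under my own assumptions, with names mirroring window.cpp; it is not the exact source):

    // Simplified paraphrase of the window trimming heuristic with the proposed
    // guard in place; names follow window.cpp, other details are assumptions.
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    void trim_consensus(std::string& consensus_,
                        const std::vector<uint32_t>& coverages,
                        uint32_t coverage_threshold) {
        // Scan from both ends for the first base with sufficient coverage.
        int32_t begin = 0;
        int32_t end = static_cast<int32_t>(consensus_.size()) - 1;
        while (begin < static_cast<int32_t>(consensus_.size()) &&
               coverages[begin] < coverage_threshold) {
            ++begin;
        }
        while (end >= 0 && coverages[end] < coverage_threshold) {
            --end;
        }
        if (begin >= end) {
            std::fprintf(stderr, "warning: window coverage looks suspicious\n");
        } else if (consensus_.size() - (end - begin) < 10) {
            // The guard: trim only when fewer than ~10 bases would be removed
            // in total, which prevents the large one-sided deletions.
            consensus_ = consensus_.substr(begin, end - begin + 1);
        }
        // Otherwise the consensus is left untrimmed.
    }

The effect is that a window with a long low-coverage flank is now left untrimmed instead of losing up to 200 bp from one side.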

We will see what the best way is to replace the current window-trimming heuristic. Until then, you can use either --no-trimming or modify the code as above.

Best regards, Robert

tpshea2 commented 4 years ago

Robert, thank you for taking this very close look and for the specific suggestions. Your responsiveness is very much appreciated.

rvaser commented 4 years ago

The average read length of your downsampled read set is 1260 bp, which is quite short for ONT data (65% of reads are shorter than 500 bp, and 83% are shorter than 1000 bp). The problem with such short reads is that window coverage can be quite uneven, especially in repetitive regions. This is the reason we turned off the trimming method for Illumina data. Below is the pile-o-gram created from the overlaps of your read set with the Canu contig. You can see a number of spikes, two of which are enlarged below as well; the deletions you encountered are tied to those spikes.

Attachments: chrom.pdf, first.pdf, second.pdf
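As a side note, read-length statistics like the ones quoted above can be reproduced with a small standalone program (a minimal sketch, not part of racon; it assumes the read lengths have already been extracted from the FASTQ):

    // Minimal sketch: fraction of reads shorter than given length cutoffs.
    // Not part of racon; lengths would be parsed from your read set first.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        // Placeholder values; replace with the actual read lengths.
        std::vector<int> lengths = {480, 1260, 350, 900, 2100};
        std::sort(lengths.begin(), lengths.end());
        for (int cutoff : {500, 1000}) {
            // lower_bound yields the count of reads strictly shorter than cutoff.
            auto shorter = std::lower_bound(lengths.begin(), lengths.end(), cutoff)
                           - lengths.begin();
            std::printf("%.1f%% of reads are shorter than %d bp\n",
                        100.0 * shorter / lengths.size(), cutoff);
        }
        return 0;
    }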

Best regards, Robert