lbcb-sci / raven

De novo genome assembler for long uncorrected reads
MIT License

Racon rounds shrink target contig #21

Closed joanmarticarreras closed 4 years ago

joanmarticarreras commented 4 years ago

Hi!

First, congrats on your work on Ra and Raven. I've been following and testing your tools on Nanopore reads for quite a while.

I work in viral genomics (dsDNA viruses), studying new viruses, making reference genomes, their diversity, repeat distribution, etc.

I've been testing Raven for quite a while and, compared to the alternatives, it does a great job at assembling this type of data! However, I realized that the final contig size is quite sensitive to read-length pre-filtering and to the number of Racon iterations. Filtering to intermediate sizes (>5-10 kb) yields almost perfect contiguity. Accepting shorter reads adds too much diversity (especially in the repeats) and contiguity drops. If the filter is stricter, there is not enough data to close the genome.
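
For reference, the pre-filtering I mention is just a length plus mean-quality cutoff (>Q12 and >10 kbp in the runs below); a minimal C++ sketch of the idea, where the file name is a placeholder and taking the arithmetic mean of Phred scores is a simplification (dedicated tools average error probabilities instead):

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

// Minimal sketch of a length + mean-quality FASTQ read filter (>Q12, >10 kbp).
int main() {
  std::ifstream in("reads.fastq");        // placeholder input path
  const std::size_t kMinLength = 10000;   // >10 kbp
  const double kMinMeanQuality = 12.0;    // >Q12

  std::string header, sequence, separator, quality;
  while (std::getline(in, header) && std::getline(in, sequence) &&
         std::getline(in, separator) && std::getline(in, quality)) {
    double sum = 0.0;
    for (char c : quality) {
      sum += c - 33;  // Phred+33 encoding
    }
    double mean_quality = quality.empty() ? 0.0 : sum / quality.size();
    if (sequence.size() > kMinLength && mean_quality > kMinMeanQuality) {
      std::cout << header << '\n' << sequence << '\n'
                << separator << '\n' << quality << '\n';
    }
  }
  return 0;
}
```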

Interestingly, though, increasing the number of Racon iterations tends to shrink the target contig. The exact size is not known, but it is thought to be between 132-150 kbp (experimental data from the '80s). Around 110-120 kb should be unique sequence, followed by tandem repeats of 1.5 kb repeated 15-20 times.
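
(As a sanity check, assuming the expected size is simply the unique region plus the repeat array, the numbers agree: 110 kb + 15 × 1.5 kb = 132.5 kb at the low end, and 120 kb + 20 × 1.5 kb = 150 kb at the high end, matching the 132-150 kbp range.)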

Here is some data from Raven v1.1.10, with Nanopore reads filtered at >Q12 + >10 kbp:

| Racon rounds | Target contig length (bp) |
|---|---|
| 0 | 132197 |
| 2 | 132379 |
| 4 | 131799 |
| 5 | 130988 |
| 10 | 130082 |
| 20 | 128028 |
| 30 | 125151 |
| 40 | 122883 |
| 50 | 120623 |
| 80 | 106626 |
| 100 | 96830 |

Here is the same for Raven v1.1.10, with Nanopore reads filtered at >Q12 + >5 kbp:

| Racon rounds | Target contig length (bp) |
|---|---|
| 2 | 127475 |
| 10 | 124968 |
| 20 | 121798 |
| 30 | 120227 |

What do you think might be the phenomenon behind it?

Joan

rvaser commented 4 years ago

Hi Joan, it is not odd that the size of the contig shrinks through iterations, but here it looks like it continues to shrink quite a bit. It might be due to the trimming heuristic at the end, which removes bases from both sides of each consensus window until the base coverage hits half the number of reads inside the window. You can turn it off with the option --no-trimming in Racon, but to disable this in Raven you have to change true to false here, and then recompile.
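
To make the idea concrete, here is a rough sketch of the heuristic (illustrative only, not Racon's actual code; the function name and signature are made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Drop low-coverage bases from both ends of a window consensus until the
// per-base coverage reaches half the number of reads in the window.
std::string TrimWindowConsensus(const std::string& consensus,
                                const std::vector<std::uint32_t>& coverage,
                                std::uint32_t num_window_reads) {
  // coverage[i] is the number of reads supporting consensus[i].
  std::uint32_t threshold = num_window_reads / 2;
  std::size_t begin = 0;
  std::size_t end = consensus.size();
  while (begin < end && coverage[begin] < threshold) {
    ++begin;  // trim from the left
  }
  while (end > begin && coverage[end - 1] < threshold) {
    --end;  // trim from the right
  }
  return consensus.substr(begin, end - begin);
}
```

When running Racon standalone, the flag goes directly on the command line, e.g. `racon --no-trimming reads.fastq overlaps.paf contigs.fasta > polished.fasta` (file names are placeholders).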

On the other hand, I think 2 iterations are sufficient for consensus, after which you can just run Medaka to reach higher accuracy.

Best regards, Robert

joanmarticarreras commented 4 years ago

Thanks Robert for replying so fast. I will give it a try.

Thanks for the tip. Still, you can see how the length also varies a lot depending on which set of reads I start from. Any ideas?

Joan

rvaser commented 4 years ago

What is the contig length when you have 0 Racon iterations with >5 kbp reads? The variation is probably due to different reads constituting the layout sequence.

joanmarticarreras commented 4 years ago

Using >Q12, >10 kb reads:

| Racon rounds | Target contig length (bp) |
|---|---|
| 0 | 132197 |
| 2 | 133370 |
| 5 | 133169 |
| 10 | 133234 |
| 20 | 133132 |
| 30 | 133161 |

Using >Q12, >5 kb reads:

| Racon rounds | Target contig length (bp) |
|---|---|
| 0 | 129666 |
| 2 | 130134 |
| 5 | 130343 |
| 10 | 130405 |
| 20 | 130346 |
| 30 | 130532 |

After recompiling and using --no-trimming in Racon, the target contig size is a bit more stable, especially for the >5 kb dataset. Will this be a better estimate of the genome, or would you recommend trimming the ends after all? I am afraid that 5 kb might be too short for a reliable estimate of the tandem repeats, too. How sensitive are Raven and Racon to tandem repeats?

Joan

rvaser commented 4 years ago

For this dataset, I think I would use the --no-trimming option. You can check out https://github.com/isovic/racon/issues/126, where a similar discussion took place (some users had problems with telomeres, while another was dealing with viruses). I am not sure how sensitive Racon is to tandem repeats; I suppose the longer the average read length, the better.

Sorry for not replying earlier! Best regards, Robert