lbcb-sci / raven

De novo genome assembler for long uncorrected reads

Racon rounds shrink target contig #21

Closed · joanmarticarreras closed this issue 4 years ago

joanmarticarreras commented 4 years ago

Hi!

First, congrats on your work on Ra and Raven. I've been following and testing your work on nanopore reads for quite a while.

I work in viral genomics (dsDNA viruses), studying new viruses, building reference genomes, and characterizing their diversity, repeat distribution, etc.

I've been testing Raven for quite a while and, compared to the rest, it does a great job at assembling this type of data! However, I realized that the final contig size is quite sensitive to size pre-filtering and to the number of Racon iterations. Filtering to intermediate sizes (>5-10 kb) yields almost perfect contiguity. Accepting shorter sequences adds too much diversity (especially in the repeats) and contiguity drops; if the filter is stricter, there is not enough data to close the genome.

Interestingly, though, increasing the number of Racon iterations tends to shrink the target contig. The exact size is not known, but it is thought to lie between 132 and 150 kbp (experimental data from the '80s). Around 110-120 kb should be unique sequence, followed by a 1.5 kb tandem repeat in 15-20 copies (110-120 kb plus 15-20 × 1.5 kb gives the 132-150 kbp range).

Here is some data (Raven v1.1.10, nanopore reads filtered at >Q12 and >10 kbp):

| Racon rounds | Target contig length (bp) |
|---|---|
| 0 | 132197 |
| 2 | 132379 |
| 4 | 131799 |
| 5 | 130988 |
| 10 | 130082 |
| 20 | 128028 |
| 30 | 125151 |
| 40 | 122883 |
| 50 | 120623 |
| 80 | 106626 |
| 100 | 96830 |

And the same with nanopore reads filtered at >Q12 and >5 kbp:

| Racon rounds | Target contig length (bp) |
|---|---|
| 2 | 127475 |
| 10 | 124968 |
| 20 | 121798 |
| 30 | 120227 |
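
For reference, the pre-filtering is just a mean-quality and length cutoff; roughly, in Python, something like this (file names are placeholders, and dedicated tools such as NanoFilt average error probabilities rather than raw Phred scores):

```python
# Rough sketch of the pre-filtering used above: keep reads longer than
# 10 kb with mean Phred quality above 12. File names are placeholders.

MIN_QUAL = 12.0    # >Q12
MIN_LEN = 10_000   # >10 kbp

def mean_phred(qual):
    # Phred+33 encoding: ASCII value minus 33 is the per-base quality.
    return sum(ord(c) - 33 for c in qual) / len(qual)

with open("reads.fastq") as fin, open("filtered.fastq", "w") as fout:
    while True:
        record = [fin.readline() for _ in range(4)]  # header, seq, '+', quality
        if not record[0]:
            break  # end of file
        seq = record[1].rstrip("\n")
        qual = record[3].rstrip("\n")
        if len(seq) > MIN_LEN and mean_phred(qual) > MIN_QUAL:
            fout.writelines(record)
```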

What do you think might be causing this?

Joan

rvaser commented 4 years ago

Hi Joan, it is not odd that the size of the contig shrinks through iterations, but here it does look like it keeps shrinking quite a bit. It might be due to the trimming heuristic at the end, which removes bases from both sides of each consensus window until the base coverage hits half the number of reads inside the window. You can turn it off with the --no-trimming option in Racon, but to disable it in Raven you have to change true to false here, and then recompile.
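
In pseudocode, the heuristic behaves roughly like this (a simplified sketch, not Racon's actual implementation; `coverage[i]` stands for the number of reads supporting base i of one window):

```python
# Simplified sketch of the trimming heuristic described above, not Racon's
# actual implementation. `coverage[i]` is the number of reads supporting
# base i of one consensus window.

def trim_window(consensus, coverage, num_reads):
    threshold = num_reads / 2
    begin, end = 0, len(consensus)
    # Drop bases from the left while their coverage is below half the reads.
    while begin < end and coverage[begin] < threshold:
        begin += 1
    # Same from the right.
    while end > begin and coverage[end - 1] < threshold:
        end -= 1
    return consensus[begin:end]
```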

On the other hand, I think 2 iterations are sufficient for consensus, after which you can just run Medaka to reach higher accuracy.
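
Roughly, that pipeline would look like this (a sketch: Raven's --polishing-rounds option controls the internal Racon iterations, medaka_consensus is Medaka's polishing wrapper, and file names are placeholders):

```python
# Sketch of the suggested workflow: assemble with 2 internal Racon rounds,
# then polish with Medaka. File names and the output directory are
# placeholders; check `raven --help` and `medaka_consensus -h` for details.
import subprocess

# Raven prints the assembly to stdout; --polishing-rounds sets the number
# of Racon iterations it runs internally.
with open("assembly.fasta", "w") as out:
    subprocess.run(["raven", "--polishing-rounds", "2", "filtered.fastq"],
                   stdout=out, check=True)

# medaka_consensus maps the reads back to the draft and polishes it.
subprocess.run(["medaka_consensus", "-i", "filtered.fastq",
                "-d", "assembly.fasta", "-o", "medaka_out"], check=True)
```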

Best regards, Robert

joanmarticarreras commented 4 years ago

Thanks Robert for replying so fast. I will give it a try.

Thanks for the tip. Still, you can see that the length also varies quite a lot depending on which set of reads I start from. Any ideas?

Joan

rvaser commented 4 years ago

What is the contig length when you have 0 Racon iterations with >5kbp reads? This is probably due to different reads constituting the layout sequence.

joanmarticarreras commented 4 years ago

Using the >Q12, >10 kb reads:

| Racon rounds | Target contig length (bp) |
|---|---|
| 0 | 132197 |
| 2 | 133370 |
| 5 | 133169 |
| 10 | 133234 |
| 20 | 133132 |
| 30 | 133161 |
Using the >Q12, >5 kb reads:

| Racon rounds | Target contig length (bp) |
|---|---|
| 0 | 129666 |
| 2 | 130134 |
| 5 | 130343 |
| 10 | 130405 |
| 20 | 130346 |
| 30 | 130532 |

After recompiling and using --no-trimming in Racon, the target contig size is a bit more stable, especially for the >5 kb dataset. Will this be a better estimate of the genome, or would you recommend trimming the ends anyway? I am also afraid that 5 kb might be too short to reliably estimate the tandem repeats. How sensitive are Raven and Racon to tandem repeats?

Joan

rvaser commented 4 years ago

For this dataset, I think I would use the --no-trimming option. You can check out https://github.com/isovic/racon/issues/126, where a similar discussion took place (some users had problems with telomeres, while another was dealing with viruses). I am not sure how sensitive Racon is to tandem repeats; I suppose the longer the average read length, the better.

Sorry for not replying earlier! Best regards, Robert