lbcb-sci / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads
MIT License

Recommended parameters for CLR vs ONT #31

Closed SHuang-Broad closed 3 years ago

SHuang-Broad commented 4 years ago

Hi,

I'm wondering if you have any recommendations on parameters for CLR versus ONT data.

I'm asking because I've observed great improvements from Racon on ONT drafts, but when I polished CLR drafts with the same default parameters, the results weren't much better.

Of course it could be that the parameters used for the overlap generation weren't optimal, but I'd like to see if you already have some recommendations.

Thanks!

Steve

rvaser commented 4 years ago

Hi Steve, sorry for asking, but what is CLR an abbreviation for?

Best regards, Robert

SHuang-Broad commented 4 years ago

Sorry Robert. What I meant was the continuous long read (CLR) protocol from PacBio. Compared to CCS/HiFi from PacBio, it has a higher error rate.

rvaser commented 4 years ago

I do not think that I have a special set of parameters that should work better on CLR data. You could maybe try the old Racon parameters (5, -4, -8) or (2, -5, -2) for match, mismatch and gap, respectively. Usually we do not change mapper parameters.
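For anyone wanting to try the two scoring triples mentioned above, here is a minimal Python sketch that builds the Racon command line for each one. It assumes the `-m`, `-x`, `-g` and `-t` options of the Racon CLI (match, mismatch, gap, threads) and the positional order reads, overlaps, draft. The file names, the labels `set_a`/`set_b` and the helper `build_racon_cmd` are hypothetical and only for illustration.

```python
from typing import Dict, List, Tuple

# Scoring triples (match, mismatch, gap) quoted in this thread.
# The labels are hypothetical; they are not Racon preset names.
PARAM_SETS: Dict[str, Tuple[int, int, int]] = {
    "set_a": (5, -4, -8),
    "set_b": (2, -5, -2),
}


def build_racon_cmd(
    reads: str,
    overlaps: str,
    draft: str,
    scores: Tuple[int, int, int],
    threads: int = 8,
) -> List[str]:
    """Return an argv list for one Racon polishing round.

    Assumes the Racon options -m (match), -x (mismatch), -g (gap)
    and -t (threads), followed by reads, overlaps and draft paths.
    """
    match, mismatch, gap = scores
    return [
        "racon",
        "-t", str(threads),
        "-m", str(match),
        "-x", str(mismatch),
        "-g", str(gap),
        reads,
        overlaps,
        draft,
    ]


if __name__ == "__main__":
    # Hypothetical input paths, shown only to print the commands.
    for label, scores in PARAM_SETS.items():
        cmd = build_racon_cmd("reads.fastq", "overlaps.paf", "draft.fasta", scores)
        print(label, " ".join(cmd))
```

Racon writes the polished sequences to standard output, so each command would typically be run with its output redirected to a separate file per parameter set before comparing the results.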

SHuang-Broad commented 4 years ago

Thanks Robert! I'll try and report back.

SHuang-Broad commented 4 years ago

Hi Robert,

While using the parameter set (5, -4, -8) with GPU enabled and splitting the work into three chunks using the Python script, I received the following warnings:

[racon::Window::generate_consensus] warning: contig 41 might be chimeric in window 66276!
[racon::Window::generate_consensus] warning: contig 41 might be chimeric in window 66277!
[racon::Window::generate_consensus] warning: contig 41 might be chimeric in window 66278!
[racon::Window::generate_consensus] warning: contig 41 might be chimeric in window 66279!

How should I interpret these warnings, and what should I watch out for in the output?

Thanks!

rvaser commented 4 years ago

Hi Steve, this warning means that the window trimming heuristic could not trim both sides of the consensus. The method searches for the first and last consensus bases that are covered by at least half of the reads inside the window. The warning arises when the coverage of the window is uneven, e.g. in a repetitive region with a lot of short reads mapping to it, or in a chimeric region. Usually the warning is a false positive, so you can ignore it.

Best regards, Robert

P.S. We are testing a new heuristic which should work better for uneven coverage. Also, we are trying to reinstate overlapping windows which will make the trimming method obsolete.
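The trimming heuristic described above can be illustrated with a short Python sketch. This is not Racon's actual code: the function name `trim_span`, the per-base coverage list, and the rule that a missing or degenerate span corresponds to the "might be chimeric" warning are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def trim_span(coverage: Sequence[int], num_reads: int) -> Optional[Tuple[int, int]]:
    """Find the first and last consensus positions that are well covered.

    coverage[i] is the number of window reads supporting consensus base i.
    A base counts as well covered when at least half of the reads in the
    window support it. Returns (begin, end), inclusive, or None when no
    usable span exists (assumed here to be the case that triggers the
    warning).
    """
    threshold = num_reads / 2
    covered = [i for i, c in enumerate(coverage) if c >= threshold]
    if not covered or covered[0] >= covered[-1]:
        return None
    return covered[0], covered[-1]


if __name__ == "__main__":
    # Even coverage: both ends can be trimmed to the well-covered core.
    print(trim_span([1, 2, 5, 6, 6, 5, 2, 1], num_reads=10))
    # Uneven or low coverage: no span found, so a warning would be raised.
    print(trim_span([1, 1, 1, 1], num_reads=10))
```

With uneven coverage, for example many short reads piling up on a repeat, few or no positions reach the threshold, which is consistent with the warning usually being a false positive rather than a sign of a real chimera.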

SHuang-Broad commented 4 years ago

Hi Robert,

Thanks for the comment on the warnings!

I tried both "old" parameter sets on my draft, one round each. The sad fact, based on QUAST, is that the indel and mismatch rates deteriorate with both the new and the old parameter sets, but the old default parameter set (2, -5, -2) is not as bad as the other two.

Not good news I know, but this is just one data point.

But one trend we do observe is that results are usually quite good with ONT data.

Thanks for all the help!

rvaser commented 4 years ago

Meaning that (2, -5, -2) works well on both PB and ONT data?

SHuang-Broad commented 4 years ago

Sorry, the new parameters tend to work well on ONT data.

None of the parameter presets seems to improve the PacBio data I worked on, and (2, -5, -2) seems to be the one that caused the smallest drop.

The metrics I used are the indel and mismatch rates from QUAST.

If you are interested, I believe I can email you the QUAST tables (we don't own the data, so I cannot share the raw data).

rvaser commented 4 years ago

If it is not a problem, I would like to have a look at the QUAST results :)

SHuang-Broad commented 4 years ago

Tables sent via email.