Closed SHuang-Broad closed 4 years ago
Hello Steve, this is the default behaviour, if any contig was not polished it will not be printed. You can disable this and get all input contigs as output with option:
-u, --include-unpolished
output unpolished target sequences
Best regards, Robert
Ah, that's what the -u
is for. I thought it was to keep the input fasta untouched and honestly was a little confused.
Thanks Robert!
@rvaser This has just confused me too. Rather than open a new issue, I thought it would be better to ask for clarification here. Does this mean that the default behaviour to "drop" any contigs that have no need for error-correction? Or is there something "wrong" with the "unpolished target sequences" such that Racon decides they are of low quality, i.e. they have poor read coverage.
If it's the former, you really need to state this boldly somewhere - I'd actually strongly recommend changing the default behaviour. Everyone I know (including myself until this moment) assumed that the output generated by Racon was the full error-corrected assembly!
@rvaser I just found this as well, and would like some clarification on how RaCon decides what to polish. In my case, I have long reads which cover a region which is about 10% of the reference genome at high coverage. I am assuming that if the reference genome file is formatted as a single contig, then the whole contig will be output, but only 10% will have been polished. Is that correct? Is there any way to determine which regions were polished and which weren't (i.e. which bases are a result of RaCon consensus and which were taken directly from reference)?
With another experiment (this one had low coverage), RaCon output an assembly which was perfectly accurate when compared to the reference. Incidentally, the reference was used as the <target sequences>
parameter, so I suspect it either wasn't polished or was only polished minimally. Where can we find a list of heuristics which are being used when polishing? Is there a coverage cut-off below which RaCon doesn't perform polishing? I didn't find anything in the paper; it would be great if this information could be added to the README or a Wiki. Thanks for the help!
@cabbagesofdoom, by default Racon does not output contigs which were not polished, meaning not a single window of 500bp had two reads mapped to it. Hence contigs which either of low quality, or contained in other contigs, or there are no reads supporting them, are dropped. You can get them with option -u
.
@TimD1 yes, the 10% of the reference will be polished and everything else will be copied. Unfortunately, you can't get the information which windows were polished from the executable as is. Each sequence in the resulting file has tag XC:f:<percentage>
which denotes the percentage of windows that were polished. For the coverage cut-off, there need to be at least 2 reads covering a window for the consensus to be called. Bellow I have provided you with modifications to the source code so you can get the information about polished regions.
Replace https://github.com/lbcb-sci/racon/blob/master/src/polisher.cpp#L513 with:
bool is_polished = thread_futures[i].get();
if (is_polished) {
++num_polished_windows;
std::cerr << "Sequence " << windows_[i]->id() << " is polished between " << window_length_ * windows_[i]->rank() << " - " << window_length_ * (windows_[i]->rank() + 1) << std::endl;
}
Do not forget to recompile with make
.
Great, thanks for the quick response!
Thanks, @rvaser. I think part of the problem may be the meaning of "polished". I interpreted this to mean "corrected" (i.e. altered), rather than "considered for correction" (i.e. supported by sufficient reads). I was worried that unpolished included both those too poor to be polished and those with no bases corrected by polishing. Your reply indicates that this is not the case, in which case all the affected contigs would have got dropped later anyway. Nevertheless, I would argue this is assembly "filtering" not assembly "polishing" and the behaviour should be made more explicit in the introduction and basic use examples of the tool. Cheers.
For my test case, I am seeing one contig missing in the polished assembly.
Is this expected?