isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

Question: use Racon to polish? #9

Open rrwick opened 7 years ago

rrwick commented 7 years ago

I was curious if you thought Racon would be suitable as a long read polisher. E.g. if one had a completed assembly which is 99.9+% correct, could Racon use long read alignments to help fix up the remaining few SNPs and indels?

I'm thinking of something like GenomicConsensus or Nanopolish, but those are both specific to the sequencing technology. Since Racon takes aligned FASTQ as input, I thought it could perhaps be a long read polisher that is sequencing platform-agnostic.

If you do think it's suitable for this sort of thing, then I'll also throw in a feature request: optionally outputting a variants file which shows the differences between the raw input and the consensus output. That might not make sense for a Miniasm input (there'd be too many), but it could be very helpful when polishing a mostly-finished genome.

rvaser commented 7 years ago

Hello Ryan,

Racon was primarily intended as a consensus tool, not a polisher. Without signal level information it would be hard to reach the accuracy of other polishers. We have plans to add some statistics on differences between the backbone and consensus sequence. Maybe you can try to polish your data when the feature is implemented. I'll mark this issue as an enhancement.

Thanks for considering Racon and sorry for the delayed answer.

Best regards, Robert

rrwick commented 7 years ago

Sounds good - I'll give it a try when it's ready. Thanks!

xthua commented 7 years ago

Hi, I am trying to use Racon to polish a genome sequence with PacBio Sequel data. However, PacBio Sequel provides its reads as BAM files. When I used "samtools bam2fq" to convert the BAM to FASTQ, the sequences were fine, but the quality values were all "!". Is it possible to use PacBio Sequel data to polish a genome sequence?

Best wishes,

Xiaoting

rvaser commented 7 years ago

Hello Xiaoting,

As stated above, Racon was primarily intended as a consensus tool, not a polisher. Nevertheless, you can try to use your PacBio Sequel data to polish your assembly, but be sure to disable quality filtering with the parameter '--bq -1', as all of your qualities have the same value.
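One way to confirm that a converted FASTQ really carries no usable qualities (and that quality filtering should therefore be disabled) is to look at the distinct quality characters in the first few thousand reads. The following is a minimal sketch, not part of Racon; the function names and the `max_reads` cutoff are hypothetical, and it assumes four-line FASTQ records with unwrapped sequences.

```python
import io
from itertools import islice


def quality_chars(fastq_handle, max_reads=10000):
    """Return the set of quality characters seen in the first max_reads records."""
    seen = set()
    # Assumption: every record spans exactly four lines (no wrapped sequences),
    # so the quality string is always the fourth line of each group.
    for i, line in enumerate(islice(fastq_handle, 4 * max_reads)):
        if i % 4 == 3:
            seen.update(line.rstrip("\r\n"))
    return seen


def has_uniform_quality(fastq_handle, max_reads=10000):
    """True if the sampled reads use exactly one quality character."""
    return len(quality_chars(fastq_handle, max_reads)) == 1


# Small in-memory example: two reads whose qualities are all "!".
demo = io.StringIO("@r1\nACGT\n+\n!!!!\n@r2\nGG\n+\n!!\n")
uniform = has_uniform_quality(demo)
```

For a compressed file, the handle could be opened with `gzip.open("long_reads.fastq.gz", "rt")` from the standard library. If the check returns true, the qualities carry no information and the '--bq -1' setting mentioned above applies.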

Best regards, Robert

rrwick commented 7 years ago

FYI, I've found it useful to use MUMmer to extract the specific changes that Racon makes, so I can evaluate them individually:

minimap -t 24 assembly.fasta long_reads.fastq.gz | racon -t 24 long_reads.fastq.gz - assembly.fasta racon_assembly.fasta
nucmer -p nucmer assembly.fasta racon_assembly.fasta
show-snps -C -T -r nucmer.delta

This reports Racon's changes in a table. You can exclude indels with the -I option in show-snps. I'm doing this in unicycler_polish.py (still a work in progress).
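For evaluating the changes individually, the table can be split into substitutions and indels with a few lines of Python. This is only a sketch and is not taken from unicycler_polish.py; the function name is hypothetical. It assumes the tab-delimited layout produced by `show-snps -T`, where the first four columns are the reference position, reference base, query base and query position, and where a `.` in a base column marks an indel.

```python
def classify_changes(show_snps_lines):
    """Split show-snps -T rows into substitutions, insertions and deletions."""
    changes = {"substitution": [], "insertion": [], "deletion": []}
    for line in show_snps_lines:
        fields = line.rstrip("\r\n").split("\t")
        # Header lines do not start with an integer position, so skip them.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        ref_pos, ref_base, qry_base, qry_pos = fields[:4]
        if ref_base == ".":
            kind = "insertion"  # base present only in the Racon output
        elif qry_base == ".":
            kind = "deletion"  # base present only in the input assembly
        else:
            kind = "substitution"
        changes[kind].append((int(ref_pos), ref_base, qry_base, int(qry_pos)))
    return changes


# Illustrative rows only (made-up positions and contig names).
example = [
    "assembly.fasta racon_assembly.fasta",
    "NUCMER",
    "",
    "[P1]\t[SUB]\t[SUB]\t[P2]\t[BUFF]\t[DIST]\t[FRM]\t[TAGS]",
    "120\tA\tG\t120\t50\t120\t1\t1\tctg1\tctg1",
    "400\t.\tT\t401\t10\t400\t1\t1\tctg1\tctg1",
    "950\tC\t.\t950\t30\t950\t1\t1\tctg1\tctg1",
]
result = classify_changes(example)
```

The lines could come from running `show-snps -C -T -r nucmer.delta` and reading its standard output.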

This process (Racon -> MUMmer -> SNP table) solves the problem I originally raised in this issue. So as far as I'm concerned, you can close this issue (or keep it open if you still want to implement some kind of variant table).

rvaser commented 7 years ago

Hello Ryan, thank you for your update. It is quite a handy approach. I will leave the enhancement open until we decide whether to implement it or not.

Best regards, Robert

mictadlo commented 7 years ago

Hi, I ran racon twice after miniasm and then I ran Pbjelly, a scaffolder using PacBio reads. Does it make sense to run Racon twice again on the scaffolder output?

Thank you in advance.

Michal

rvaser commented 7 years ago

Hello Michal, you can run Racon again to increase the accuracy in bridging areas between contigs. It would be best to run Racon with the subset of reads which map only to those new areas in order to decrease the running time, although that is not necessary.

Best regards, Robert

mictadlo commented 7 years ago

Hi Robert, how do I create a subset of reads which map only to bridging areas between contigs in order to decrease the running time?

Thank you in advance.

Michal

rvaser commented 7 years ago

You can find them by checking the *.paf file of the second Racon iteration and looking for reads that did not map to any of the contigs and those that are not contained in any of them (i.e. they have a prefix-suffix overlap with one of the contigs). You would have to write a script for that, which might take some time, so I would suggest running 2 Racon iterations with the whole read set and the output of Pbjelly.
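A rough sketch of such a script is given here for anyone who wants to try the subsetting route. It is not part of Racon, and the function name and the `slack` tolerance are hypothetical. It relies on the standard PAF column order (query name, query length, query start, query end, strand, target name, target length, target start, target end, ...). It treats a read as contained when almost the whole read aligns to a contig, and as overhanging when it is not contained anywhere but one of its alignments reaches a contig end. Strand and overhang direction are ignored for simplicity.

```python
def select_bridging_reads(paf_lines, all_read_names, slack=100):
    """Return names of reads that are unmapped or that overhang a contig end."""
    mapped, contained, at_contig_end = set(), set(), set()
    for line in paf_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        name = fields[0]
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        tlen, tstart, tend = int(fields[6]), int(fields[7]), int(fields[8])
        mapped.add(name)
        # Contained: at most `slack` unaligned bases on each side of the read.
        if qstart <= slack and qlen - qend <= slack:
            contained.add(name)
        # The alignment reaches the start or the end of the contig.
        if tstart <= slack or tlen - tend <= slack:
            at_contig_end.add(name)
    unmapped = set(all_read_names) - mapped
    overhanging = at_contig_end - contained
    return unmapped | overhanging


# Illustrative records only: r1 is contained, r2 overhangs the end of ctg1,
# r3 has no alignment, r4 aligns partially in the middle of ctg1.
paf = [
    "r1\t1000\t0\t1000\t+\tctg1\t5000\t100\t1100\t950\t1000\t60",
    "r2\t1000\t0\t600\t+\tctg1\t5000\t4400\t5000\t570\t600\t60",
    "r4\t1000\t200\t800\t+\tctg1\t5000\t2000\t2600\t560\t600\t60",
]
selected = select_bridging_reads(paf, ["r1", "r2", "r3", "r4"])
```

The returned names could then be used to extract the matching records from the read file before running Racon on the scaffolder output.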

Best regards, Robert

nottwy commented 7 years ago

What's the difference between consensus and polishing as you use the terms here? In the description of Racon, you said "The goal of Racon is to generate genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods." In my view, the consensus step employed by other assembly methods amounts to polishing. Is there a clear line between consensus and polishing? I look forward to hearing your thoughts on this.

rvaser commented 7 years ago

As I see it, consensus is the part of the OLC paradigm in which you obtain an initial assembly, while polishing uses a different sequencing technology (like Illumina) or sequencer-specific information (like the signal level of ONT) to further increase the accuracy of the assembly. You might use Racon as a polisher, but it was primarily developed as a consensus module.

Best regards, Robert