more variations called after two rounds of ont reads polishing

isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:

https://github.com/lbcb-sci/racon

MIT License


more variations called after two rounds of ont reads polishing #134

Open cai1991 opened 5 years ago

cai1991 commented 5 years ago

Hello,

Thank you very much for developing such a great polishing tool. I have used Racon to polish my raw contigs (assembled from raw ONT reads) twice with ONT reads. I then called variations for both the raw contigs and the ONT-polished contigs using Illumina reads. I found more variations in the polished contigs, although the total variation length is slightly smaller. Is this normal?

As follows: ONT-polished: two rounds of Racon with ONT reads; Illumina-polished: three rounds of Pilon with Illumina reads

Thank you very much for your kind help.

Best regards, Chengcheng

rvaser commented 5 years ago

Hi Chengcheng, what is the coverage of your dataset? Did you use a reference for variation calling, or the Illumina reads?

Best regards, Robert

cai1991 commented 5 years ago

Hi, Robert,

The coverage of my ONT reads is ~66X. I mapped the Illumina reads to the assembled contigs and called variations with GATK. No reference was used for variation calling. The coverage of the Illumina reads is ~80X; the same reads were also used for Pilon polishing.

Best regards, Chengcheng
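The comparison described above (number of variations versus total variation length) can be sketched as a small summary over VCF records. This is only an illustration: the function name `summarize_vcf` is hypothetical, and defining the length of a variation as the longer of its REF and ALT alleles is an assumption, since the thread does not say how the total length was measured.

```python
def summarize_vcf(lines):
    """Return (variant_count, total_variant_length) for VCF text lines.

    Assumption: the length of a variant is max(len(REF), len(ALT)).
    Symbolic alleles such as <DEL> and the '*' allele are skipped.
    """
    count = 0
    total_length = 0
    for line in lines:
        # Skip blank lines and header lines (## meta lines and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3]
        for alt in fields[4].split(","):
            if alt.startswith("<") or alt in ("*", "."):
                continue
            count += 1
            total_length += max(len(ref), len(alt))
    return count, total_length


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ctg1\t10\t.\tA\tG\t50\tPASS\t.",   # SNP, length 1
    "ctg1\t20\t.\tAT\tA\t50\tPASS\t.",  # 1 bp deletion, length 2
]
print(summarize_vcf(example))
```

Running the same summary on the call sets for the raw and polished contigs gives the two numbers being compared in this thread.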

rvaser commented 5 years ago

Can you please check what the average quality of the ONT reads is?

cai1991 commented 5 years ago

We generated these ONT reads from 3 flowcells. The mean.q values are 9.4, 8.9 and 8.8. I merged these reads together before use.

Best regards, Chengcheng
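For readers who want to check read quality themselves, here is a minimal sketch of computing a mean quality from a FASTQ quality line. It assumes the standard Phred+33 encoding; the function names are hypothetical, and the thread does not say which kind of averaging produced the reported mean.q values, so both common variants are shown.

```python
import math


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    if not quality_line:
        raise ValueError("empty quality line")
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line):
    """Arithmetic mean of the Phred scores."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores)


def mean_quality_from_error(quality_line):
    """Average the per-base error probabilities, then convert back to Phred."""
    scores = phred_scores(quality_line)
    mean_error = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)


# '+' encodes Q10 and '5' encodes Q20 under Phred+33.
print(mean_quality("+5"))
print(mean_quality_from_error("+5"))
```

The second variant is dominated by the low-quality bases, so it is never higher than the arithmetic mean of the scores.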

rvaser commented 5 years ago

There is a tiny chance that this is the issue, as Racon employs a quality threshold of 10 on each window. Try running one iteration with the parameter -q 8 and then call variants again.
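A minimal sketch of how such a run could be assembled from Python follows. The file names and the helper `racon_command` are hypothetical placeholders; the positional order (sequences, overlaps, target sequences) and the `-q` and `-t` options follow Racon's usage text. The command is only built and printed here, not executed.

```python
import shlex


def racon_command(reads, overlaps, contigs, quality_threshold=8.0, threads=4):
    """Build a Racon command line with a custom window quality threshold.

    All paths are placeholders; Racon writes the polished sequences to stdout.
    """
    return [
        "racon",
        "-t", str(threads),
        "-q", str(quality_threshold),
        reads,      # sequences used for polishing
        overlaps,   # overlaps between reads and contigs (e.g. PAF)
        contigs,    # target sequences to be polished
    ]


cmd = racon_command("ont_reads.fastq", "overlaps.paf", "contigs.fasta")
print(shlex.join(cmd))
```

The threshold can only have an effect when the reads carry base qualities, that is, when they are supplied as FASTQ.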

cai1991 commented 5 years ago

Thank you for the suggestion. I will try it.

Best regards, Chengcheng

cai1991 commented 5 years ago

Hi, Robert,

I just realized that I used FASTA files (for both the sequences and the target sequences) for Racon polishing. The overlaps file is in PAF format and was generated by mapping the ONT reads (also a FASTA file) to my contigs with minimap2. Will this be a problem? How does Racon obtain quality information in this case?

Best regards, Chengcheng

rvaser commented 5 years ago

The FASTA file will not be a problem, as Racon does not use qualities in this case. I am not sure why there is a minimal difference in variations. Which assembler was used to obtain the initial assembly?
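A quick way to confirm whether a reads file carries base qualities at all is to look at its first record. The sketch below is a simple heuristic, not part of Racon: the function name `has_base_qualities` is hypothetical, and it assumes uncompressed text where FASTA records start with `>` and FASTQ records start with `@`.

```python
def has_base_qualities(lines):
    """Return True if the first non-empty line looks like a FASTQ header.

    `lines` is any iterable of text lines, for example an open file handle.
    FASTA records ('>') have no quality line, so nothing can be thresholded.
    """
    for line in lines:
        if line.strip():
            return line.startswith("@")
    return False


fasta_reads = [">read1", "ACGTACGT"]
fastq_reads = ["@read1", "ACGTACGT", "+", "IIIIIIII"]
print(has_base_qualities(fasta_reads), has_base_qualities(fastq_reads))
```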

cai1991 commented 5 years ago

I used smartdenovo to assemble the raw ONT reads. It produced very contiguous contigs, with a contig N50 of 9.2 Mb and a total contig size of 550 Mb. I was very satisfied with these assembly statistics, and the total contig size was very reasonable. The complete BUSCO score of the initial assembly was 86.4%; Racon improved it to 90.6% (round 1) and 90.4% (round 2).

Best regards, Chengcheng

rvaser commented 5 years ago

Does smartdenovo employ any accuracy boosting during assembly or is the final error equal to the error in raw reads?

cai1991 commented 5 years ago

I'm not sure about the details of how this assembler works. But from what I read on their GitHub page, https://github.com/ruanjue/smartdenovo/blob/master/README-tools.md, the final consensus sequence is more accurate than the raw reads, reaching 99.7%. They still suggest using other tools to improve the accuracy.

Best regards, Chengcheng

rvaser commented 5 years ago

Well then I think it is not that surprising that the accuracy changed only a little. Maybe you can try different alignment parameters, like 2/-5/-2 or 3/-5/-4. You can also try Racon with Illumina reads.
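To see why the choice of alignment parameters can matter, the sketch below scores the same small difference in two ways under both parameter sets. It assumes the three numbers are match/mismatch/gap scores in that order with a linear gap penalty; the function `alignment_score` and the toy sequences are illustrative and are not taken from Racon.

```python
def alignment_score(a, b, match, mismatch, gap):
    """Score two already-aligned strings ('-' marks a gap), linear gap cost."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


# Two ways to explain the same difference between ACGT and AGGT:
substitution = ("ACGT", "AGGT")    # one mismatch
indel_pair = ("AC-GT", "A-GGT")    # one deletion plus one insertion

schemes = {"2/-5/-2": (2, -5, -2), "3/-5/-4": (3, -5, -4)}
for name, (m, x, g) in schemes.items():
    print(name,
          alignment_score(*substitution, m, x, g),
          alignment_score(*indel_pair, m, x, g))
```

Under 2/-5/-2 the gapped alignment scores higher (2 versus 1), while under 3/-5/-4 the substitution scores higher (4 versus 1), so the two settings can favor different explanations of the same reads and therefore different consensus bases.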