isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

The same reads were not polished #107

Open cym0304 opened 5 years ago

cym0304 commented 5 years ago

Hi, I created a test file like this:

@test1
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The read is 100 bases long. I copied it 5 times and changed the IDs, so the test file has 5 reads with the same sequence and different names. Then I tested racon with commands like this:

minimap2 ./test2.fq ./test2.fq -a -t 4 > test2.align.sam
racon ./test2.fq ./test2.align.sam ./test2.fq > test2_consensus.fa

But the consensus reads like this:

>test2 LN:i:100 RC:i:2 XC:f:1.000000
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
>test4 LN:i:100 RC:i:2 XC:f:1.000000
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG

I expected only one sequence in the result file. Instead I got two identical consensus sequences, and the RC values show that only 4 raw reads were polished. I then set -w to 10 or 50, but that did not help.

What is the reason for this? And could you tell me what I should do to get the correct result? Thank you.
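
For anyone who wants to reproduce the test file described in this comment, a minimal sketch of a generator script follows. The function name `make_fastq` and the read names `test1` to `test5` are assumptions made for illustration (the comment only shows `@test1`); the sequence, quality string, and output file name `test2.fq` follow the example above.

```python
# Hypothetical helper: builds a FASTQ string with n identical reads and distinct IDs.
def make_fastq(n_reads=5, unit="ATCG", length=100, qual_char="!"):
    # Repeat the unit and trim to the requested read length.
    seq = (unit * (length // len(unit) + 1))[:length]
    qual = qual_char * length
    records = []
    for i in range(1, n_reads + 1):
        # Read names test1..testN are an assumption; only the IDs differ.
        records.append(f"@test{i}\n{seq}\n+\n{qual}\n")
    return "".join(records)


if __name__ == "__main__":
    # Write the file name used in the commands above.
    with open("test2.fq", "w") as handle:
        handle.write(make_fastq())
```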

rvaser commented 5 years ago

Hello, what is the purpose of this test you conducted? Do you want to test read error correction or contig polishing?

Best regards, Robert

cym0304 commented 5 years ago

Hello, I want to test contig polishing. I think repeated copies of the same sequence should be clustered into one consensus.

Best regards, cym

rvaser commented 5 years ago

This might occur but there is no guarantee for it. If you have repetitive regions without any overhangs left or right, it is highly likely that there will be multiple consensus sequences with that region because racon picks the best (longest) overlap for each read and then polishes each target sequence. No other procedures are invoked.

Best regards, Robert
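
The behaviour described in the reply above can be illustrated with a toy sketch. This is not racon's implementation: the overlap list is invented, the function name `assign_reads` is hypothetical, and both the skipping of self-overlaps and the tie-breaking rule (first longest overlap wins) are assumptions. It only shows why picking one best overlap per read, with no later clustering step, can leave several targets that each receive reads and so each produce a consensus.

```python
from collections import defaultdict


def assign_reads(overlaps):
    """Keep the longest overlap per read, then group reads by the chosen target."""
    best = {}
    for query, target, length in overlaps:
        if query == target:
            continue  # assumption: a read is not used to polish itself
        # Assumption: on ties, the first overlap seen is kept.
        if query not in best or length > best[query][1]:
            best[query] = (target, length)
    groups = defaultdict(list)
    for query, (target, _) in best.items():
        groups[target].append(query)
    return dict(groups)


# Invented overlaps between identical 100 bp reads: every overlap has the
# same length, so which target a read ends up on is effectively arbitrary.
overlaps = [
    ("test1", "test2", 100),
    ("test1", "test4", 100),
    ("test3", "test2", 100),
    ("test5", "test4", 100),
    ("test2", "test4", 100),
    ("test4", "test2", 100),
]

groups = assign_reads(overlaps)
for target, reads in sorted(groups.items()):
    # Each target that received at least one read would get its own consensus.
    print(target, sorted(reads))
```

In this toy input two different targets end up with reads assigned to them, so two consensus sequences would be written even though all reads are identical.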