isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

The genome after polish is getting bigger #148

Open gooalzqshu opened 4 years ago

gooalzqshu commented 4 years ago

Hello, I have a question. After polishing with racon, the genome grew from 50 Mb to 67 Mb. Is this normal? The following is my run command:

racon -t 10 45_select.fa 45_select.sam assembly.part-45.fa > sample.racon_45.fa

The output is like this (it looks like there is no problem):

[racon::Polisher::initialize] loaded target sequences 0.649390 s
[racon::Polisher::initialize] loaded sequences 50.482765 s
[racon::Polisher::initialize] loaded overlaps 92.741777 s
[racon::Polisher::initialize] aligning overlaps [====================] 7.393737 s
[racon::Polisher::initialize] transformed data into windows 0.784281 s
[racon::Polisher::polish] generating consensus [====================] 223.445264 s
[racon::Polisher::] total = 376.347604 s
rvaser commented 4 years ago

Hello, it is not uncommon for the genome size to increase with multiple Racon rounds. Maybe the increase here is a bit bigger than usual. Can you paste the number of contigs in your assembly and the average length?

Best regards, Robert
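
The contig count and average length asked for above can be read straight from the FASTA file. The following is a minimal Python sketch, not part of racon: the function names are hypothetical, and it assumes a plain (uncompressed) FASTA. It can be run on the assembly before and after polishing to compare the two.

```python
import io


def read_fasta_lengths(handle):
    """Return a list with the sequence length of each FASTA record."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def summarize(lengths):
    """Contig count, total length, mean length and N50 for a list of lengths."""
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "mean_bp": total / len(lengths) if lengths else 0.0,
        "n50_bp": n50(lengths),
    }


if __name__ == "__main__":
    # Tiny in-memory example; for a real file use open("assembly.fa") instead.
    demo = io.StringIO(">a\nACGT\nAC\n>b\nACG\n>c\nA\n")
    print(summarize(read_fasta_lengths(demo)))
```

Comparing the two summaries contig by contig would also show whether the growth is spread evenly or concentrated in a few sequences.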

gooalzqshu commented 4 years ago

Hello, thank you for replying to my email so quickly. In fact, I only ran one round of racon polishing. To reduce memory usage, I first split the genome into about 50 parts and extracted the raw reads and the BAM file for each part. The example above is just one part of the genome. Overall, my whole genome grew from 2,581 Mb to 3,528 Mb after correction with racon, which I find strange. The following are the whole-genome statistics before and after polishing.

Before polish:

======================================================================
                          scaffold                   contig
                     length(bp)    number     length(bp)    number
         max_len      3,031,338                3,031,338          
             N10      1,027,631       180      1,027,631       180
             N20        666,029       496        666,029       496
             N30        454,921       966        454,921       966
             N40        326,057     1,638        326,057     1,638
             N50        230,008     2,588        230,008     2,588
             N60        162,325     3,934        162,325     3,934
             N70        115,084     5,820        115,084     5,820
             N80         75,078     8,603         75,078     8,603
             N90         42,445    13,151         42,445    13,151
    Total_length  2,581,092,116            2,581,092,116          
   number>=100bp                   24,130                   24,130
 number>=2,000bp                   23,993                   23,993
======================================================================
         GC_rate                    0.396                    0.396
======================================================================
       Total N bases: 0  ##  Min N: 0  ##  Max N: 0
======================================================================

After polish:

======================================================================
                          scaffold                   contig
                     length(bp)    number     length(bp)    number
         max_len      4,162,086                4,162,086          
             N10      1,450,108       176      1,450,108       176
             N20        933,220       483        933,220       483
             N30        639,927       940        639,927       940
             N40        459,758     1,592        459,758     1,592
             N50        325,441     2,512        325,441     2,512
             N60        229,513     3,811        229,513     3,811
             N70        163,186     5,630        163,186     5,630
             N80        107,296     8,299        107,296     8,299
             N90         60,684    12,643         60,684    12,643
    Total_length  3,528,214,396            3,528,214,396          
   number>=100bp                   23,314                   23,314
 number>=2,000bp                   23,259                   23,259
======================================================================
         GC_rate                    0.353                    0.353
======================================================================
       Total N bases: 0  ##  Min N: 0  ##  Max N: 0
======================================================================

The following are the statistics for one part of the genome after splitting.

Before polish:

======================================================================
                          scaffold                   contig
                     length(bp)    number     length(bp)    number
         max_len      2,066,707                2,066,707          
             N10      1,862,708         3      1,862,708         3
             N20      1,263,559         7      1,263,559         7
             N30        704,235        13        704,235        13
             N40        354,171        23        354,171        23
             N50        229,517        42        229,517        42
             N60        165,650        68        165,650        68
             N70         99,456       108         99,456       108
             N80         60,617       173         60,617       173
             N90         27,013       308         27,013       308
    Total_length     51,392,084               51,392,084          
   number>=100bp                      589                      589
 number>=2,000bp                      588                      588
======================================================================
         GC_rate                    0.396                    0.396
======================================================================
       Total N bases: 0  ##  Min N: 0  ##  Max N: 0
======================================================================

After polish:

======================================================================
                          scaffold                   contig
                     length(bp)    number     length(bp)    number
         max_len      2,870,983                2,870,983          
             N10      2,579,573         3      2,579,573         3
             N20      1,753,175         7      1,753,175         7
             N30      1,007,202        12      1,007,202        12
             N40        537,959        22        537,959        22
             N50        331,771        39        331,771        39
             N60        242,000        64        242,000        64
             N70        146,608       101        146,608       101
             N80         92,793       161         92,793       161
             N90         38,692       281         38,692       281
    Total_length     69,930,989               69,930,989          
   number>=100bp                      568                      568
 number>=2,000bp                      568                      568
======================================================================
         GC_rate                    0.354                    0.354
======================================================================
       Total N bases: 0  ##  Min N: 0  ##  Max N: 0
======================================================================
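
To put the reported totals side by side, the relative growth can be computed from the Total_length rows above. This is only an arithmetic sketch in Python. The numbers are copied from the tables, and the helper name is hypothetical.

```python
def growth_percent(before_bp, after_bp):
    """Relative size change in percent, positive when the assembly grew."""
    return 100.0 * (after_bp - before_bp) / before_bp


# Total_length values taken from the statistics tables above.
REPORTED = {
    "whole genome": (2_581_092_116, 3_528_214_396),
    "one part": (51_392_084, 69_930_989),
}

for label, (before, after) in REPORTED.items():
    change = growth_percent(before, after)
    print(f"{label}: {before:,} -> {after:,} bp ({change:+.1f}%)")
```

Both the whole genome and the single part grow by roughly 36 to 37 percent, so the increase looks systematic rather than specific to one part.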
rvaser commented 4 years ago

Which tool did you use to create the sam file? What is the coverage of your read set?

gooalzqshu commented 4 years ago

I used minimap2 for mapping. I am computing the read coverage on the genome with bedtools genomecov -d -split, which may take some time. Here is the command I used for mapping:

minimap2 -t 15 -ax map-pb --secondary=no sample.contigs.fasta used_3row.part-056.fa | samtools view -@ 15 -bS -t sample.contigs.fasta.fai - -o minimap_56.bam

Thank you very much !

rvaser commented 4 years ago

Well, I am not sure what to tell you. You could try passing overlaps in PAF format instead of SAM by discarding the -a parameter in minimap2, and see if the same happens. On the other hand, what is the expected genome size?

gooalzqshu commented 4 years ago

Ok, thank you. I will try to output a PAF file instead of the SAM file. In any case, I would expect the genome size not to change significantly after polishing. In addition, I calculated that the raw reads cover 98.9% of the genome. Thank you for your advice.

Best regards, Zqshu
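
"Coverage" can mean breadth (the fraction of the genome covered by at least one read, such as the 98.9% above) or depth (how many reads cover each base on average). If the earlier question was about depth, which is an assumption here, a rough per-contig estimate can be taken from the PAF output of minimap2. The Python sketch below uses a hypothetical function name and relies only on the standard PAF columns: target name (6), target length (7), target start (8) and target end (9).

```python
import io
from collections import defaultdict


def mean_depth_from_paf(handle):
    """Approximate mean depth per target: aligned target bases / target length."""
    aligned = defaultdict(int)  # target name -> sum of aligned target bases
    target_len = {}             # target name -> target sequence length
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:    # PAF has 12 mandatory columns; skip short lines
            continue
        tname = fields[5]
        target_len[tname] = int(fields[6])
        aligned[tname] += int(fields[8]) - int(fields[7])
    return {
        name: aligned[name] / length
        for name, length in target_len.items()
        if length > 0
    }


# Two made-up alignments against a 1,000 bp contig, for illustration only.
demo = io.StringIO(
    "r1\t900\t0\t900\t+\tctg1\t1000\t0\t900\t850\t900\t60\n"
    "r2\t600\t0\t600\t-\tctg1\t1000\t400\t1000\t570\t600\t60\n"
)
print(mean_depth_from_paf(demo))
```

This counts every alignment line, so secondary or overlapping alignments of the same read would inflate the estimate. Contigs with no alignments at all do not appear in the result.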