isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

Memory consumption #71

Closed zhouyiqi91 closed 6 years ago

zhouyiqi91 commented 6 years ago

Hi, I have been trying to use racon to polish a plant genome. The assembled genome size is 711 Mb, the PacBio reads in FASTA format are 88 Gb, and the PAF file is 2.6 Gb.
My machine has ~120 Gb of RAM. When I run the following command:

racon -t 30 all_reads.fasta all.paf genome.fasta

it consumes more than 120 Gb of memory and the process gets killed. So I used the wrapper script:

python racon_wrapper -t 30 --split 200000000 all_reads.fasta all.paf genome.fasta

The genome.fasta is split into 4 parts, but memory consumption is still very high. I have noticed that racon loads the reads before the PAF file. If I extract only the reads that are mapped to part1.fasta instead of using all the reads, will that decrease memory usage? Thank you.
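The read-extraction idea above can be sketched in a few lines of Python. This is a hypothetical helper, not part of racon or its wrapper: it reads the PAF (column 1 is the query/read name, column 6 is the target/contig name), collects the reads that overlap a chosen set of contigs, and keeps only those records from the FASTA.

```python
def reads_mapped_to(paf_lines, target_names):
    """Collect query (read) names whose PAF records hit any contig
    in target_names. PAF column 1 = query name, column 6 = target name."""
    keep = set()
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 6 and cols[5] in target_names:
            keep.add(cols[0])
    return keep

def filter_fasta(fasta_lines, keep):
    """Keep only FASTA records whose name (first token after '>')
    is in the keep set."""
    out, write = [], False
    for line in fasta_lines:
        if line.startswith(">"):
            write = line[1:].split()[0] in keep
        if write:
            out.append(line)
    return out
```

For real data sizes you would stream the files rather than hold them in lists, and tools like `seqtk subseq` do the FASTA filtering step efficiently; the sketch only shows the logic.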

rvaser commented 6 years ago

Hello, is the 88Gb read file compressed or plain FASTA format? What is the output log before the process gets killed?

Best regards, Robert

zhouyiqi91 commented 6 years ago

1. The 88 Gb FASTA file is not compressed.
2. The output log is:

[racon::Polisher::initialize] loaded target sequences
[racon::Polisher::initialize] loaded sequences
[racon::Polisher::initialize] loaded overlaps
/opt/gridengine/default/spool/node291/job_scripts/1903746: line 9: 29186 Killed racon -t 30 ./pre_racon/all.fasta ./pre_racon/all.paf genome.fasta

rvaser commented 6 years ago

It is a bit odd that it does not fit in 120 GB of RAM. Does the run get killed even with file splitting enabled?

You can extract the reads for each of the 4 parts, or you can use the subsample option to reduce coverage from your initial ~100x to 60x. Run the wrapper with --subsample 700000000 60. It might yield slightly lower accuracy compared to the full read set.
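The arithmetic behind the subsampling suggestion can be checked quickly. The numbers below come from the thread (88 Gb of reads, 711 Mb assembly); note that raw file size slightly overestimates the base count because of headers and newlines, so this is an upper-bound estimate.

```python
def coverage(read_bases, genome_size):
    """Approximate sequencing depth: total read bases / genome size."""
    return read_bases / genome_size

genome = 711e6            # 711 Mb assembly
read_bases = 88e9         # ~88 Gb FASTA, upper bound on bases

full_cov = coverage(read_bases, genome)   # roughly 120x
kept_bases = genome * 60                  # bases retained at 60x
```

So subsampling to 60x roughly halves the read set (to about 43 Gbp), which is what reduces the memory footprint.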