rvaser / ra

This repository is deprecated; please use https://github.com/lbcb-sci/ra instead.
MIT License

benchmarking #2

Open chklopp opened 6 years ago

chklopp commented 6 years ago

Hi,

Could you provide us with information on the run time, memory and disk space needed to assemble model species such as E. coli, yeast, fruit fly and human?

Cheers,

rvaser commented 6 years ago

Hello, I have started Ra on Escherichia, Saccharomyces and Drosophila and will report the results once the runs finish (I don't have the human dataset downloaded). In the meantime, I can give you an estimate of the requirements you asked about.

Disk space is bounded by the all-vs-all overlaps file, which can be quite large for more repetitive genomes. Ra also creates two or three mapping files needed for the Racon iterations, but they are much smaller than the all-vs-all overlaps file. At least two assembly files are created as well, but I think their size is negligible.

Memory consumption can be broken down by component. I didn't benchmark Minimap2, but I think its memory usage should be bounded by the read set, as it parses one of the files chunk by chunk and reports results immediately. Rala uses memory roughly equal to the size of the read set if it is in FASTQ format (possibly a little more), or twice the size of the file if the input is in FASTA format. Racon needs space for the reads file, the mapping file and the assembly file.
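The Rala rule of thumb above can be written down as a small calculation. This is only a rough sketch of the estimate: the function name `estimate_rala_memory_gb` and its parameters are made up for illustration and are not part of Ra or Rala, and the factors are kept adjustable because Rala is still in development.

```python
def estimate_rala_memory_gb(reads_size_gb, fmt, fastq_factor=1.0, fasta_factor=2.0):
    """Rough Rala memory estimate (GB) from the size of the reads file.

    Hypothetical helper, not part of Ra. The default factors follow the
    rule of thumb in this comment: about 1x the file size for FASTQ input
    and about 2x the file size for FASTA input.
    """
    fmt = fmt.lower()
    if fmt == "fastq":
        return reads_size_gb * fastq_factor
    if fmt == "fasta":
        return reads_size_gb * fasta_factor
    raise ValueError("fmt must be 'fastq' or 'fasta', got %r" % fmt)


if __name__ == "__main__":
    # Example with a made-up 10 GB reads file in each format.
    for fmt in ("fastq", "fasta"):
        print(fmt, estimate_rala_memory_gb(10.0, fmt), "GB")
```

The factors are arguments rather than constants so the estimate can be adjusted if measured usage turns out to differ.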

Execution time is dominated by Racon, which is run at least twice (a third run is added if Illumina data is present). For larger genomes the all-vs-all overlaps file can be large, and since Rala parses it several times, this can prolong the whole pipeline.

Please note that Ra is still in development, mostly due to the layout module Rala, and the execution time/memory consumption will probably decrease in the future.

Best regards, Robert

rvaser commented 6 years ago

@chklopp, here are the results you requested:

| Dataset | Size (GB) | Coverage | Technology | CPU time (s) | Memory (GB) | Disk space (GB) |
|---|---|---|---|---|---|---|
| E. coli | 0.47 | 53.55 | ONT | 2257.78 | 3.47 | 0.23 |
| S. cerevisiae | 1.34 | 59.90 | ONT | 6728.22 | 8.48 | 1.86 |
| D. melanogaster | 29.47 | 109.54 | PacBio | 197902.57 | 59.59 | 186.54 |

I forgot that I had left some code in Rala that makes debugging easier, so its memory consumption is actually about twice the size of the read set in FASTQ format (this should drop to half of that once Rala leaves the development stage).

For the first two datasets, Minimap2 dominates memory usage, but on D. melanogaster the memory is bounded by Rala.
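As a quick sanity check, the figures in the table can be turned into CPU hours and a memory-to-input ratio. The sketch below only does arithmetic on the reported numbers. It assumes the "Size (GB)" column is the size of the reads file, which the table does not state explicitly, and the helper name `summarize` is made up for illustration.

```python
# Reported values copied from the table: (size GB, CPU time s, memory GB).
RESULTS = {
    "E. coli": (0.47, 2257.78, 3.47),
    "S. cerevisiae": (1.34, 6728.22, 8.48),
    "D. melanogaster": (29.47, 197902.57, 59.59),
}


def summarize(results):
    """Return CPU hours and memory per GB of input for each dataset."""
    summary = {}
    for name, (size_gb, cpu_s, mem_gb) in results.items():
        summary[name] = {
            "cpu_hours": cpu_s / 3600.0,
            "mem_per_input_gb": mem_gb / size_gb,
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(RESULTS).items():
        print("%-16s %7.2f CPU h  %5.2f GB memory per GB of input"
              % (name, row["cpu_hours"], row["mem_per_input_gb"]))
```

Under that assumption, the ratio for D. melanogaster comes out at roughly 2, in line with Rala using about twice the read set size, while the two smaller datasets show much higher ratios, which fits Minimap2 dominating memory usage there.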

Best regards, Robert