Open chklopp opened 6 years ago
Hello, I have started Ra on Escherichia, Saccharomyces and Drosophila and will report the results once the runs finish (I don't have the human dataset downloaded), but I can already give you an estimate of the requirements you asked about.
Disk space is bounded by the all-vs-all overlaps file, which can be quite large for more repetitive genomes. Ra also creates two or three mapping files needed for the Racon iterations, but they are much smaller than the all-vs-all overlaps file. At least two assembly files are created as well, but I think their size is negligible.
Memory consumption can be described per component. I didn't benchmark Minimap2, but I think its memory usage should be bounded by the read set, as it parses one of the files chunk by chunk and immediately reports the current results. Rala uses memory equal to the size of the read set if it is in FASTQ format (might be a little more), or twice the size of the read set if the input is in FASTA format. Racon needs space for the reads file, the mapping file and the assembly file.
Execution time is bounded by Racon (which is run at least two times, with a third run if Illumina data is present). For larger genomes the all-vs-all overlaps file can be large, and as Rala parses it several times, it can prolong the whole pipeline.
Please note that Ra is still in development, mostly due to the layout module Rala, and the execution time/memory consumption will probably decrease in the future.
Best regards, Robert
@chklopp, here are the results you requested:
| Dataset | Size (GB) | Coverage | Technology | CPU time (s) | Memory (GB) | Disk space (GB) |
|---|---|---|---|---|---|---|
| E. coli | 0.47 | 53.55 | ONT | 2257.78 | 3.47 | 0.23 |
| S. cerevisiae | 1.34 | 59.90 | ONT | 6728.22 | 8.48 | 1.86 |
| D. melanogaster | 29.47 | 109.54 | PacBio | 197902.57 | 59.59 | 186.54 |
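
To make the table easier to compare across datasets, here is a minimal sketch that converts CPU time to hours and relates disk usage to input size. The numbers are copied from the table; the `RESULTS` list and the `summarize` helper are hypothetical names used only for illustration, and the "Size (GB)" column is assumed to be the size of the input read set.

```python
# Rows copied from the results table:
# (dataset, input size in GB, CPU time in s, peak memory in GB, disk space in GB)
RESULTS = [
    ("E. coli", 0.47, 2257.78, 3.47, 0.23),
    ("S. cerevisiae", 1.34, 6728.22, 8.48, 1.86),
    ("D. melanogaster", 29.47, 197902.57, 59.59, 186.54),
]


def summarize(rows):
    """Return CPU hours and the disk-to-input ratio for each dataset."""
    summary = []
    for name, size_gb, cpu_s, _mem_gb, disk_gb in rows:
        summary.append(
            {
                "dataset": name,
                "cpu_hours": cpu_s / 3600.0,           # seconds -> hours
                "disk_per_input": disk_gb / size_gb,   # GB on disk per GB of input
            }
        )
    return summary


for row in summarize(RESULTS):
    print(
        f"{row['dataset']:<16} {row['cpu_hours']:7.2f} CPU h   "
        f"disk/input = {row['disk_per_input']:.2f}"
    )
```

The disk-to-input ratio grows considerably from E. coli to D. melanogaster, which fits the remark that the all-vs-all overlaps file is larger for more repetitive genomes.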
I forgot that I left some code in Rala which enables easier debugging, so its memory consumption is currently equal to twice the size of the read set in FASTQ format (that should be halved once Rala leaves the development stage).
For the first two sets, Minimap2 dominates memory usage, but on D. melanogaster the memory is bounded by Rala.
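
As a rough sanity check of the "twice the read set" rule, the sketch below compares that estimate with the measured peak memory from the table. It assumes the "Size (GB)" column is the read set size; `RALA_FACTOR`, `MEASURED` and `rala_estimate_gb` are hypothetical names, not part of Ra or Rala.

```python
# Factor taken from the comment above: Rala currently needs roughly
# twice the read set size. This is an estimate, not a measured constant.
RALA_FACTOR = 2.0

# dataset -> (read set size in GB, measured peak memory in GB), from the table
MEASURED = {
    "E. coli": (0.47, 3.47),
    "S. cerevisiae": (1.34, 8.48),
    "D. melanogaster": (29.47, 59.59),
}


def rala_estimate_gb(read_set_gb, factor=RALA_FACTOR):
    """Estimate Rala's memory use as a multiple of the read set size."""
    return factor * read_set_gb


for name, (size_gb, peak_gb) in MEASURED.items():
    estimate = rala_estimate_gb(size_gb)
    share = estimate / peak_gb  # fraction of the peak explained by the estimate
    print(f"{name:<16} estimate {estimate:6.2f} GB, peak {peak_gb:6.2f} GB, share {share:.2f}")
```

For D. melanogaster the estimate accounts for almost all of the measured peak, while for the two smaller datasets it covers only about a third or less, in line with Minimap2 dominating there.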
Best regards, Robert
Hi,
Could you tell us how much time, memory and disk space the assembly takes for model species such as E. coli, yeast, fruit fly and human?
Cheers,