paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/

advice on using suboptimal memory settings #10

Closed RvV1979 closed 1 year ago

RvV1979 commented 1 year ago

From the documentation, I understand that optimal performance requires access to a single machine with a large amount of memory. I have access to a shared machine with 755G of memory but, for obvious reasons, do not have root access. Using the default, suboptimal memory settings, my assembly of a heterozygous plant genome runs very fast and contig size is good enough for my purposes. However, I want to avoid assembly errors.

Therefore, I would like to ask how the default mode --memoryMode anonymous --memoryBacking 4K affects assembly results. I read "typically 30% degradation" but am unsure what exactly that means. Will contigs just be shorter, or will there be errors?

I hope you can clarify and give me some advice.

Thanks

paoloshasta commented 1 year ago

The memory options don't affect assembly quality in any way. The statement about 30% degradation refers to performance in terms of assembly time only.

I apologize for lack of clarity of the documentation on this point and I will make some changes.
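
To illustrate that the two configurations differ only in how memory is allocated, here is a minimal Python sketch that builds both command lines. The option names --memoryMode and --memoryBacking and their values are taken from this thread. The executable name `shasta`, the use of `--input` shown here, the file name `reads.fasta`, and the helper function itself are assumptions for illustration, so check them against your installed release.

```python
import shlex


def shasta_command(reads, memory_mode, memory_backing, executable="shasta"):
    """Return the argument list for one run (hypothetical helper).

    Only the two memory options vary between configurations; the input
    and every assembly-related option stay the same.
    """
    return [
        executable,            # assumed executable name
        "--input", reads,      # assumed input option and file name
        "--memoryMode", memory_mode,
        "--memoryBacking", memory_backing,
    ]


# Default settings: no root privilege needed, possibly slower.
default_run = shasta_command("reads.fasta", "anonymous", "4K")

# Settings the warning recommends for full speed (root privilege via sudo required).
fast_run = shasta_command("reads.fasta", "filesystem", "2M")

print(shlex.join(default_run))
print(shlex.join(fast_run))
```

Everything before the memory options is identical in the two commands, which matches the point above: the settings change how long the run takes, not what is assembled.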

RvV1979 commented 1 year ago

Thanks for the reassuring clarifications. Just for your information: my worry about degradation also stemmed from the warning from stdout, below:

This run used options "--memoryBacking 4K --memoryMode anonymous".
This could have resulted in performance degradation.
For full performance, use "--memoryBacking 2M --memoryMode filesystem"
(root privilege via sudo required).
Therefore the results of this run should not be used
for benchmarking purposes.

paoloshasta commented 1 year ago

In that case too, "performance" refers to assembly time only. I changed that message this morning to clarify this point. The new wording of the message will be in the next release. Thank you for reporting this - you had a valid point because the term "performance" in genomics is often used to really mean "quality" (in computer science, it typically just means "speed").
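
Since "performance" here means wall-clock time, the effect of the memory settings can be put in numbers by timing two otherwise identical runs. The sketch below is only arithmetic. The function name and the example timings are made up for illustration, and reading "30% degradation" as a run that takes 30% longer is an assumption, not something the documentation states.

```python
def slowdown_percent(baseline_seconds, observed_seconds):
    """Percent by which the observed run took longer than the baseline run."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (observed_seconds - baseline_seconds) / baseline_seconds


# Hypothetical timings: 10 hours with 2M/filesystem, 13 hours with 4K/anonymous.
baseline = 10 * 3600
observed = 13 * 3600

print(f"{slowdown_percent(baseline, observed):.0f}% slower")
```

Under that reading, a run that would take 10 hours with the recommended settings would take about 13 hours with the defaults, and the assembly itself would be the same.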

paoloshasta commented 1 year ago

The commit with the message change is here.