paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/

Program shasta breakpoint reruns #14

Closed Ban2dao closed 8 months ago

Ban2dao commented 1 year ago

Hi, I recently tried to assemble a genome using the shasta program. performance.log showed that the read graph step had completed (I think), but when the maker graph step ran, the program was killed due to I/O. I wonder whether shasta can resume from the maker graph step using the previous results.

colindaven commented 1 year ago

Do you mean the marker graph? I've run shasta a lot and, to my knowledge, it was only ever killed due to RAM. Who killed it due to I/O problems? Was it the server admin?

I don't think you can resume shasta, but most large (multi-GB) assemblies should only take 8-16 hours, so that's not a major problem.

Ban2dao commented 1 year ago

Thank you very much for your answer!

I'm not sure what caused shasta to be killed. The performance.log and stdout.log information was attached as two screenshots (images not preserved here).

But that's not my focus. I just want to know whether shasta can resume from the interrupted stage if any of the other stages (filter reads, pick Kmer, etc.) is killed, but it doesn't seem to be possible.

In general, large genomes do take only 8-16 hours, but mine is very large (30 Gb), and the input read data is 60X~80X. I set desiredCoverage=1400G (about 40X) for the assembly and ran it for 45 days. Unfortunately, shasta reported an error in the marker graph step.

colindaven commented 1 year ago

Ouch, that's huge. You'll need a very large-memory machine to succeed (6-12 TB RAM?).

I regularly run a 17 GB genome in <12 hours with different parameters on a 7 year old 80 core 3TB machine, so 45 days sounds ... a bit off.

Are you using a hard disk or an SSD for the disk-based assembly? (I haven't tried this yet because I have no admin rights.) https://paoloshasta.github.io/shasta/Running.html#LowMemory

Ban2dao commented 1 year ago

Yes, I adjusted some parameters to minimize memory use, such as minReadLength, desiredCoverage, minBucketSize, maxBucketSize, etc. The command line options are --memoryMode filesystem --memoryBacking disk, because I don't have root permission.

Here is my configuration file: test.txt

paoloshasta commented 1 year ago

I think @colindaven is correct that the program was killed because of memory usage.

And running with --memoryBacking disk can slow you down a lot, particularly if you are not on SSDs (per @colindaven's comment). For large assemblies, exclusive use of a large memory machine with root access and --memoryBacking 2M really is the way to go. For a 30 Gb genome at 60X you have about 2 Tb of input, so I agree with @colindaven's estimate that you will need around 10 TB.
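The "about 2 Tb of input" figure is just genome size times coverage. A minimal Python sketch of that arithmetic follows; the function name is illustrative and not part of Shasta, and the numbers are the ones quoted in this thread.

```python
def total_input_bases(genome_size_bases: int, coverage: int) -> int:
    """Total number of input bases for a genome sequenced at a given coverage."""
    return genome_size_bases * coverage


# Figures quoted in the thread: a 30 Gb genome sequenced at 60X.
genome_size = 30_000_000_000
bases = total_input_bases(genome_size, 60)

# Express the result in terabases (1 Tb = 1e12 bases).
print(f"{bases / 1e12:.1f} Tb of input")
```

At 60X this comes to 1.8 Tb, which rounds to the 2 Tb quoted above. At the upper end of the 60X~80X range it would be 2.4 Tb.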

The only thing I can add to @colindaven's comments is that for a large genome like this you should consider decreasing marker density - that is, --Kmers.probability 0.05 or lower instead of the standard --Kmers.probability 0.1. Memory requirements will go down almost in proportion, and this usually does not cause large negative effects on assembly quality. However, reducing the marker density means that other assembly options will also probably need to be adjusted. @colindaven may have some experience and suggestions regarding that.

So you might be able to run at reduced marker density on a 5 TB machine. Machines of this size are available on the major cloud computing platforms at reasonable prices - by that I mean that compute cost will be substantially lower than the sequencing cost you had to incur to generate that amount of data. When using a machine on a cloud computing platform you get exclusive access and so you have root privilege.
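The 5 TB figure follows from the rule of thumb above that memory goes down almost in proportion to marker density. A rough Python sketch, assuming exactly linear scaling (the thread only says "almost") and taking the 10 TB estimate at the standard --Kmers.probability 0.1 as the baseline; the function name is illustrative and not part of Shasta.

```python
def scaled_memory_tb(baseline_tb: float,
                     baseline_probability: float,
                     new_probability: float) -> float:
    """Estimate memory after changing marker density, assuming linear scaling.

    This is only a back-of-the-envelope estimate: the thread states that
    memory goes down "almost in proportion", not exactly in proportion.
    """
    return baseline_tb * new_probability / baseline_probability


# Baseline from the thread: about 10 TB at --Kmers.probability 0.1.
estimate = scaled_memory_tb(10.0, 0.1, 0.05)
print(f"Estimated memory at probability 0.05: about {estimate:.0f} TB")
```

Treat the result as a lower bound on what to provision, since the scaling is not exactly linear.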

Alternatively, you could try a 2-4 TB machine with SSD and --memoryBacking disk.
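The options mentioned in this thread can be collected into one command line. The Python sketch below only builds and prints that command; it does not run an assembly. It assumes the executable is on the PATH as `shasta`, uses a hypothetical input file `reads.fasta`, and assumes `--input` is the option that names the reads file. A real run would need further options (for example an assembly configuration), which are omitted here.

```python
import shlex
from typing import List, Optional


def build_shasta_command(input_path: str,
                         kmer_probability: float,
                         memory_mode: Optional[str] = None,
                         memory_backing: Optional[str] = None) -> List[str]:
    """Assemble an argument list from the options discussed in the thread."""
    # "shasta" and "--input" are assumptions; see the lead-in above.
    cmd = ["shasta", "--input", input_path,
           "--Kmers.probability", str(kmer_probability)]
    if memory_mode is not None:
        cmd += ["--memoryMode", memory_mode]
    if memory_backing is not None:
        cmd += ["--memoryBacking", memory_backing]
    return cmd


# Disk-backed variant without root access, at reduced marker density.
# "reads.fasta" is a hypothetical file name.
disk_cmd = build_shasta_command("reads.fasta", 0.05,
                                memory_mode="filesystem",
                                memory_backing="disk")
print(shlex.join(disk_cmd))
```

Building the command as a list keeps each option and its value together and avoids shell quoting problems if the list is later passed to `subprocess.run`.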

paoloshasta commented 1 year ago

And yes, there is no breakpoint/restart functionality, and it is unlikely that one will be provided in the future.

Ban2dao commented 1 year ago

Ok, I will try changing --Kmers.probability from 0.1 to 0.05. How should the other assembly options be adjusted?

paoloshasta commented 1 year ago

I can give some suggestions if you post more information:

paoloshasta commented 8 months ago

I am closing this due to lack of discussion. Please create a new issue if additional discussion topics arise.