nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Detecting RAM killed Dorado runs #320

Closed jcolicchio-soundag closed 1 year ago

jcolicchio-soundag commented 1 year ago

We have recently been trying out running Dorado as part of an automated pipeline to do methylation calling with the all-cytosine model on some recent PromethION runs. We have noticed that on our current cluster configuration, with only ~60GB of RAM, our Dorado runs consistently fail after about ~36 hours. We have tracked this down, and it is clear that Dorado is killed because it runs out of RAM. I assume this is related to the way Dorado holds the large .bam or intermediate files in working memory.

Luckily, we have been able to get past this by using the new resume function (thanks a ton for this feature!), but we have not come up with a simple way to detect whether a dorado run failed and needs to be re-run with --resume. We were hoping that the fact that a dorado run was "killed" could be written to a standard output file, or alternatively that the progress bar could be saved as described in #307. Sadly, we have been unable to figure out a way to automatically detect that a run failed partway through and needs to be resumed.

Does anyone have recommendations on (A) how to run Dorado so that it is not as RAM-hungry and can run on a machine with about 60GB of RAM without crashing, or (B) how to automatically detect if a dorado run got killed and needs to be resumed?

Thanks, Jack

vellamike commented 1 year ago

Hi @jcolicchio-soundag,

Thank you for reaching out with this detailed information about the issue you're facing with Dorado. Gradually increasing memory utilization during methylation calling is a known issue. We've received similar reports, and our team is actively investigating it.

As for detecting whether a Dorado run was killed due to running out of memory, this is dependent on your operating system and cluster manager. Here are a couple of specific approaches you might consider:

  1. Using dmesg: You can search the system logs for lines containing "Killed" to identify whether the process was terminated due to memory issues.
  2. Job Scheduler Utilities: If you're using a job scheduler like SLURM, tools such as the sacct command can reveal information about job termination, including memory-related problems (both approaches are sketched below).
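
For example, a minimal sketch of both checks (the job ID, process name, and exact log wording are assumptions and vary by kernel and SLURM version):

```bash
# Check the kernel log for an OOM kill of dorado (may need root, depending
# on kernel.dmesg_restrict). Modern kernels log "Out of memory: Killed process ...".
sudo dmesg -T | grep -iE "killed process.*dorado"

# On SLURM, inspect the accounting record for a given job ID (12345 is a placeholder);
# an OOM-killed job typically shows State=OUT_OF_MEMORY and a non-zero ExitCode.
sacct -j 12345 --format=JobID,State,ExitCode,MaxRSS
```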

However, addressing the root cause might be more effective. Running multiple smaller Dorado jobs on subsets of the POD5s might be a strategic approach. This would not only align better with typical cluster management practices (where long-running jobs can be problematic) but also provide more resilience and manageability.

Here's how you could proceed (a rough shell sketch follows the list):

  1. Divide the Data: Break the data into smaller POD5s that can be processed individually (your data may already be in multiple small POD5s), and place one or more POD5s into their own directory.
  2. Run Parallel Jobs: Execute multiple Dorado instances on these directories of POD5s, thus ensuring each job consumes a manageable amount of memory and runs for a limited time.
  3. Merge Results: Combine the resultant BAM files as needed.
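
As a rough sketch of those three steps (the model name, modified-bases argument, reference, and directory layout are all placeholders; the loop runs serially here, but each chunk could equally be submitted as its own cluster job):

```bash
#!/usr/bin/env bash
set -euo pipefail

MODEL=dna_r10.4.1_e8.2_400bps_sup@v4.1.0   # example model name
REF=reference.fasta                         # example reference
POD5_ROOT=pod5_chunks                       # e.g. pod5_chunks/chunk_00, chunk_01, ...

# 1. Divide: beforehand, spread the POD5s across chunk directories, e.g.
#    mkdir -p pod5_chunks/chunk_00 && mv run/batch0_*.pod5 pod5_chunks/chunk_00/

# 2. Run one dorado job per chunk so each stays within its memory budget.
for dir in "$POD5_ROOT"/chunk_*; do
    name=$(basename "$dir")
    dorado basecaller "$MODEL" "$dir" \
        --modified-bases 5mC --reference "$REF" \
        > "calls_${name}.bam"
done

# 3. Merge the per-chunk BAMs (requires samtools); sort first so downstream
#    tools that expect coordinate order are happy.
for bam in calls_chunk_*.bam; do
    samtools sort -o "sorted_${bam}" "$bam"
done
samtools merge calls_merged.bam sorted_calls_chunk_*.bam
```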

This approach might require adjustments to your current pipeline, but it will be much more resilient.

Best wishes, Mike

tijyojwad commented 1 year ago

how to automatically detect if a dorado run got killed and needs to be resumed?

Another option is to check dorado's exit code. When dorado completes successfully, the exit code is 0. If it errors out for any reason (a process killed by the OS due to OOM exits with code 137 on Linux), the exit code is non-zero. So your script could check for a non-zero exit code (in the shell, the $? variable holds the exit code of the previous command) and keep retrying until you get a successful run, as sketched below.
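
A minimal sketch of such a retry loop, assuming a recent dorado where the resume flag is --resume-from and takes the partial BAM (model and paths are placeholders):

```bash
#!/usr/bin/env bash
# No `set -e` here on purpose: we want to inspect the exit code ourselves.
MODEL=dna_r10.4.1_e8.2_400bps_sup@v4.1.0   # example model name
POD5_DIR=pod5s/

dorado basecaller "$MODEL" "$POD5_DIR" --modified-bases 5mC > calls.bam
status=$?   # 0 = success; 137 usually means killed by the OOM killer

while [ "$status" -ne 0 ]; do
    echo "dorado exited with code $status, resuming..." >&2
    mv calls.bam incomplete.bam                    # keep the partial output
    dorado basecaller "$MODEL" "$POD5_DIR" \
        --modified-bases 5mC --resume-from incomplete.bam > calls.bam
    status=$?
done
```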

This of course is not an ideal solution. As Mike mentioned, we're looking into addressing some memory issues. However, 60GB is not a lot of memory, so it's possible the peak memory is just hitting that number. Are you running single or multi GPU basecalling? Is 60GB the maximum memory your job can request?

tijyojwad commented 1 year ago

Are you also running alignment during basecalling?

jcolicchio-soundag commented 1 year ago

Thanks a ton all!

A lot of good thoughts here. We definitely understand our current cluster config is NOT memory-optimized; it was built for previous pipelines that were not RAM-intensive. That said, we have recently ordered more RAM and will be upgrading the nodes to ~84GB each, which, while not a huge increase, will hopefully alleviate some of these issues.

We are running single-GPU basecalling, with mapping to a reference as well, since our end goal is a bedMethyl file mapped to a genome, created using modbam2bed.

As far as short-term options, we are not using SLURM, so that won't do the trick, but we do think that checking the exit code might be the easiest solution. If this doesn't work, we will divide the data and proceed that way!

Either way, it would be great if Dorado could work on machines with less RAM, or at least emit a warning/standard output that allows the user to detect a failed run in an automated way.

Best, Jack

tijyojwad commented 1 year ago

Hi @jcolicchio-soundag,

We have seen that running alignment during basecalling increases the memory footprint. We will check if the memory during alignment can be capped, but it may take us a while to get to it. If you have some spare cycles, it would be useful to check what the memory footprint is without alignment. It may make sense to do alignment as a downstream step in your pipeline using minimap2 to make dorado runs more stable.
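
If you do try splitting basecalling and alignment, here is a rough sketch of that two-step workflow (example model name; the samtools fastq -T MM,ML / minimap2 -y combination carries the modified-base tags through alignment so modbam2bed can still read them):

```bash
# 1. Basecall with modified bases, no alignment (lower dorado memory footprint).
MODEL=dna_r10.4.1_e8.2_400bps_sup@v4.1.0   # example model name
dorado basecaller "$MODEL" pod5s/ --modified-bases 5mC > calls_unaligned.bam

# 2. Align downstream: export MM/ML tags into the FASTQ header (-T MM,ML),
#    have minimap2 copy them into the SAM output (-y), then sort and index.
samtools fastq -T MM,ML calls_unaligned.bam \
    | minimap2 -ax map-ont -y reference.fasta - \
    | samtools sort -o calls_aligned.bam
samtools index calls_aligned.bam
```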