ahwanpandey opened 2 weeks ago
Hmm. The error message below is suspicious. I would guess it's a problem related to general memory management on your jobs/cluster.
slurmstepd: error: Detected 167 oom_kill events in StepId=19079835.batch. Some of the step tasks have been OOM Killed.
Hi @teng-gao . Thanks for the reply. I will try running one sample with just 1 thread/core and see what that looks like. Is there anything in particular you think I could ask the cluster folks about re: their memory management? Numbat is the only tool I've used where I've seen this type of error on our cluster.
Thanks, Ahwan
To be clear, I've had segfaults and out-of-memory issues before that were fixed by requesting more memory, but this seems different. Also, the memory utilised reported in the job status is far above the requested amount for all the jobs:
Job ID: 19079833
Cluster: rosalind
User/Group: apandey@petermac.org.au/apandey
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 1-11:19:41
CPU Efficiency: 53.46% of 2-18:04:48 core-walltime
Job Wall-clock time: 04:07:48
Memory Utilized: 470.93 GB
Memory Efficiency: 470.93% of 100.00 GB
Below are some jobs with their States and Memory Utilized. Strangely, one of them says State: COMPLETED and its stderr has none of the errors mentioned above, even though it used far more memory than I asked for.
State: OUT_OF_MEMORY (exit code 0) Memory Utilized: 692.82 GB Memory Efficiency: 692.82% of 100.00 GB
State: OUT_OF_MEMORY (exit code 0) Memory Utilized: 724.67 GB Memory Efficiency: 724.67% of 100.00 GB
State: OUT_OF_MEMORY (exit code 0) Memory Utilized: 433.26 GB Memory Efficiency: 433.26% of 100.00 GB
State: OUT_OF_MEMORY (exit code 0) Memory Utilized: 488.84 GB Memory Efficiency: 488.84% of 100.00 GB
State: OUT_OF_MEMORY (exit code 0) Memory Utilized: 594.87 GB Memory Efficiency: 594.87% of 100.00 GB
State: COMPLETED (exit code 0) Memory Utilized: 301.44 GB Memory Efficiency: 301.44% of 100.00 GB
State: OUT_OF_MEMORY (exit code 0) Memory Utilized: 561.10 GB Memory Efficiency: 561.10% of 100.00 GB
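As a side note, summary lines like these can be screened mechanically. A minimal sketch (assuming the seff-style field layout shown above; the `flag_overmem` name is mine) that flags any job whose reported memory efficiency exceeds 100%, even when the State says COMPLETED:

```shell
# flag_overmem: read seff-style summary lines on stdin and print any line
# whose "Memory Efficiency" percentage exceeds 100, regardless of State.
flag_overmem() {
  awk '/Memory Efficiency:/ {
    pct = $0
    sub(/.*Memory Efficiency: */, "", pct)  # drop everything before the number
    sub(/%.*/, "", pct)                     # drop the "% of ..." tail
    if (pct + 0 > 100) print
  }'
}

# prints only the first line (301.44% > 100), not the second (32.64%)
printf '%s\n' \
  'State: COMPLETED (exit code 0) Memory Efficiency: 301.44% of 100.00 GB' \
  'State: COMPLETED (exit code 0) Memory Efficiency: 32.64% of 100.00 GB' \
  | flag_overmem
```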
OK, running with just one thread has no issues. Note that I am just using the default "run_numbat" with "ref_hca" as the reference, but I will also try a custom reference. The results differ vastly from the multi-threaded runs, which probably makes sense given that many of the threads were killed internally by SLURM.
Do you think Numbat could benefit from some form of error handling for these multi-threaded memory issues, so the run doesn't look like it completed? I was initially testing in an interactive session and didn't realise this was happening in the background. There was no hint in the R terminal about memory issues or threads being killed; the output folder just looks like everything completed without issues, until I submitted the script as a job and checked the stderr. I'm not saying this is happening to others, but it's possible some users have this happening without their knowledge. Hence the suggestion that Numbat notify the user or error out.
But again, I might be totally wrong and this could just be a very specific issue with the cluster I am using! I'll talk to the cluster admins about it, but I'd love to hear if you have any specific thoughts on where they could start.
Job ID: 19083767
Cluster: rosalind
User/Group: apandey@petermac.org.au/apandey
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 14:47:00
CPU Efficiency: 99.99% of 14:47:06 core-walltime
Job Wall-clock time: 14:47:06
Memory Utilized: 32.64 GB
Memory Efficiency: 32.64% of 100.00 GB
bulk_clones_final.png log.txt Numbat.AOCS_080_2_2.Step2_run_numbat.19083767.papr-res-compute215.err.txt
Job ID: 19079838
Cluster: rosalind
User/Group: apandey@petermac.org.au/apandey
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 11:45:45
CPU Efficiency: 43.14% of 1-03:16:00 core-walltime
Job Wall-clock time: 01:42:15
Memory Utilized: 366.78 GB
Memory Efficiency: 366.78% of 100.00 GB
bulk_clones_final.png log.txt Numbat.AOCS_080_2_2.Step2_run_numbat.19079838.papr-res-compute215.err.txt
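A back-of-the-envelope check of the two runs above: if each worker holds roughly its own copy of the data, peak memory should scale ~linearly with thread count. The figures below come from the job summaries; the linear-scaling assumption is mine.

```shell
# rough linear-scaling estimate: 1-thread peak (~33 GB, rounded up from the
# 32.64 GB single-core run above) multiplied by 16 workers
single_thread_gb=33
threads=16
est_peak_gb=$((single_thread_gb * threads))
echo "estimated 16-thread peak: ${est_peak_gb} GB"   # 528 GB
```

528 GB sits inside the 366-725 GB range the 16-core jobs actually reported, which is at least consistent with per-worker duplication of the data.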
I had a chat with our cluster admin and just wanted to share some thoughts with you.
It seems Numbat has the following memory assumptions:
Am I understanding this right?
It seems to be a similar thing to the one being described below: https://github.com/MonashBioinformaticsPlatform/RNAsik-pipe/issues/39
So if the above is correct, do you think Numbat should stop the run if any thread fails, and exit with an overall error code? Something like a consensus exit code: 0 if all threads succeeded, non-zero otherwise. And could the log file also indicate that an error occurred during the run?
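For what it's worth, the consensus idea is straightforward at the process level. A generic shell illustration, not Numbat code; `worker` here is a hypothetical stand-in for one parallel task:

```shell
# aggregate a "consensus" exit status over background workers:
# 0 only if every worker exited 0, otherwise non-zero
worker() { return "$1"; }    # hypothetical stand-in for one parallel task

pids="" consensus=0
for code in 0 0 1; do        # pretend the third worker was OOM-killed
  worker "$code" & pids="$pids $!"
done
for pid in $pids; do
  wait "$pid" || consensus=1 # remember any non-zero worker exit
done
echo "consensus exit code: $consensus"
```

The driver would then `exit "$consensus"`, so the scheduler sees a non-zero exit whenever any worker died, instead of the exit code 0 reported above.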
Thanks! Ahwan
OK, so running with 4 threads and allocating 160 GB let the Numbat jobs finish successfully; I checked the stderr for each and there were no memory issues. Note that the SLURM config on our cluster is set up to allow a job to go slightly over its memory request, depending on the requests/usage of other jobs on the node. Using 16 threads goes way over that allowance and starts killing threads, as described in the original issue.
State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 210.72 GB
Memory Efficiency: 131.70% of 160.00 GB
State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 203.70 GB
Memory Efficiency: 127.31% of 160.00 GB
State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 170.53 GB
Memory Efficiency: 106.58% of 160.00 GB
State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 181.04 GB
Memory Efficiency: 113.15% of 160.00 GB
State: COMPLETED (exit code 0)
Cores per node: 4
Memory Utilized: 159.93 GB
Memory Efficiency: 99.96% of 160.00 GB
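Given those numbers, a submission along these lines is one conservative option (a sketch only: the script name and wrapper are hypothetical, and the resource figures come from the runs above):

```shell
#!/bin/bash
#SBATCH --job-name=numbat_step2
#SBATCH --cpus-per-task=4   # 16 threads blew past the request; 4 peaked at ~160-210 GB
#SBATCH --mem=240G          # headroom above the ~210 GB peak observed at 4 threads
#SBATCH --time=24:00:00

# run_numbat.R is a hypothetical wrapper that calls run_numbat(..., ncores = 4)
Rscript run_numbat.R
```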
Hello,
Thanks for this tool.
I submitted some "run_numbat" jobs to our cluster. They appear to have produced all the result files; the plots and data files are all there.
But the stderr of the job output has a bunch of errors. The job State says "OUT_OF_MEMORY", yet the exit code is 0, meaning it was treated as successful. Also, the Memory Utilized is 470.93 GB, far above the 100 GB requested.
I've attached the log and the std err as follows
log.txt Numbat.AOCS_055_2_0.Step2_run_numbat.19079833.papr-res-compute01.err.txt
Here is my R sessionInfo()
This happens with all the samples I have run so far (about 20). I am just attaching the output of one sample as a reference. The samples have anywhere from 6000 - 22000 cells. For example here is another sample's std err and log:
log.txt Numbat.AOCS_060_2_9.Step2_run_numbat.19079835.papr-res-compute02.err.txt
Not sure if all of this is normal behaviour of the tool or something is wrong?
Thanks so much, Ahwan