Open vikramsg opened 3 months ago
Hi @vikramsg! I've seen a similar report about OOM'd tasks being marked as complete, and I'm not sure whether the problem is that the tasks are being marked complete incorrectly or that the report that the task was OOM'd is incorrect. Do you have any metrics that suggest the tasks are really being OOM'd, or at least exiting with an error rather than just completing? That would help us dig into what the underlying problem is.
Also, I just wanted to note that versions before 1.6.0 are out of support, so you'll want to upgrade sooner rather than later.
Hi @tgross,
Can you tell me a little more about which metrics you want to see? The error is very flaky and non-repeatable, so I can instrument it to capture issues. Right now I am mostly working with the Nomad REST API, so are there endpoint responses that I can record in the logs?
On the task itself, hopefully the below points answer your questions.
Let me know if that helped.
@vikramsg I'm thinking of out-of-band data like `dmesg` logs that show which process, if any, was actually OOM'd. That is, you're saying "just stopped", but in this situation we suspect we can't trust Nomad's report of why that is (otherwise you wouldn't have reported a bug! :grinning: ). So I'm trying to figure out whether we can verify that it's really an OOM and not a kill for some unrelated reason that's getting misreported as an OOM.
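For the out-of-band check, a minimal sketch of pulling OOM kills out of captured `dmesg` output might look like the following. The function name `find_oom_kills` is hypothetical, and the regular expression assumes the kernel logs a line containing `Killed process <pid> (<name>)`; the exact wording varies across kernel versions, so treat the pattern as a starting point rather than a guarantee.

```python
import re

# Assumed kernel message shape: "... Killed process <pid> (<name>) ...".
# Kernel versions differ in the surrounding text, so only this part is matched.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def find_oom_kills(dmesg_text):
    """Return a list of (pid, process_name) for each OOM kill line found."""
    kills = []
    for line in dmesg_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

Feeding it the saved output of `dmesg` from the client node around the time of the failure gives a list of PIDs and process names that can be compared against what Nomad reported for the task.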
Nomad version
Nomad v1.5.2
Operating system and Environment details
Running on AWS.
Issue
We have various batch jobs running on Nomad, which runs on EC2 instances. We are now connecting Airflow to Nomad, so we don't want Nomad to handle restarts and reschedules; for this to work, we need to know accurately whether a job completed or failed.
This mostly works, but on OOM errors I am seeing Nomad mark the job as complete.
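To record what Nomad reports when this happens, one option is to log a summary of each allocation's task states as returned by the REST API. The sketch below works on an already-parsed allocation JSON object. The helper name `summarize_task_states` is hypothetical, and the field names (`TaskStates`, `State`, `Failed`, `Events`, `Type`, `ExitCode`, and the `oom_killed` key under `Details`) are assumptions based on my understanding of the allocation response, so check them against a real response from your cluster.

```python
def summarize_task_states(alloc):
    """Summarize per-task outcome from a parsed allocation JSON object.

    All field names used here are assumptions about the API response shape.
    """
    summaries = []
    for task_name, state in (alloc.get("TaskStates") or {}).items():
        exit_codes = []
        oom_killed = False
        for event in state.get("Events") or []:
            if event.get("Type") != "Terminated":
                continue
            if event.get("ExitCode") is not None:
                exit_codes.append(event["ExitCode"])
            details = event.get("Details") or {}
            if details.get("oom_killed") == "true":
                oom_killed = True
        failed = bool(state.get("Failed"))
        summaries.append({
            "task": task_name,
            "state": state.get("State"),
            "failed": failed,
            "exit_codes": exit_codes,
            "oom_killed": oom_killed,
            # Flag the case from this issue: not marked failed, yet the
            # events show an OOM kill or a non-zero exit code.
            "suspicious": (not failed)
            and (oom_killed or any(code != 0 for code in exit_codes)),
        })
    return summaries
```

Logging this summary whenever Airflow polls a job would leave a record of the cases where the task is not marked as failed even though its events say otherwise.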
Expected Result
With the `reschedule` and `restart` blocks set to 0, Nomad is still trying to run the job again.
Actual Result
Nomad marks the job as complete and restarts the job.
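For anyone submitting jobs through the REST API, here is a rough sketch of zeroing out both policies on a parsed JSON job before submission. The helper name `disable_retries` is hypothetical, and the keys (`TaskGroups`, `RestartPolicy`, `ReschedulePolicy`, `Attempts`, `Mode`, `Unlimited`) are my assumption of how the `restart` and `reschedule` blocks appear in the JSON job format, so verify them against your own job payload.

```python
import copy


def disable_retries(job):
    """Return a copy of a JSON job dict with restarts and reschedules set to 0.

    Key names are assumptions about the JSON job format.
    """
    patched = copy.deepcopy(job)
    for group in patched.get("TaskGroups") or []:
        # Keep any other settings already present and override only attempts.
        restart = dict(group.get("RestartPolicy") or {})
        restart.update({"Attempts": 0, "Mode": "fail"})
        group["RestartPolicy"] = restart

        reschedule = dict(group.get("ReschedulePolicy") or {})
        reschedule.update({"Attempts": 0, "Unlimited": False})
        group["ReschedulePolicy"] = reschedule
    return patched
```

The original dict is left unchanged, so the patched copy can be diffed against it in the logs before it is sent.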
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)