hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Task getting killed with OOM error is marked as complete #23412

Open vikramsg opened 3 months ago

vikramsg commented 3 months ago

Nomad version

Nomad v1.5.2

Operating system and Environment details

Running on AWS.

Issue

We have various batch jobs running on Nomad, which runs on EC2 instances. We are now connecting Airflow to Nomad, so we don't want Nomad to handle restarts and reschedules; instead, we need to know accurately whether a job completed or failed.

This mostly works, but on OOM errors I am seeing that Nomad marks the job as complete. (Screenshot attached: Screenshot 2024-06-21 at 16 57 18)

Expected Result

  1. If a job fails due to Nomad killing it, it should not be marked as complete.
  2. Alternatively, how do we determine whether it was killed due to OOM?
  3. Also, even though we have the reschedule and restart blocks set to 0 attempts, Nomad still tries to run the job again:
    reschedule {
      attempts  = 0
      unlimited = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }
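For point 2, one way we could check ourselves (a sketch, not an official recipe): read the allocation's task events from the REST API and look for an OOM marker. The `GET /v1/allocation/:alloc_id` endpoint is documented; the assumption here is that the docker driver records the kill in the terminated event's `Details` map under `oom_killed` (worth verifying on your Nomad version):

```python
import json
import urllib.request

NOMAD_ADDR = "http://localhost:4646"  # assumption: default local Nomad API address


def alloc_was_oom_killed(alloc: dict) -> bool:
    """Scan an allocation's task events for an OOM-kill marker.

    Assumes the docker driver records the kill in the Terminated event's
    Details map under "oom_killed" (verify against your Nomad version).
    """
    for task_state in alloc.get("TaskStates", {}).values():
        for event in task_state.get("Events", []):
            if event.get("Details", {}).get("oom_killed") == "true":
                return True
            if "OOM" in event.get("DisplayMessage", ""):
                return True
    return False


def fetch_alloc(alloc_id: str) -> dict:
    # GET /v1/allocation/:alloc_id returns the full allocation,
    # including TaskStates and their event history.
    with urllib.request.urlopen(f"{NOMAD_ADDR}/v1/allocation/{alloc_id}") as resp:
        return json.load(resp)
```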

Actual Result

Nomad marks the job as complete and restarts the job.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

tgross commented 3 months ago

Hi @vikramsg! I've seen a similar report about OOM'd tasks being marked as complete, and I'm not sure whether the problem is that the tasks are being marked complete incorrectly or whether the report that the task was OOM'd is incorrect. Do you have any metrics suggesting the tasks are really being OOM'd, or at least exiting with an error, rather than just completing? That would help us dig into the underlying problem.

Also, I just wanted to note that versions before 1.6.0 are out of support, so you'll want to upgrade sooner rather than later.

vikramsg commented 3 months ago

Hi @tgross,

Can you tell me a little more about what metrics you want to see? The error is very flaky and not reliably reproducible, so I can instrument things to capture it when it happens. Right now I am mostly working with the Nomad REST API; are there endpoint responses that I can record in the logs?

On the task itself, hopefully the below points answer your questions.

  1. The first allocation of the task did not complete; it just stopped, and the REST API reported it as complete.
  2. So Airflow, which polls the REST API, thinks the job succeeded.
  3. However, Nomad seems to have created a second allocation even though I set restart and reschedule attempts to 0.
  4. But since I had already received the completion response, I did not wait and poll for this new allocation to complete.

Let me know if that helped.
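For reference, our polling is roughly the following sketch (simplified; `NOMAD_ADDR` is a placeholder and the failure heuristic is our own, not an official one). The idea is to list every allocation for the job via the documented `GET /v1/job/:job_id/allocations` endpoint and treat any failed task or non-zero exit code as a failure, rather than trusting the first allocation's client status alone:

```python
import json
import urllib.request

NOMAD_ADDR = "http://localhost:4646"  # placeholder address


def job_failed(allocs: list) -> bool:
    """Treat the job as failed if any allocation's task was marked failed
    or exited non-zero, even when ClientStatus says "complete".
    """
    for alloc in allocs:
        if alloc.get("ClientStatus") == "failed":
            return True
        for state in alloc.get("TaskStates", {}).values():
            if state.get("Failed"):
                return True
            for event in state.get("Events", []):
                if (event.get("Type") == "Terminated"
                        and event.get("Details", {}).get("exit_code", "0") != "0"):
                    return True
    return False


def fetch_job_allocs(job_id: str) -> list:
    # GET /v1/job/:job_id/allocations lists every allocation for the job,
    # including any reschedules we did not expect.
    with urllib.request.urlopen(f"{NOMAD_ADDR}/v1/job/{job_id}/allocations") as resp:
        return json.load(resp)
```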

tgross commented 2 months ago

@vikramsg I'm thinking of out-of-band data like dmesg logs that show which process, if any, was actually OOM'd. That is, you're saying it "just stopped", but in this situation we suspect we can't trust Nomad's report of why that happened (otherwise you wouldn't have reported a bug! :grinning:). So I'm trying to figure out whether we can verify that it's really an OOM and not a kill for some unrelated reason that's getting misreported as an OOM.
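As a cross-check on the host, the kernel log records OOM kills with lines like `Out of memory: Killed process <pid> (<name>)` (older kernels say `Kill process`). A small sketch that scans `dmesg` output for such lines; the exact message format can vary by kernel version, so the regex is an assumption to adapt as needed:

```python
import re
import subprocess

# Matches both the newer "Killed process" and older "Kill process" wording.
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \((?P<name>[^)]+)\)")


def find_oom_kills(kernel_log: str) -> list:
    """Return the names of processes the kernel OOM killer terminated."""
    return [m.group("name") for m in OOM_RE.finditer(kernel_log)]


def host_oom_kills() -> list:
    # Reads the kernel ring buffer; may require elevated privileges.
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return find_oom_kills(log)
```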