hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Job is "running" with no allocations. Is it expected? #10359

Open Oloremo opened 3 years ago

Oloremo commented 3 years ago

Nomad version

Nomad v1.0.4 (9294f35f9aa8dbb4acb6e85fa88e3e2534a3e41a)

Operating system and Environment details

CentOS 7

Issue

I started a type=system job and it is reported as running without any allocations. I am still trying to figure out what is wrong with my definition, but I wonder whether this is expected from a presentation/UX point of view.

$ nomad status -verbose -namespace=system vector
ID            = vector
Name          = vector
Submit Date   = 2021-04-12T10:11:07Z
Type          = system
Priority      = 50
Datacenters   = test-run-kproskurin-rm
Namespace     = system
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
vector      0       0         0        0       0         0

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
192438bb-720d-b2b5-167f-cf0127ba745c  50        job-register  complete  false

Allocations
No allocations placed

Reproduction steps

Expected Result

No allocation == not running?..

Actual Result

No allocation == running
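
For anyone who wants to catch this state from a script, here is a minimal sketch of the check. It only works on values already parsed from the status output above. The helper name `is_running_without_allocs` and the dictionary layout are my own assumptions, not anything Nomad provides.

```python
from typing import Dict

# Hypothetical helper, not part of Nomad: flags a job that reports
# "running" while every task-group counter in its summary is zero.
def is_running_without_allocs(status: str, summary: Dict[str, Dict[str, int]]) -> bool:
    total = sum(sum(counts.values()) for counts in summary.values())
    return status == "running" and total == 0

# Values copied by hand from the `nomad status` output above.
summary = {
    "vector": {
        "Queued": 0, "Starting": 0, "Running": 0,
        "Failed": 0, "Complete": 0, "Lost": 0,
    }
}

print(is_running_without_allocs("running", summary))
```

With the numbers from this report the check fires, which is exactly the combination I would expect the UI or CLI to call out.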

Job file

https://gist.github.com/Oloremo/ce78aae9b1957e06db067f97628fa6d4

UPD: Ok so the issue is the following:

If you specify a non-existent datacenter in a system job definition, it will be "created": image (2)

And the job will end up in the state described above: reported as running, with no allocations placed.

So here we observe at least a few issues:

  1. How Nomad handles a job that references a non-existent datacenter (one not defined in any agent configuration). IMO it should fail during the plan.
  2. The job is started and the scheduler is unable to place it, yet there is no error or any other indication of a problem.
  3. All of this seems to apply only to system jobs.
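
As a stopgap for the first point, a pre-submit check along these lines could reject a job whose datacenters are unknown. This is only a sketch under assumptions: `missing_datacenters` is a hypothetical helper, the datacenter names are made up, and how the set of known datacenters is collected (for example from the node list) is left to the operator.

```python
from typing import Iterable, List

# Hypothetical pre-submit check, not a Nomad API: returns the datacenters
# named in a job that no known client node belongs to.
def missing_datacenters(job_dcs: Iterable[str], known_dcs: Iterable[str]) -> List[str]:
    known = set(known_dcs)
    return sorted(dc for dc in set(job_dcs) if dc not in known)

# Made-up names for illustration only.
known = ["dc1", "dc2"]           # assumed to be gathered from the cluster's nodes
job = ["dc1", "dc-with-typo"]    # assumed to be read from the job definition

unknown = missing_datacenters(job, known)
if unknown:
    print("refusing to submit, unknown datacenters: " + ", ".join(unknown))
```

An exact string comparison is used on purpose so that a typo in the datacenter name is caught before the job is registered.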

shoenig commented 3 years ago

Thanks for reporting @Oloremo, indeed this behavior is odd for system jobs, whereas service and batch jobs correctly warn on plan and refuse to be scheduled due to the missing DC. I'm thinking we should have Nomad treat system jobs the same as the others.

eshcheglov commented 1 year ago

The same story here.

Suddenly, my job stopped working. Its status still shows "RUNNING", but in the detailed view I see "0 Running".

And I can't view any logs because I get "20 failed" allocations, but when I click on them, Nomad reports "No allocations have been placed."

I also see "3 Not Scheduled" but have no idea why. Clicking on it redirects me to the "Clients" page without any extra details.

It's a total mess. How can I properly view the logs?

P.S. DC naming is correct


Screenshot from 2023-07-25 13-27-51

Screenshot from 2023-07-25 13-28-02

Screenshot from 2023-07-25 13-27-58