**empikls** opened this issue 2 years ago (status: Open)
Hi @empikls! The evaluation data should have more details. Right now we don't have a nice and clean single-eval inspect, but if you use:

```shell
nomad eval list -json | jq '.[] | select(.ID | startswith("11fb88cc"))'
```

that should give you a JSON blob with a bunch more data about where the scheduler is doing something you might not expect.
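For illustration, here is the same filter run against a self-contained sample (the IDs and values are made up; the field shapes mirror the `nomad eval list -json` output shown later in this thread):

```shell
# Made-up sample input; in practice the JSON comes from `nomad eval list -json`.
echo '[{"ID":"11fb88cc-1111","FailedTGAllocs":{"web":{"NodesFiltered":3}}},
       {"ID":"deadbeef-2222","FailedTGAllocs":null}]' |
  jq '.[] | select(.ID | startswith("11fb88cc")) | .FailedTGAllocs'
# prints the FailedTGAllocs object of the matching eval
```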
That being said, I'm looking at this and have a few questions:
```hcl
datacenters = ["euc1", "fsn1"]

constraint {
  attribute = "${node.class}"
  operator  = "regexp"
  value     = "(cloud-cpu-worker|object-detection)"
}

constraint {
  operator  = "distinct_property"
  attribute = "${node.datacenter}"
  value     = "1"
}
```
Is this a `system` job? Having a `distinct_property` of one-per-DC seems to contradict having it as a system job, which is intended to place on every eligible node in the DC.

(Aside: I hope you don't mind, but I edited your question to use triple backticks for the code blocks and shell output; that'll make them easier to read.)
@tgross Yes, the job version is 0, and I didn't specify the version before (when I used nomad 1.0.1).
New output from gitlab:
```
$ for job in $(ls jobs/*); do echo "Deploying $job"; nomad job run $job; done
Deploying jobs/autoscaler.nomad
==> Monitoring evaluation "45d70a55"
    Evaluation triggered by job "autoscaler"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "45d70a55" finished with status "complete" but failed to place all allocations:
    Task Group "autoscaler" (failed to place 1 allocation):
      * Class "cache": 1 nodes excluded by filter
      * Class "cloud-cpu-worker": 1 nodes excluded by filter
      * Class "stream-processing": 1 nodes excluded by filter
      * Class "cloud-cache": 1 nodes excluded by filter
      * Constraint "${node.class} regexp (cloud-cpu-worker|object-detection)": 3 nodes excluded by filter
      * Constraint "distinct_property: ${node.datacenter}=euc1 used by 1 allocs": 1 nodes excluded by filter
Cleaning up file based variables
ERROR: Job failed: exit code 1
```
Here is the JSON blob with a bunch more details:

```json
  "ID": "45d70a55-e5b2-5e77-2f95-4c4cd143111e",
  "JobID": "autoscaler",
  "JobModifyIndex": 952215,
  "ModifyIndex": 962495,
  "ModifyTime": 1644256453590096000,
  "Namespace": "default",
  "NextEval": "",
  "NodeID": "",
  "NodeModifyIndex": 0,
  "PreviousEval": "",
  "Priority": 50,
  "QueuedAllocations": {
    "autoscaler": 0
  },
  "QuotaLimitReached": "",
  "SnapshotIndex": 962494,
  "Status": "complete",
  "StatusDescription": "",
  "TriggeredBy": "job-register",
  "Type": "system",
  "Wait": 0,
  "WaitUntil": null
}
```
```
nomad eval status 45d70a55
ID                 = 45d70a55
Create Time        = 23m59s ago
Modify Time        = 23m59s ago
Status             = complete
Status Description = complete
Type               = system
TriggeredBy        = job-register
Job ID             = autoscaler
Priority           = 50
Placement Failures = true

Failed Placements
Task Group "autoscaler" (failed to place 1 allocation):
  * Class "cache": 1 nodes excluded by filter
  * Class "cloud-cpu-worker": 1 nodes excluded by filter
  * Class "stream-processing": 1 nodes excluded by filter
  * Class "cloud-cache": 1 nodes excluded by filter
  * Constraint "${node.class} regexp (cloud-cpu-worker|object-detection)": 3 nodes excluded by filter
  * Constraint "distinct_property: ${node.datacenter}=euc1 used by 1 allocs": 1 nodes excluded by filter
```
Answering your questions:

1) I now have 2 allocs of the autoscaler job, with version 0 and eval ID 5c0c0297-2306-7642-fa5f-8378e7872cd1. Here is the JSON blob:
```json
{
  "AnnotatePlan": false,
  "BlockedEval": "",
  "ClassEligibility": null,
  "CreateIndex": 952215,
  "CreateTime": 1644242162351499800,
  "DeploymentID": "",
  "EscapedComputedClass": false,
  "FailedTGAllocs": null,
  "ID": "5c0c0297-2306-7642-fa5f-8378e7872cd1",
  "JobID": "autoscaler",
  "JobModifyIndex": 952215,
  "ModifyIndex": 952217,
  "ModifyTime": 1644242162370615000,
  "Namespace": "default",
  "NextEval": "",
  "NodeID": "",
  "NodeModifyIndex": 0,
  "PreviousEval": "",
  "Priority": 50,
  "QueuedAllocations": {
    "autoscaler": 0
  },
  "QuotaLimitReached": "",
  "SnapshotIndex": 952215,
  "Status": "complete",
  "StatusDescription": "",
  "TriggeredBy": "job-register",
  "Type": "system",
  "Wait": 0,
  "WaitUntil": null
}
```
An example of the alloc status of one of the allocs of the autoscaler job:
```
nomad alloc status -verbose 8a761570
ID                  = 8a761570-0895-cfb5-f44c-88c35702eff5
Eval ID             = 5c0c0297-2306-7642-fa5f-8378e7872cd1
Name                = autoscaler.autoscaler[0]
Node ID             = 58b148c7-4c48-5d62-211e-04e301a2e762
Node Name           = ip
Job ID              = autoscaler
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 2022-02-07T15:56:02+02:00
Modified            = 2022-02-07T15:56:04+02:00
Evaluated Nodes     = 1
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 63.591µs
Failures            = 0

Allocation Addresses (mode = "bridge")
Label     Dynamic  Address
*http     yes
*sidecar  yes
*stats    yes

Task "connect-proxy-autoscaler" (prestart sidecar) is "running"
Task Resources
CPU        Memory          Disk     Addresses
2/250 MHz  15 MiB/128 MiB  300 MiB

Task Events:
Started At     = 2022-02-07T13:56:03Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2022-02-07T15:56:03+02:00  Started     Task started by client
2022-02-07T15:56:03+02:00  Task Setup  Building Task Directory
2022-02-07T15:56:02+02:00  Received    Task received by client

Task "autoscaler" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  7.6 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2022-02-07T13:56:04Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2022-02-07T15:56:04+02:00  Started     Task started by client
2022-02-07T15:56:03+02:00  Task Setup  Building Task Directory
2022-02-07T15:56:02+02:00  Received    Task received by client

Placement Metrics
Node                                  binpack  final score
58b148c7-4c48-5d62-211e-04e301a2e762  0.915    0.915
```
2) Yes, it's a system job.
P.S. I appreciate you editing the question, as the formatting didn't work for me initially.
Hi @empikls! I was able to reproduce this and it looks like it has something to do with adding and removing constraints. Suppose I have a jobspec like the following:
When I run this it works with no problem:
```
$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
043f7c0a  64862576  web         0        run      running  11s ago  11s ago
56fe29d8  689f56f1  web         0        run      running  11s ago  11s ago
```
But if I add this constraint, which all nodes should accept because my whole cluster is in this node class:
```hcl
constraint {
  attribute = "${node.class}"
  operator  = "="
  value     = "vagrant"
}
```
Then I get an error very similar to what you saw:
```
$ nomad job run ./jobs/system-per-dc.nomad
==> 2022-02-08T16:17:44-05:00: Monitoring evaluation "7d4d950e"
    2022-02-08T16:17:44-05:00: Evaluation triggered by job "example"
==> 2022-02-08T16:17:45-05:00: Monitoring evaluation "7d4d950e"
    2022-02-08T16:17:45-05:00: Allocation "56fe29d8" modified: node "689f56f1", group "web"
    2022-02-08T16:17:45-05:00: Allocation "043f7c0a" modified: node "64862576", group "web"
    2022-02-08T16:17:45-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-02-08T16:17:45-05:00: Evaluation "7d4d950e" finished with status "complete" but failed to place all allocations:
    2022-02-08T16:17:45-05:00: Task Group "web" (failed to place 1 allocation):
      * Class "vagrant": 1 nodes excluded by filter
      * Constraint "distinct_property: ${node.datacenter}=dc1 used by 2 allocs": 1 nodes excluded by filter
```
What's even stranger is that if I then remove the constraint again, I get the same error message! That persists until I change the resources or something else in the job.
This also happens in reverse. If I purge the job and start over with the constraint, it works fine. But then if I remove the constraint it fails in the same way.
Something to note for a workaround: if you only want 1 per DC, you could also run this as a `service` job. If I run the job with `count` == the number of DCs, the problem goes away. I can add and remove the node class constraint and it seems to work just fine every time.
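A minimal sketch of that workaround (a hypothetical jobspec with assumed names; task details omitted, adapt to your own job):

```hcl
job "autoscaler" {
  datacenters = ["euc1", "fsn1"]
  type        = "service"

  # one instance per datacenter, enforced by the constraint
  constraint {
    operator  = "distinct_property"
    attribute = "${node.datacenter}"
    value     = "1"
  }

  group "autoscaler" {
    count = 2 # == number of datacenters

    # ... tasks as in the original system job ...
  }
}
```

The service scheduler only tries to place `count` allocations, so there is no per-node expectation left over to trip the `distinct_property` filter.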
I've run into a similar issue on `spread` recently, so it may turn out this is related. I'll mark this issue for further investigation. But in the meanwhile, would the workaround of using a `service` job with `count = 2` work for you?
Hi @tgross! I did some research and found that when I changed the job type from `system` to `service`, there were no issues with the deployment, but when I switched back, the problem appeared again. I would like to leave everything as it is, and the option with `count = 2` is not suitable for me, since there are cases when the constraint cannot be satisfied, and the GitLab job must still continue to deploy the next job in the list. Example:
```hcl
variable "resources" {
  type = map(map(number))
  default = {
    kafka = {
      memory = 2048
      cpu    = 2000
    }
  }
}

job "kafka" {
  datacenters = ["euc1", "fsn1"]
  type        = "system"

  constraint {
    attribute = "${node.class}"
    operator  = "regexp"
    value     = "(cloud-)?kafka"
  }

  group "kafka" {
    network {
      mode = "host"
      port "public" {
        to           = 9093
        static       = 9093
        host_network = "public"
      }
      port "private" {
        to           = 9094
        static       = 9094
        host_network = "private"
      }
      port "exporter" {
        to           = 8080
        static       = 8080
        host_network = "private"
      }
    }

    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    ...........
```
A node class like `stage-fsn1-kafka-2` can only exist in the prod and stage environments.
```
ID              = 8523c7ce-86d8-6756-714e-482593b29ef6
Name            = stage-fsn1-kafka-0
Class           = kafka
DC              = fsn1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 2139h14m55s
Host Volumes    = consul_socket,kafka_data,zookeeper_data,zookeeper_datalog
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec

Node Events
Time                  Subsystem  Message
2022-02-04T13:05:47Z  Cluster    Node reregistered by heartbeat
2022-02-04T13:05:20Z  Cluster    Node heartbeat missed
2021-11-22T15:01:28Z  Cluster    Node reregistered by heartbeat
2021-11-22T15:00:44Z  Cluster    Node heartbeat missed
2021-11-17T18:34:11Z  Cluster    Node reregistered by heartbeat
2021-11-17T18:32:15Z  Cluster    Node heartbeat missed
2021-11-17T18:31:27Z  Cluster    Node reregistered by heartbeat
2021-11-17T18:31:27Z  Cluster    Node heartbeat missed
2021-11-12T10:16:23Z  Cluster    Node re-registered
2021-11-12T10:16:17Z  Cluster    Node heartbeat missed

Allocated Resources
CPU            Memory           Disk
3250/4588 MHz  3.6 GiB/3.6 GiB  900 MiB/32 GiB

Allocation Resource Utilization
CPU          Memory
44/4588 MHz  1.0 GiB/3.6 GiB

Host Resource Utilization
CPU           Memory           Disk
187/4588 MHz  1.4 GiB/3.6 GiB  3.7 GiB/38 GiB

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created    Modified
3998f938  8523c7ce  zookeeper        3        run      running  5d25m ago  5d25m ago
1de0689f  8523c7ce  kafka            1        run      running  5d25m ago  1d6h ago
1baf6218  8523c7ce  consul_exporter  0        run      running  5d25m ago  5d25m ago
```
And when I try to deploy the jobs through GitLab, I get an error like this (it doesn't matter whether it's a `system` or `service` job type; the error may differ slightly):
```
==> Evaluation "c3b5c9b7" finished with status "complete" but failed to place all allocations:
    Task Group "kafka" (failed to place 1 allocation):
      * Class "cache": 1 nodes excluded by filter
      * Class "stream-processing": 1 nodes excluded by filter
      * Class "object-detection": 1 nodes excluded by filter
      * Class "cloud-cache": 1 nodes excluded by filter
      * Class "cloud-cpu-worker": 2 nodes excluded by filter
      * Constraint "${node.class} regexp (cloud-)?kafka": 6 nodes excluded by filter
```
So I can't complete the GitLab job of deploying all the Nomad jobs (autoscaler, kafka, redis, etc.) that follow the failed one. Aside: when I was using Nomad version 1.0.1, I saw something like this in the qa-1 env:
```
ID            = kafka
Name          = kafka
Submit Date   = 2021-11-12T13:15:40+02:00
Type          = system
Priority      = 50
Datacenters   = euc1,fsn1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
kafka       0       0         0       0       0         0

Allocations
No allocations placed
```
and something like this in the stage env:
```
ID            = kafka
Name          = kafka
Submit Date   = 2021-11-18T13:36:18+02:00
Type          = system
Priority      = 50
Datacenters   = euc1,fsn1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
kafka       0       0         3        1       5         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1de0689f  8523c7ce  kafka       1        run      running  5d16m ago  1d6h ago
2d5f1fc1  4a044431  kafka       1        run      running  28d3h ago  1d6h ago
cb308c2b  f91f9b69  kafka       1        run      running  2mo8d ago  1d6h ago
```
I believe we are running into the same problem. We have system jobs with extra constraints, and when we run the job from our pipeline we get an error if there are no changes in the job file. Of course, we should not need to submit the exact same job twice, but I also think that Nomad should not give errors if we do. I have copied the jobspec that @tgross created on Feb 8 (here: https://github.com/hashicorp/nomad/issues/12016#issuecomment-1033082994) and saved it to a file called busybox.job. I needed to edit the datacenters, as we don't have a dc1 or dc2, and then sent this twice to our Nomad cluster. You can see the output below. The first time everything is OK; the second time we get an error about constraints and the nomad run command exits with error code 2:
```
nomad run busybox.job
==> 2022-02-24T16:31:56+01:00: Monitoring evaluation "4e579128"
    2022-02-24T16:31:56+01:00: Evaluation triggered by job "example"
==> 2022-02-24T16:31:57+01:00: Monitoring evaluation "4e579128"
    2022-02-24T16:31:57+01:00: Allocation "4487a627" created: node "44ef7387", group "web"
    2022-02-24T16:31:57+01:00: Evaluation status changed: "pending" -> "complete"
==> 2022-02-24T16:31:57+01:00: Evaluation "4e579128" finished with status "complete"

nomad run busybox.job
==> 2022-02-24T16:32:18+01:00: Monitoring evaluation "6c224097"
    2022-02-24T16:32:18+01:00: Evaluation triggered by job "example"
    2022-02-24T16:32:18+01:00: Evaluation status changed: "pending" -> "complete"
==> 2022-02-24T16:32:18+01:00: Evaluation "6c224097" finished with status "complete" but failed to place all allocations:
    2022-02-24T16:32:18+01:00: Task Group "web" (failed to place 1 allocation):
      * Constraint "distinct_property: ${node.datacenter}=dcf used by 1 allocs": 24 nodes excluded by filter
      * Constraint "missing drivers": 10 nodes excluded by filter

echo $?
2
```
This is the output of `nomad eval status` for the second run:
```
nomad eval status 6c224097
ID                 = 6c224097
Create Time        = 22s ago
Modify Time        = 22s ago
Status             = complete
Status Description = complete
Type               = system
TriggeredBy        = job-register
Job ID             = example
Priority           = 50
Placement Failures = true

Failed Placements
Task Group "web" (failed to place 1 allocation):
  * Constraint "distinct_property: ${node.datacenter}=dcf used by 1 allocs": 24 nodes excluded by filter
  * Constraint "missing drivers": 10 nodes excluded by filter
```
And this is part of the output from `nomad eval list -json`:
```
nomad eval list -json
  {
    "AnnotatePlan": false,
    "BlockedEval": "",
    "ClassEligibility": null,
    "CreateIndex": 3247051,
    "CreateTime": 1645716738898975777,
    "DeploymentID": "",
    "EscapedComputedClass": false,
    "FailedTGAllocs": {
      "web": {
        "AllocationTime": 240496,
        "ClassExhausted": null,
        "ClassFiltered": {},
        "CoalescedFailures": 0,
        "ConstraintFiltered": {
          "distinct_property: ${node.datacenter}=dcf used by 1 allocs": 24,
          "missing drivers": 10
        },
        "DimensionExhausted": null,
        "NodesAvailable": null,
        "NodesEvaluated": 34,
        "NodesExhausted": 0,
        "NodesFiltered": 34,
        "QuotaExhausted": null,
        "ResourcesExhausted": null,
        "ScoreMetaData": null,
        "Scores": null
      }
    },
    "ID": "6c224097-6ddb-1055-6767-ea0432934c8b",
    "JobID": "example",
    "JobModifyIndex": 3247045,
    "ModifyIndex": 3247052,
    "ModifyTime": 1645716738903368401,
    "Namespace": "default",
    "NextEval": "",
    "NodeID": "",
    "NodeModifyIndex": 0,
    "PreviousEval": "",
    "Priority": 50,
    "QueuedAllocations": {
      "web": 0
    },
    "QuotaLimitReached": "",
    "SnapshotIndex": 3247051,
    "Status": "complete",
    "StatusDescription": "",
    "TriggeredBy": "job-register",
    "Type": "system",
    "Wait": 0,
    "WaitUntil": null
  },
  {
    "AnnotatePlan": false,
    "BlockedEval": "",
    "ClassEligibility": null,
    "CreateIndex": 3247045,
    "CreateTime": 1645716716359085223,
    "DeploymentID": "",
    "EscapedComputedClass": false,
    "FailedTGAllocs": null,
    "ID": "4e579128-c83b-773e-6888-d45ffd7b4e09",
    "JobID": "example",
    "JobModifyIndex": 3247045,
    "ModifyIndex": 3247047,
    "ModifyTime": 1645716716712492756,
    "Namespace": "default",
    "NextEval": "",
    "NodeID": "",
    "NodeModifyIndex": 0,
    "PreviousEval": "",
    "Priority": 50,
    "QueuedAllocations": {
      "web": 0
    },
    "QuotaLimitReached": "",
    "SnapshotIndex": 3247045,
    "Status": "complete",
    "StatusDescription": "",
    "TriggeredBy": "job-register",
    "Type": "system",
    "Wait": 0,
    "WaitUntil": null
  }
```
I have tried to set up a small testing environment to reproduce this error, but I'm not yet able to get this message. If I send the same job twice to this Nomad testing cluster, Nomad just accepts the job and does nothing, which is to be expected since it's the same job. I'm not sure what the difference could be between this testing cluster and our other cluster that gives the error. We are running Nomad 1.2.4 on our servers and clients.
I don't know what more information to add, but the problem is still there. If there are no problems in a small test environment, that doesn't mean there are no problems. I believe changes introduced in version 1.2.4 (or an earlier version) are the reason, because I had this pipeline running stably on version 1.1.0.
> I've run into a similar issue on `spread` recently, so it may turn out this is related. I'll mark this issue for further investigation. But in the meanwhile, would the workaround of using a `service` job with `count = 2` work for you?
Consider the recommended use case for Ceph CSI nodes as a system job.

I appreciate the strictness of the system job enforcing placement on every node, but is it intentional that it's so heavy-handed that it ignores our explicit `constraint`, which is allowed on system job specs? If so, is there a use case at all for allowing the `constraint` stanza on system jobs? Anything other than `1==1` becomes useless, because the system job is so strict that anything making the allocation count < node count throws the same confusing message:
```
    2022-03-18T08:36:19-04:00: Evaluation status changed: "pending" -> "complete"
==> 2022-03-18T08:36:19-04:00: Evaluation "b9f5b83a" finished with status "complete" but failed to place all allocations:
    2022-03-18T08:36:19-04:00: Task Group "cephrbd" (failed to place 1 allocation):
      * Constraint "${attr.unique.hostname} != hc4-A": 1 nodes excluded by filter
```
When looking within the UI it becomes more clear; however, the CLI calls it lost:
```
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cephrbd     0       0         6        18      23        1
```
While I prefer the CLI over the UI, the former leaves me confused; we all seem just as Lost with these "run something everywhere except here" type jobs. One thought I had: I don't really care if it's service or system, but if I could use service with `count = -1`, where `-1` means all nodes, then I could keep my constraints; I'd have to add distinct hosts to achieve the "spread" effect to every node along with my desired "except this host" use case.
Noting that https://github.com/hashicorp/nomad/issues/12366 seems to be related, or at least may provide a clue that we're mutating the job definition unexpectedly.
Also noting that https://github.com/hashicorp/nomad/issues/12748 seems like it could be related.
After revisiting this issue, we were able to pinpoint it:

Description: When a system job is given a `distinct_property` constraint that excludes any of the nodes, every time the job is updated, the update will run correctly but it will return:
```
    2023-04-28T12:12:17+02:00: Evaluation status changed: "pending" -> "complete"
==> 2023-04-28T12:12:17+02:00: Evaluation "ee32c81f" finished with status "complete" but failed to place all allocations:
    2023-04-28T12:12:17+02:00: Task Group "cache" (failed to place 1 allocation):
      * Constraint "distinct_property: ${node.datacenter}=dc1 used by 2 allocs": 1 nodes excluded by filter
```
Giving the false impression the update didn’t happen.
Root Cause: Because this is a system job, every task group is required to be running on each node. When checking for the differences between what is running and what should be running, the scheduler iterates over all ready nodes, both with and without running allocations, and for each it will:

- Update any running allocation
- Verify there is at least one running allocation for each task group defined in the job

Since some of the nodes are filtered and don't have running allocations for that job in particular, they will all be marked as missing allocations. Then, when computing the placement of the missing allocations, the iterator will apply the filters, including the `distinct_property` one, and won't find any good node to place the "missing" allocation, throwing the error present in the output.
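The mechanism described above can be sketched as a toy simulation (hypothetical names and a deliberately simplified model, not Nomad's actual code): the system scheduler expects one allocation per ready node, so constraint-filtered nodes show up as "missing" and then fail placement, even though the job is fully deployed.

```python
# Toy model of the root cause described above; hypothetical names,
# deliberately simplified -- not Nomad's actual implementation.

def diff_system_job(ready_nodes, nodes_with_allocs):
    """A system job expects one alloc per ready node, so every node
    without a running alloc is reported as 'missing'."""
    return [n for n in ready_nodes if n not in nodes_with_allocs]

def place(missing, feasible):
    """Placing the 'missing' allocs applies the constraint filter;
    rejected nodes surface as placement failures."""
    placed = [n for n in missing if feasible(n)]
    failures = [n for n in missing if not feasible(n)]
    return placed, failures

# Two ready nodes in one DC; the distinct_property constraint is
# already satisfied by the alloc on node-a, so node-b is infeasible.
ready = ["node-a", "node-b"]
has_alloc = {"node-a"}
feasible = lambda n: n == "node-a"  # distinct_property filter

missing = diff_system_job(ready, has_alloc)  # -> ["node-b"]
placed, failures = place(missing, feasible)
print(failures)  # -> ['node-b']: reported as a failed placement
                 # even though nothing actually needed to change
```

In this model, the first possible solution listed below amounts to running `missing` through the same feasibility check before reporting, so never-eligible nodes don't surface as failures.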
Possible solutions:

- After the diff is done and all nodes are iterated over, run a new verification to exclude the "missing" allocations that would break the constraints, here.
- Filter out the excluded nodes before iterating and finding the differences, here.
- Check whether the constraint that caused the filtered-out nodes is `distinct_property` and ignore the error, though I'm not sure what side effects that could have.
Nomad version 1.5.1
Nomad job example
```hcl
job "example" {
  type = "system"

  constraint {
    operator  = "distinct_property"
    attribute = "${node.datacenter}"
    value     = "1"
  }

  group "cache" {
    count = 1

    network {
      port "db" {
        to = 6379
      }
    }

    service {
      name     = "cache"
      port     = "db"
      provider = "nomad"
    }

    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      identity {
        env  = true
        file = true
      }
    }
  }
}
```
Running on a cluster with 3 servers and 2 clients.
Nomad version
1.2.5
Nomad job example
.gitlab-ci.yml
Issue
When trying to deploy a job via GitLab, the GitLab job fails when the allocation is already placed or when no node satisfies the constraints:

```
nomad-qa-2 nomad job status autoscaler
```

Output from gitlab: