hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Job plan for system job displays max uint64 for create #7019

Open drewbailey opened 4 years ago

drewbailey commented 4 years ago

Nomad version

Output from nomad version

Nomad v0.10.3 (65af1b9ecff5b55a1dd6e10b8c3224f896d6c9fa)

Operating system and Environment details

Ubuntu 19.10

Issue

nomad job plan repro.hcl for a system job displays max uint64 (likely a negative number being computed)

→ nomad job plan repro.hcl
+/- Job: "redis"
+/- Task Group: "cache" (18446744073709551615 create, 2 create/destroy update)
  +   Constraint {
      + LTarget: "${node.class}"
      + Operand: "="
      + RTarget: "class-1"
      }
  +/- Task: "redis" (forces create/destroy update)
    +/- Env[version]: "5" => "2"

+   Task Group: "cache2" (1 create)
    + Task: "redis" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.
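
For reference, 18446744073709551615 is exactly 2^64 - 1, i.e. what you get when a -1 lands in an unsigned 64-bit counter, which fits the "negative number being computed" suspicion above. A minimal Go sketch (not Nomad code) of the wraparound:

package main

import (
	"fmt"
	"math"
)

func main() {
	// 2^64 - 1 is the maximum value of a uint64 ...
	fmt.Println(uint64(math.MaxUint64)) // 18446744073709551615

	// ... and also what -1 becomes when converted to uint64.
	n := int64(-1)
	fmt.Println(uint64(n)) // 18446744073709551615
}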

Reproduction steps

On a Nomad dev cluster with nodes of different node classes, run a system job without constraints, then plan the job with constraints added.

nomad job run simplerepro.hcl

then nomad job plan repro.hcl

Job file (if appropriate)

simplerepro.hcl

job "redis" {
  datacenters = ["dc1"]

  type = "system"

  group "cache" {
    count = 1

    restart {
      attempts = 10
      interval = "5m"

      delay = "25s"
      mode  = "delay"
    }

    ephemeral_disk {
      size = 10
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      env {
        version = "5"
      }

      logs {
        max_files     = 1
        max_file_size = 9
      }

      resources {
        cpu    = 20 # MHz
        memory = 40 # MB

        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }
}

repro.hcl

job "redis" {
  datacenters = ["dc1"]

  type = "system"

  # type = "service"
  group "cache2" {
    constraint {
      attribute = "${node.class}"
      value     = "class-2"
    }

    count = 1

    restart {
      attempts = 10
      interval = "5m"

      delay = "25s"
      mode  = "delay"
    }

    ephemeral_disk {
      size = 10
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      env {
        version = "2"
      }

      logs {
        max_files     = 1
        max_file_size = 9
      }

      resources {
        cpu    = 20 # MHz
        memory = 40 # MB

        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }

  group "cache" {
    constraint {
      attribute = "${node.class}"
      value     = "class-1"
    }

    count = 1

    restart {
      attempts = 10
      interval = "5m"

      delay = "25s"
      mode  = "delay"
    }

    ephemeral_disk {
      size = 10
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      env {
        version = "2"
      }

      logs {
        max_files     = 1
        max_file_size = 9
      }

      resources {
        cpu    = 20 # MHz
        memory = 40 # MB

        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }
}

Nomad logs (if appropriate)

If possible please post relevant logs in the issue.

    2020-01-29T16:29:47.188-0500 [DEBUG] nomad.job.system_sched: reconciled current state with desired state: eval_id=c948f657-174c-98e9-85c8-7d76d4d7910d job_id=redis namespace=default place=2 update=2 migrate=0 stop=0 ignore=0 lost=0
andrey-mazo commented 1 month ago

Hm, this is still happening on v1.5.17 apparently:

Task Group: "mygroup" (18446744073709551615 create, 1 create/destroy update, 123 in-place update)
tgross commented 1 month ago

Thanks for the verification @andrey-mazo but just want to point out that 1.5.x is out of support at this point.

andrey-mazo commented 1 month ago

Thanks for the verification @andrey-mazo but just want to point out that 1.5.x is out of support at this point.

Yeah, I totally understand. I may be able to retest on a newer version as soon as we upgrade, but don't have a particular timeline for that.

And to be fair, I'm not really concerned about the number itself, but more about what Nomad is actually going to do when running such a job.

andrey-mazo commented 1 month ago

Hm, this is still happening on v1.5.17 apparently:

Task Group: "mygroup" (18446744073709551615 create, 1 create/destroy update, 123 in-place update)

This happens not just when updating constraints on a job/group, but also on changing environment vars, templates, etc.

Interestingly, doing edit+plan from the Nomad UI shows a slightly different number:

Task Group: "mygroup" ( 1 create/destroy update 18446744073709552000 create 123 in-place update )

And the issue is not job-specific -- changing constraints on another system job (which is normally placed on the same nodes), for example, shows the same 18446744073709551615 create thing.
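
A guess about the UI number (an assumption, not verified against the UI code): 18446744073709552000 looks like 2^64 - 1 after a round trip through a float64 / JavaScript number, which cannot represent that value exactly. A small Go sketch of the rounding:

package main

import (
	"fmt"
	"math"
)

func main() {
	var count uint64 = math.MaxUint64 // 18446744073709551615, as printed by the CLI

	// A float64 (and hence a JS number) cannot hold 2^64 - 1 exactly;
	// the nearest representable value prints as 18446744073709552000.
	fmt.Printf("%.0f\n", float64(count))
}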

andrey-mazo commented 1 month ago

This happens not just when updating constraints on a job/group, but also on changing environment vars, templates, etc.

There was a placement failure, and now that it has been resolved, only changes to the constraints trigger the 18446744073709551615 create issue -- changing anything else results in a normal 123 create/destroy update. And not even all constraints do: for example, artificially bumping attr.vault.version to a slightly newer version results in a simple "in-place update" (probably because it doesn't really affect the placement decision in this case).

andrey-mazo commented 1 month ago

To continue my little story here.

I drained one of the nodes -- and it magically resolved the issue. This was a node which would have stopped being eligible for the job with the updated constraints.

So, I suspect that 18446744073709551615 create really meant to say 1 destroy.
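
If that suspicion is right, one plausible way to end up with that number (purely hypothetical, this is not Nomad's actual diff code) is an unsigned subtraction that should have gone negative:

package main

import "fmt"

// createCount is a hypothetical sketch: if a "create" count were derived by
// unsigned subtraction, a result that should be negative (allocations to
// destroy rather than create) wraps around to 2^64 - 1.
func createCount(desired, existing uint64) uint64 {
	return desired - existing // underflows when existing > desired
}

func main() {
	fmt.Println(createCount(2, 1)) // 1
	fmt.Println(createCount(1, 2)) // 18446744073709551615, i.e. "-1": one alloc to destroy
}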