hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad is not ready to serve consistent reads when multiple tasks enter restart loop #15579

Open Dgotlieb opened 1 year ago

Dgotlieb commented 1 year ago

Nomad version

Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)

Operating system and Environment details

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.5 LTS
Release:    20.04
Codename:   focal

Infra resources

10 Clients (3 are also servers) with the below spec:

 $ nomad node status <NODE_ID>
Allocated Resources
CPU                Memory          Disk
169698/332800 MHz  91 GiB/126 GiB  1.5 TiB/3.2 TiB

Allocation Resource Utilization
CPU               Memory
77229/332800 MHz  36 GiB/126 GiB

Host Resource Utilization
CPU               Memory          Disk
77324/332800 MHz  47 GiB/126 GiB  291 GiB/3.4 TiB

Issue

I have a job with 780 groups, each containing 2 tasks (1,560 tasks in total). When all of the tasks enter a restart loop at once (all crashing together), Nomad does not handle it well.

Reproduction steps

Run the job file below and force all of its tasks into a restart loop (for example, by making every task crash on start).
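The job takes a sensors variable, so reproducing it requires a var file. Below is a minimal sketch with made-up IDs and hosts (the real deployment passes 780 entries), run with nomad job run -var-file=sensors.vars.hcl test.nomad.hcl:

sensors = [
  # Hypothetical values for illustration only; the real list has 780 entries.
  { id = "1", host = "10.0.0.1" },
  { id = "2", host = "10.0.0.2" },
]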

Expected Result

Nomad should absorb the restart loop, keep serving consistent reads, and not get stuck.

Actual Result

The cluster gets stuck: operations start failing with raft-related errors, including the "Nomad is not ready to serve consistent reads" error from the title (see also "Worth Mentioning" below).

Job file

variable "sensors" {
  type = list(object({
    id   = string
    host = string
  }))
}

job "test" {
  type        = "service"
  datacenters = ["dc1"]

  reschedule {
    delay          = "10s"
    delay_function = "exponential"
    max_delay      = "10m"
    unlimited      = true
  }

  constraint {
    // Limit the number of tasks on each node
    distinct_property = "${attr.unique.hostname}"
    value             = "100"
  }

  dynamic "group" {
    for_each = var.sensors
    iterator = sensor
    labels   = ["sensor_${sensor.value.id}"]

    content {
      count = 1

      ephemeral_disk {
        migrate = true
        size    = 16000
        sticky  = true
      }

      network {
        mode = "bridge"

        port "sensor_http" {
          to = 9000
        }
        port "capture_http" {
          to = 9001
        }
        port "capture_grpc" {
          to = 9002
        }
        port "metrics" {
          to = 9003
        }
        port "envoy_metrics" {
          to = 9004
        }
        port "jpegs" {
          to = 9005
        }
        port "predictions" {
          to = 9006
        }
      }

      restart {
        interval = "10m"
        attempts = 10
        delay    = "60s"
        mode     = "delay"
      }

      service {
        name = "sensor"
        port = 9000

        meta {
          metrics_port       = "${NOMAD_HOST_PORT_sensor_http}"
          envoy_metrics_port = "${NOMAD_HOST_PORT_envoy_metrics}"
          logical_node       = "sensor_${sensor.value.id}"
          sensor_id          = "${sensor.value.id}"
        }
      }

      task "sensor" {
        driver = "docker"

        config {
          ports = ["sensor_http"]
          image = "my_image:1.0.0"
          args  = ["service", "-c", "/local/config.json", "--sources-config-url", "file:///local/sources.json"]
          runtime = "nvidia"

          labels {
            service = "sensor"
          }
        }

        template {
          data        = file("sensor.json")
          destination = "local/config.json"
          change_mode = "restart"
        }

        template {
          data        = <<EOH
          {
            "sources": [
              {
                "kind": {
                  "sensor": {
                    "ip": "${sensor.value.host}",
                    "kind": "sensor"
                  }
                },
                "id": "${sensor.value.id}"
              }
            ]
          }
          EOH
          destination = "local/sources.json"
          change_mode = "restart"
        }

        logs {
          max_files     = "2"
          max_file_size = 100
        }

        resources {
          memory     = 500  # MiB
          memory_max = 1000 # MiB
          cpu        = 1000 # MHz
        }
      }

      service {
        name = "sensor-storage"
        port = 9002

        meta {
          metrics_port       = "${NOMAD_HOST_PORT_capture_http}"
          envoy_metrics_port = "${NOMAD_HOST_PORT_envoy_metrics}"
          logical_node       = "sensor_${sensor.value.id}"
          exposed_grpc_port = "${NOMAD_HOST_PORT_capture_grpc}"
          sensor_id         = "${sensor.value.id}"
        }
      }

      task "capture_storage" {
        driver = "docker"
        user = "root"

        config {
          ports = ["sensor_http"]
          image = "my_image2:1.0.0"
          args  = ["-c", "/local/config.json"]

          labels {
            service = "capture"
          }
        }

        template {
          data        = file("capture_storage.json")
          destination = "local/config.json"
          change_mode = "restart"
        }

        logs {
          max_files     = "2"
          max_file_size = 100
        }

        resources {
          memory     = 200  # MiB
          memory_max = 1000 # MiB
          cpu        = 250  # MHz
        }
      }
    }
  }
}

Server.hcl

server {
    enabled = true

    bootstrap_expect = 3
    server_join {
      retry_join = ["server1", "server2", "server3", "server4"]
      retry_max = 3
      retry_interval = "15s"
    }

    rejoin_after_leave = false
    enabled_schedulers = ["service","batch","system","sysbatch"]
    num_schedulers = 128
    node_gc_threshold = "24h"
    eval_gc_threshold = "30s"
    job_gc_threshold = "4h"
    deployment_gc_threshold = "1h"

    encrypt = ""

    raft_protocol = 3
    default_scheduler_config {
        scheduler_algorithm = "spread"
        memory_oversubscription_enabled = true
        preemption_config {
          service_scheduler_enabled = true
        }
    }

    failover_heartbeat_ttl = "30s"
    heartbeat_grace = "2m"
}

Client.hcl

client {
    enabled = true

    node_class = ""
    no_host_uuid = false

    max_kill_timeout = "30s"
    network_speed = 0
    cpu_total_compute = 0

    gc_interval = "1m"
    gc_disk_usage_threshold = 80
    gc_inode_usage_threshold = 70
    gc_parallel_destroys = 2
    gc_max_allocs = 200

    reserved {
        cpu = 0
        memory = 0
        disk = 0
    }

    meta = {
        "instance_type" = "server"
    }
}

plugin "raw_exec" {
    config {
        enabled = true
    }
}
plugin "docker" {
    config {
        infra_image = "gcr.io/google_containers/pause-amd64:3.1"
        allow_privileged = true
        allow_caps = ["audit_write", "chown", "dac_override", "fowner", "fsetid", "kill", "mknod", "net_bind_service", "setfcap", "setgid", "setpcap", "setuid", "sys_chroot", "net_raw", "sys_time", "sys_ptrace"]
        volumes {
            enabled = true
        }
    }
}
plugin "nvidia-gpu" {
    config {
        enabled = true
        fingerprint_period = "1m"
    }
}

servers = ["server1", "server2"]

server_join {
    retry_join = ["server1", "server2"]
}

Suspect

I suspect the size of the final job HCL, with its 780 groups, is the issue. I'm not sure how the HCL size relates to the raft indices, but many of the errors are raft-related.

Worth Mentioning

  1. I removed the Connect blocks from the groups to reduce the number of containers, health checks, and Consul-related operations.
  2. When the tasks are not restarting, the job takes a few minutes to schedule and then runs fine.
  3. All operations, including $ nomad system gc, $ nomad system reconcile summaries, and $ nomad stop <job_id> -purge, throw the errors mentioned above, with no way to repair the state (even manually).
  4. I was under the assumption that this would fix it.
  5. I know Nomad was tested in the 2M containers challenge, so in terms of container count I think I'm fine 🙂
    but I wonder whether a restart loop across thousands of containers is something that was tested and that the scheduler should be able to handle.

Questions

  1. Is there a limit on the number of groups in a single job spec? If so, what is it?
  2. If I do end up in this vicious cycle, is there a way to tell the cluster to stop all scheduling operations (mainly evaluations) - some "super admin" action so that my operator actions take precedence and bypass everything already in the queue?
shoenig commented 1 year ago

Thanks for reporting, @Dgotlieb. There is no hard limit on the number of groups that can exist in a job spec; however, as you've found, you may end up overwhelming the cluster's ability to process the number of evaluations created in a pathological case like all of those tasks entering a crash loop. There has been some recent work around load-shedding of evaluations; we'll want to see whether any of that applies to this case or whether we need to do something extra.
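As an aside, one way to observe that evaluation pressure while reproducing this is to enable Prometheus telemetry on the servers and watch the eval broker and raft metrics. A minimal sketch (metric names are taken from the telemetry docs and are worth double-checking against your version):

telemetry {
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
  collection_interval        = "10s"
}

# Metrics to watch while the tasks are crash-looping:
#   nomad.nomad.broker.total_ready / total_unacked / total_blocked
#   nomad.raft.apply and nomad.raft.commitTime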

Dgotlieb commented 1 year ago

OK. I'll also add that even after the job was stopped, I still saw errors in the cluster logs, servers disappearing from the UI/CLI, failures in the Consul health checks for ports 4646/4647/4648, and other issues that could take hours to resolve on their own.

@shoenig just a few more points:

  1. Regarding my second question: is there any way to bypass all eval/schedule requests already in the cluster queue, to keep the cluster from entering this vicious cycle?
  2. Can you explain the relation between the evaluation volume/frequency and raft, given that most of the errors are raft-related?
  3. Can you suggest a version/branch to try out?
  4. Is there any way to set a maximum number of parallel restarts (cluster-wide), so that even when thousands of tasks are restarting together, something in the cluster limits those restarts (e.g. a server_max_parallel_restarts option)? A sketch of what I mean is below.
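To make question 4 concrete: the only restart throttling I'm aware of today is per-group, not cluster-wide. A sketch of the existing knobs versus the kind of server-side option I mean (the server option is purely hypothetical and does not exist):

# Existing, per-group: cap in-place restart attempts, then fall back to the
# job's exponential reschedule backoff instead of restarting in a tight loop.
restart {
  interval = "10m"
  attempts = 3
  delay    = "60s"
  mode     = "fail"
}

# Hypothetical, server-side (does not exist today) - what question 4 asks for:
# server {
#   server_max_parallel_restarts = 100
# }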

Thanks

jamesearl commented 12 months ago

I would also be interested in the answers to the above questions.