hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

allocations stuck in pending for a really long time, over an hour (CSI?) #23667

Open ctgschollar opened 1 month ago

ctgschollar commented 1 month ago

Nomad version

Nomad v1.8.0
BuildDate 2024-05-28T17:38:17Z
Revision 28b82e4b2259fae5a62e2ed47395334bea5a24c4

Operating system and Environment details

Servers and clients

3 server nodes running as VMs on a proxmox cluster

3 client nodes running on hosts

All running

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

Volumes

I am running the Ceph RBD CSI plugin to provide volumes for all jobs.

Driver

Docker version 24.0.6, build ed223bc

Details

I am trying to run some fairly long-running batch jobs that do data conversion on large datasets of up to 10 TB. The Nomad cluster is currently running only these jobs, and all of them are batch jobs.

Before each job runs, I create a new volume for it using a separate API I wrote to manage my Ceph RBD space.

Immediately after creating the volume I run the job, which appears to load fine. The job then sits in "Recovering" for an hour with no indication of any error, but the tasks don't start running.
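
For reference, here is a rough sketch of what registering one of these volumes with Nomad could look like before the job is submitted. My API may do this via nomad volume create instead, and the external ID, secrets, and parameters below are placeholders rather than my real values; only the plugin ID (ceph-csi) and the volume naming match my setup.

# Sketch only: register an already-created RBD image as a Nomad CSI volume.
cat > volume.hcl <<'EOF'
id          = "1721578273_30480"
name        = "1721578273_30480"
type        = "csi"
plugin_id   = "ceph-csi"
external_id = "0001-0024-<cluster-fsid>-<pool-id>-<image-uuid>"  # handle of the existing RBD image

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

secrets {
  userID  = "<ceph-user>"
  userKey = "<ceph-key>"
}

parameters {
  clusterID = "<cluster-fsid>"
  pool      = "<rbd-pool>"
}
EOF
nomad volume register volume.hcl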

Issue

The job is allocated, but it goes into the "Recovering" status immediately. I can see the jobs are allocated to clients with enough resources, but they seem to just sit and wait for some reason.

They then stay in that state for a long time, anywhere from half an hour to over an hour, before the tasks finally start.

Reproduction steps

This happens any time I run one of my batch jobs.

Expected Result

The job should start running once it is allocated, or at least within a few minutes.

Actual Result

The job is stuck in "Recovering" for over an hour, with no indication of what the issue is.

Job file (if appropriate)

{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "transfer_1721578273_30480",
  "ParentID": "",
  "Name": "transfer_1721578273_30480",
  "Type": "batch",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "chpc_nomad_18"
  ],
  "NodePool": "default",
  "Constraints": null,
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "transfer",
      "Count": 1,
      "Update": null,
      "Migrate": null,
      "Constraints": null,
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 86400000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "Disconnect": null,
      "Tasks": [
        {
          "Name": "reserve-volume",
          "Driver": "docker",
          "User": "root",
          "Config": {
            "command": "python",
            "image": "harbor.sdp.kat.ac.za/sdp-services/ceph-rbd-manager:production",
            "args": [
              "scripts/volume.py",
              "reserve",
              "1721578273_30480",
              "1.9023799892364577"
            ],
            "auth": [
              {
               REDACTED
              }
            ]
          },
          "Env": {
           REDACTED
          },
          "Services": null,
          "Vault": null,
          "Consul": null,
          "Templates": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 100,
            "Cores": 0,
            "MemoryMB": 300,
            "MemoryMaxMB": 0,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null,
            "NUMA": null
          },
          "RestartPolicy": {
            "Attempts": 3,
            "Interval": 86400000000000,
            "Delay": 15000000000,
            "Mode": "fail",
            "RenderTemplates": false
          },
          "DispatchPayload": null,
          "Lifecycle": {
            "Hook": "prestart",
            "Sidecar": false
          },
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10,
            "Disabled": false
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null,
          "Identity": {
            "Name": "default",
            "Audience": [
              "nomadproject.io"
            ],
            "ChangeMode": "",
            "ChangeSignal": "",
            "Env": false,
            "File": false,
            "ServiceName": "",
            "TTL": 0
          },
          "Identities": null,
          "Actions": null,
          "Schedule": null
        },
        {
          "Name": "release-volume",
          "Driver": "docker",
          "User": "root",
          "Config": {
            "auth": [
              {
                REDACTED
              }
            ],
            "command": "python",
            "image": "harbor.sdp.kat.ac.za/sdp-services/ceph-rbd-manager:production",
            "args": [
              "scripts/volume.py",
              "release",
              "1721578273_30480"
            ]
          },
          "Env": {
            "RESOURCE_MANAGER_URL": "http://k8s103.sdp.mkat.chpc.kat.ac.za:5000"
          },
          "Services": null,
          "Vault": null,
          "Consul": null,
          "Templates": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 100,
            "Cores": 0,
            "MemoryMB": 300,
            "MemoryMaxMB": 0,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null,
            "NUMA": null
          },
          "RestartPolicy": {
            "Attempts": 3,
            "Interval": 86400000000000,
            "Delay": 15000000000,
            "Mode": "fail",
            "RenderTemplates": false
          },
          "DispatchPayload": null,
          "Lifecycle": {
            "Hook": "poststop",
            "Sidecar": false
          },
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10,
            "Disabled": false
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null,
          "Identity": {
            "Name": "default",
            "Audience": [
              "nomadproject.io"
            ],
            "ChangeMode": "",
            "ChangeSignal": "",
            "Env": false,
            "File": false,
            "ServiceName": "",
            "TTL": 0
          },
          "Identities": null,
          "Actions": null,
          "Schedule": null
        },
        {
          "Name": "transfer",
          "Driver": "docker",
          "User": "root",
          "Config": {
            "image": "harbor.sdp.kat.ac.za/sdp-services/transfer:debug",
            "network_mode": "host",
            "volumes": [
              REDACTED
            ],
            "args": [
              "--job-id",
              "30480"
            ],
            "auth": [
              {
                REDACTED
              }
            ],
            "command": "transfer.py",
            "extra_hosts": [
              REDACTED
            ]
          },
          "Env": {
            REDACTED
          },
          "Services": null,
          "Vault": null,
          "Consul": null,
          "Templates": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 8000,
            "Cores": 0,
            "MemoryMB": 25000,
            "MemoryMaxMB": 0,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null,
            "NUMA": null
          },
          "RestartPolicy": {
            "Attempts": 3,
            "Interval": 86400000000000,
            "Delay": 15000000000,
            "Mode": "fail",
            "RenderTemplates": false
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10,
            "Disabled": false
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": [
            {
              "Volume": "ceph-rbd",
              "Destination": "/scratch/kat",
              "ReadOnly": false,
              "PropagationMode": "private",
              "SELinuxLabel": ""
            }
          ],
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null,
          "Identity": {
            "Name": "default",
            "Audience": [
              "nomadproject.io"
            ],
            "ChangeMode": "",
            "ChangeSignal": "",
            "Env": false,
            "File": false,
            "ServiceName": "",
            "TTL": 0
          },
          "Identities": null,
          "Actions": null,
          "Schedule": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": {
        "Attempts": 1,
        "Interval": 86400000000000,
        "Delay": 5000000000,
        "DelayFunction": "constant",
        "MaxDelay": 0,
        "Unlimited": false
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": null,
      "Consul": {
        "Namespace": "",
        "Cluster": "default",
        "Partition": ""
      },
      "Services": null,
      "Volumes": {
        "ceph-rbd": {
          "Name": "ceph-rbd",
          "Type": "csi",
          "Source": "1721578273_30480",
          "ReadOnly": false,
          "AccessMode": "single-node-writer",
          "AttachmentMode": "file-system",
          "MountOptions": null,
          "PerAlloc": false
        }
      },
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "MaxClientDisconnect": null,
      "PreventRescheduleOnLost": false
    }
  ],
  "Update": {
    "Stagger": 0,
    "MaxParallel": 0,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "Meta": null,
  "ConsulToken": "",
  "ConsulNamespace": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "running",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1721659998355071500,
  "CreateIndex": 237805449,
  "ModifyIndex": 237805450,
  "JobModifyIndex": 237805449,
  "UI": null,
  "meta": {
    "index": "237805450"
  }
}

Inspecting jobs


nomad alloc status 6ecf78cc-787e-00b1-20e9-ef1fa2d9c1fb
ID                  = 6ecf78cc-787e-00b1-20e9-ef1fa2d9c1fb
Eval ID             = d0cd14a2
Name                = transfer_1721578273_30480.transfer[0]
Node ID             = 18c6b2b8
Node Name           = k8s102
Job ID              = transfer_1721578273_30480
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 35m43s ago
Modified            = 35m43s ago
root@nomad-svr-1:/home/kat# nomad alloc status -verbose 6ecf78cc-787e-00b1-20e9-ef1fa2d9c1fb
ID                  = 6ecf78cc-787e-00b1-20e9-ef1fa2d9c1fb
Eval ID             = d0cd14a2-55d6-3083-a1c1-54d41b8b1da7
Name                = transfer_1721578273_30480.transfer[0]
Node ID             = 18c6b2b8-1bfb-859c-cee4-2719fdc2bd90
Node Name           = k8s102
Job ID              = transfer_1721578273_30480
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2024-07-22T14:53:18Z
Modified            = 2024-07-22T14:53:18Z
Evaluated Nodes     = 2
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 207.084µs
Failures            = 0

Placement Metrics
Node                                  binpack  job-anti-affinity  node-affinity  node-reschedule-penalty  final score
18c6b2b8-1bfb-859c-cee4-2719fdc2bd90  0.845    0                  0              0                        0.845
196f726d-4101-2417-1d4b-f428d524895f  0.305    0                  0              0                        0.305

nomad job allocs -verbose transfer_1721578273_30480
ID                                    Eval ID                               Node ID                               Node Name  Task Group  Version  Desired  Status   Created               Modified
6ecf78cc-787e-00b1-20e9-ef1fa2d9c1fb  d0cd14a2-55d6-3083-a1c1-54d41b8b1da7  18c6b2b8-1bfb-859c-cee4-2719fdc2bd90  k8s102     transfer    0        run      pending  2024-07-22T14:53:18Z  2024-07-22T14:53:18Z

Logs

nomad monitor -log-level=TRACE
2024-07-22T15:25:19.594Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="182.358µs"
2024-07-22T15:25:24.623Z [TRACE] client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=5facb53e-a9db-652b-ca73-130a2772cb75 plugin.name=ceph-csi plugin.type=node task=ceph-node grpc.code=OK duration="642.727µs" grpc.service=csi.v1.Identity grpc.method=Probe
2024-07-22T15:25:24.666Z [TRACE] client.csi_manager.ceph-csi: finished client unary call: grpc.code=OK duration="501.517µs" grpc.service=csi.v1.Identity grpc.method=Probe
2024-07-22T15:25:24.666Z [TRACE] client.csi_manager.ceph-csi: finished client unary call: grpc.code=OK duration="570.479µs" grpc.service=csi.v1.Node grpc.method=NodeGetCapabilities
2024-07-22T15:25:27.500Z [TRACE] client: next heartbeat: period=18.08640635s
2024-07-22T15:25:29.596Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="172.309µs"
2024-07-22T15:25:39.597Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="160.616µs"
2024-07-22T15:25:44.663Z [TRACE] consul.sync: execute sync: reason=periodic
2024-07-22T15:25:45.588Z [TRACE] client: next heartbeat: period=17.970925372s
2024-07-22T15:25:49.599Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="724.042µs"
2024-07-22T15:25:53.637Z [DEBUG] client: updated allocations: index=237805521 total=4 pulled=1 filtered=3
2024-07-22T15:25:53.637Z [DEBUG] client: allocation updates: added=1 removed=0 updated=0 ignored=3
2024-07-22T15:25:53.637Z [TRACE] client: next heartbeat: period=13.072201027s
2024-07-22T15:25:53.638Z [TRACE] client.alloc_runner.task_coordinator: state transition: alloc_id=5e4ac4fb-0615-ef32-cc14-e68edd197beb from=init to=init
2024-07-22T15:25:53.642Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5e4ac4fb-0615-ef32-cc14-e68edd197beb task=reserve-volume type=Received msg="Task received by client" failed=false
2024-07-22T15:25:53.643Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5e4ac4fb-0615-ef32-cc14-e68edd197beb task=release-volume type=Received msg="Task received by client" failed=false
2024-07-22T15:25:53.644Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5e4ac4fb-0615-ef32-cc14-e68edd197beb task=transfer type=Received msg="Task received by client" failed=false
2024-07-22T15:25:53.644Z [DEBUG] client: allocation updates applied: added=1 removed=0 updated=0 ignored=3 errors=0
2024-07-22T15:25:53.644Z [DEBUG] client.alloc_runner.widmgr: retrieving and renewing workload identities: alloc_id=5e4ac4fb-0615-ef32-cc14-e68edd197beb num_identities=0
2024-07-22T15:25:53.650Z [DEBUG] client.alloc_runner.runner_hook.csi_hook: found CSI plugin: alloc_id=5e4ac4fb-0615-ef32-cc14-e68edd197beb type=csi-node name=ceph-csi
2024-07-22T15:25:53.650Z [TRACE] client.csi_manager.ceph-csi.volume_manager: Preparing volume staging environment: alloc_id=5e4ac4fb-0615-ef32-cc14-e68edd197beb volume_id=1708650386_30483
2024-07-22T15:25:53.650Z [TRACE] client.csi_manager.ceph-csi.volume_manager: Volume staging environment: alloc_id=5e4ac4fb-0615-ef32-cc14-e68edd197beb volume_id=1708650386_30483 pre-existing_mount=false host_staging_path=/opt/nomad/data/client/csi/node/ceph-csi/staging/default/1708650386_30483/rw-file-system-single-node-writer plugin_staging_path=/local/csi/staging/default/1708650386_30483/rw-file-system-single-node-writer
2024-07-22T15:25:53.650Z [TRACE] client: next heartbeat: period=13.312347336s
2024-07-22T15:25:54.624Z [TRACE] client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=5facb53e-a9db-652b-ca73-130a2772cb75 plugin.name=ceph-csi plugin.type=node task=ceph-node grpc.code=OK duration="606.568µs" grpc.service=csi.v1.Identity grpc.method=Probe
2024-07-22T15:25:54.668Z [TRACE] client.csi_manager.ceph-csi: finished client unary call: grpc.code=OK duration="575.759µs" grpc.service=csi.v1.Identity grpc.method=Probe
2024-07-22T15:25:54.668Z [TRACE] client.csi_manager.ceph-csi: finished client unary call: grpc.code=OK duration="559.569µs" grpc.service=csi.v1.Node grpc.method=NodeGetCapabilities
2024-07-22T15:25:59.600Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="195.152µs"
2024-07-22T15:26:06.965Z [TRACE] client: next heartbeat: period=16.060088472s
2024-07-22T15:26:09.601Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration="164.775µs"

These logs are for an example job, transfer_1537557537_30479.

Nomad Server logs (if appropriate)

I'm not sure how to get these.

Nomad Client logs (if appropriate)

ctgschollar commented 1 month ago

I have tried rebuilding my cluster once, in case some of my experimenting to get to this point caused any issues. However, on a completely new cluster with a new OS and a fresh Nomad install, it still fails to start jobs that appear to have been allocated but never deploy. They have currently been sitting for about an hour and a half, all allocated and in "starting", but not doing anything.


ctgschollar commented 1 month ago

Some logs from a client

Jul 22 17:39:56 k8s101 nomad[9815]:     2024-07-22T17:39:56.762Z [INFO]  client: node registration complete
Jul 22 17:40:05 k8s101 nomad[9815]:     2024-07-22T17:40:05.541Z [INFO]  client: node registration complete
Jul 22 17:47:15 k8s101 nomad[9815]:     2024-07-22T17:47:15.555Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Received msg="Task received by client" failed=false
Jul 22 17:47:15 k8s101 nomad[9815]:     2024-07-22T17:47:15.558Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type="Task Setup" msg="Building Task Directory" failed=false
Jul 22 17:47:15 k8s101 nomad[9815]:     2024-07-22T17:47:15.605Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Jul 22 17:47:15 k8s101 nomad[9815]:     2024-07-22T17:47:15.605Z [INFO]  agent: (runner) creating watcher
Jul 22 17:47:15 k8s101 nomad[9815]:     2024-07-22T17:47:15.605Z [INFO]  agent: (runner) starting
Jul 22 17:47:15 k8s101 nomad[9815]:     2024-07-22T17:47:15.625Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/34be2794-a5ff-fd61-086b-c2101f432ec2/ceph-node/local/config.json"
Jul 22 17:47:15 k8s101 nomad[9815]:     2024-07-22T17:47:15.637Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Driver msg="Downloading image" failed=false
Jul 22 17:47:49 k8s101 nomad[9815]:     2024-07-22T17:47:49.053Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=5edb5d60dbe16179aa806aee1cc8e295303b7f791c58c3526ede7046e2605088
Jul 22 17:47:49 k8s101 nomad[9815]:     2024-07-22T17:47:49.265Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=5edb5d60dbe16179aa806aee1cc8e295303b7f791c58c3526ede7046e2605088
Jul 22 17:47:49 k8s101 nomad[9815]:     2024-07-22T17:47:49.301Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Started msg="Task started by client" failed=false
Jul 22 17:47:49 k8s101 nomad[9815]:     2024-07-22T17:47:49.489Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Terminated msg="Exit Code: 255, Exit Message: \"Docker container exited with non-zero exit code: >
Jul 22 17:47:49 k8s101 nomad[9815]:     2024-07-22T17:47:49.494Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/usr/bin/nomad id=10526
Jul 22 17:47:49 k8s101 nomad[9815]:     2024-07-22T17:47:49.494Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node reason="Restart within policy" delay=17.92143415s
Jul 22 17:47:49 k8s101 nomad[9815]:     2024-07-22T17:47:49.494Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Restarting msg="Task restarting in 17.92143415s" failed=false
Jul 22 17:48:07 k8s101 nomad[9815]:     2024-07-22T17:48:07.447Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=6d736fa22ef2f091605361e32700d5d2dce99ac9cf31140a982428c975511d73
Jul 22 17:48:07 k8s101 nomad[9815]:     2024-07-22T17:48:07.624Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=6d736fa22ef2f091605361e32700d5d2dce99ac9cf31140a982428c975511d73
Jul 22 17:48:07 k8s101 nomad[9815]:     2024-07-22T17:48:07.661Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Started msg="Task started by client" failed=false
Jul 22 17:48:07 k8s101 nomad[9815]:     2024-07-22T17:48:07.893Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Terminated msg="Exit Code: 255, Exit Message: \"Docker container exited with non-zero exit code: >
Jul 22 17:48:07 k8s101 nomad[9815]:     2024-07-22T17:48:07.899Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node reason="Restart within policy" delay=18.638869179s
Jul 22 17:48:07 k8s101 nomad[9815]:     2024-07-22T17:48:07.899Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Restarting msg="Task restarting in 18.638869179s" failed=false
Jul 22 17:48:07 k8s101 nomad[9815]:     2024-07-22T17:48:07.901Z [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker plugin=/usr/bin/nomad id=10761
Jul 22 17:48:19 k8s101 nomad[9815]:     2024-07-22T17:48:19.302Z [ERROR] client.alloc_runner.task_runner.task_hook: killing task because plugin failed: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node error="CSI plugin failed probe: timeout while connecting to gRPC soc>
Jul 22 17:48:19 k8s101 nomad[9815]:     2024-07-22T17:48:19.302Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type="Plugin became unhealthy" msg="Error: CSI plugin failed probe: timeout while connecting to gRPC s>
Jul 22 17:48:19 k8s101 nomad[9815]:     2024-07-22T17:48:19.303Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Killing msg="CSI plugin did not become healthy before configured 30s health timeout" failed=true
Jul 22 17:48:19 k8s101 nomad[9815]:     2024-07-22T17:48:19.313Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node plugin=/usr/bin/nomad id=10288
Jul 22 17:48:19 k8s101 nomad[9815]:     2024-07-22T17:48:19.314Z [INFO]  agent: (runner) stopping
Jul 22 17:48:19 k8s101 nomad[9815]:     2024-07-22T17:48:19.314Z [ERROR] client.alloc_runner.task_runner.task_hook: failed to kill task: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node kill_reason="CSI plugin failed probe: timeout while connecting to gRPC socket: fail>
Jul 22 17:48:19 k8s101 nomad[9815]:     2024-07-22T17:48:19.314Z [INFO]  agent: (runner) received finish
Jul 22 17:48:19 k8s101 nomad[9815]:     2024-07-22T17:48:19.314Z [INFO]  client.gc: marking allocation for GC: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2
Jul 22 17:50:05 k8s101 nomad[9815]:     2024-07-22T17:50:05.194Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4bedb61d-5713-05b2-9adc-cb5671d1fe62 task=ceph-node type=Received msg="Task received by client" failed=false
Jul 22 17:50:05 k8s101 nomad[9815]:     2024-07-22T17:50:05.196Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4bedb61d-5713-05b2-9adc-cb5671d1fe62 task=ceph-node type="Task Setup" msg="Building Task Directory" failed=false
Jul 22 17:50:05 k8s101 nomad[9815]:     2024-07-22T17:50:05.241Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Jul 22 17:50:05 k8s101 nomad[9815]:     2024-07-22T17:50:05.241Z [INFO]  agent: (runner) creating watcher
Jul 22 17:50:05 k8s101 nomad[9815]:     2024-07-22T17:50:05.241Z [INFO]  agent: (runner) starting
Jul 22 17:50:05 k8s101 nomad[9815]:     2024-07-22T17:50:05.263Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/4bedb61d-5713-05b2-9adc-cb5671d1fe62/ceph-node/local/config.json"
Jul 22 17:50:05 k8s101 nomad[9815]:     2024-07-22T17:50:05.285Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=41d225f208e95e8b1dc2ba31056d2300ec97dec55dacbade79e05f72f6f99c3f
Jul 22 17:50:05 k8s101 nomad[9815]:     2024-07-22T17:50:05.484Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=41d225f208e95e8b1dc2ba31056d2300ec97dec55dacbade79e05f72f6f99c3f
Jul 22 17:50:05 k8s101 nomad[9815]:     2024-07-22T17:50:05.514Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4bedb61d-5713-05b2-9adc-cb5671d1fe62 task=ceph-node type=Started msg="Task started by client" failed=false
Jul 22 17:50:10 k8s101 nomad[9815]:     2024-07-22T17:50:10.516Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=4bedb61d-5713-05b2-9adc-cb5671d1fe62 plugin.name=ceph-csi plugin.type=node task=ceph-node grpc.code=DeadlineExceeded >
Jul 22 17:50:15 k8s101 nomad[9815]:     2024-07-22T17:50:15.521Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4bedb61d-5713-05b2-9adc-cb5671d1fe62 task=ceph-node type="Plugin became healthy" msg="plugin: ceph-csi" failed=false
Jul 22 17:50:23 k8s101 nomad[9815]:     2024-07-22T17:50:23.799Z [INFO]  client: node registration complete
Jul 22 17:53:52 k8s101 nomad[9815]:     2024-07-22T17:53:52.277Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=ed62f907-a42f-a6cf-2c22-14af2edb8031 task=reserve-volume type=Received msg="Task received by client" failed=false
Jul 22 17:53:52 k8s101 nomad[9815]:     2024-07-22T17:53:52.278Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=ed62f907-a42f-a6cf-2c22-14af2edb8031 task=release-volume type=Received msg="Task received by client" failed=false
Jul 22 17:53:52 k8s101 nomad[9815]:     2024-07-22T17:53:52.279Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=ed62f907-a42f-a6cf-2c22-14af2edb8031 task=transfer type=Received msg="Task received by client" failed=false
Jul 22 17:53:57 k8s101 nomad[9815]:     2024-07-22T17:53:57.594Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=7e62615d-0fca-790b-e292-ab21b09fee14 task=reserve-volume type=Received msg="Task received by client" failed=false
Jul 22 17:53:57 k8s101 nomad[9815]:     2024-07-22T17:53:57.595Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=7e62615d-0fca-790b-e292-ab21b09fee14 task=release-volume type=Received msg="Task received by client" failed=false
Jul 22 17:53:57 k8s101 nomad[9815]:     2024-07-22T17:53:57.596Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=7e62615d-0fca-790b-e292-ab21b09fee14 task=transfer type=Received msg="Task received by client" failed=false
Jul 22 17:54:04 k8s101 nomad[9815]:     2024-07-22T17:54:04.774Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0f2059b4-ebb9-5a91-0936-5a71cd467214 task=reserve-volume type=Received msg="Task received by client" failed=false
Jul 22 17:54:04 k8s101 nomad[9815]:     2024-07-22T17:54:04.775Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0f2059b4-ebb9-5a91-0936-5a71cd467214 task=release-volume type=Received msg="Task received by client" failed=false
Jul 22 17:54:04 k8s101 nomad[9815]:     2024-07-22T17:54:04.775Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0f2059b4-ebb9-5a91-0936-5a71cd467214 task=transfer type=Received msg="Task received by client" failed=false
Jul 22 17:54:09 k8s101 nomad[9815]:     2024-07-22T17:54:09.661Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=dc9e05b4-dd77-c686-572a-40d76d89adc7 task=reserve-volume type=Received msg="Task received by client" failed=false
Jul 22 17:54:09 k8s101 nomad[9815]:     2024-07-22T17:54:09.662Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=dc9e05b4-dd77-c686-572a-40d76d89adc7 task=release-volume type=Received msg="Task received by client" failed=false
Jul 22 17:54:09 k8s101 nomad[9815]:     2024-07-22T17:54:09.662Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=dc9e05b4-dd77-c686-572a-40d76d89adc7 task=transfer type=Received msg="Task received by client" failed=false
Jul 22 18:54:15 k8s101 nomad[9815]:     2024-07-22T18:54:15.537Z [INFO]  client.gc: garbage collecting allocation: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 reason="forced collection"
Jul 22 18:54:15 k8s101 nomad[9815]:     2024-07-22T18:54:15.538Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=34be2794-a5ff-fd61-086b-c2101f432ec2 task=ceph-node type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false

And here is the status of one of the allocs mentioned in the log:

nomad alloc status ed62f907
ID                  = ed62f907-a42f-a6cf-2c22-14af2edb8031
Eval ID             = 2dd12094
Name                = transfer_1708650386_30486.transfer[0]
Node ID             = 63eab7b8
Node Name           = k8s101
Job ID              = transfer_1708650386_30486
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 1h56m ago
Modified            = 1h56m ago
root@nomad-svr-3:/etc/nomad.d# nomad alloc status -verbose ed62f907
ID                  = ed62f907-a42f-a6cf-2c22-14af2edb8031
Eval ID             = 2dd12094-d862-c4d6-7dc8-3760d590899b
Name                = transfer_1708650386_30486.transfer[0]
Node ID             = 63eab7b8-f12d-69d7-23f9-15312b483cf3
Node Name           = k8s101
Job ID              = transfer_1708650386_30486
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2024-07-22T17:53:52Z
Modified            = 2024-07-22T17:53:52Z
Evaluated Nodes     = 2
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 148.233µs
Failures            = 0

Placement Metrics
Node                                  binpack  job-anti-affinity  node-affinity  node-reschedule-penalty  final score
63eab7b8-f12d-69d7-23f9-15312b483cf3  0.695    0                  0              0                        0.695
b59e5084-05a1-6bff-cb6d-5817bda3c0a2  0.688    0                  0              0                        0.688

ctgschollar commented 1 month ago

A job definition that works properly:

job "test" {
  type        = "batch"
  datacenters = ["chpc_nomad_18"]

  group "test" {

    volume "test" {
      type            = "csi"
      source          = "test"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      read_only       = false
    } 

    task "whalesay" {
      user = "root"
      driver = "docker"

      config {
        auth {
          REDACTED
        }
        network_mode = "host"
        image = "harbor.sdp.kat.ac.za/dockerhub_proxy/docker/whalesay"
        command = "cowsay"
        extra_hosts = ["REDACTED"]
        args = ["yolo"]
        volumes = ["/home/kat/.ssh:/root/.ssh","/home/kat/.aws/:/root/.aws"]
      }

      volume_mount {
        volume = "test"
        destination = "/scratch/kat"
        read_only = false
      }

      resources {
        cpu    = 100 
        memory = 100
      }
      env {
        REDACTED
      }
    }
  }
}

ctgschollar commented 1 month ago

All the jobs did eventually run, but it took over 2 hours for them to start.

tgross commented 1 month ago

Hi @ctgschollar! Unfortunately there's not a lot to go on here in your logs. For example, if I look at the most recent log you've posted, I see the Ceph plugin being marked healthy and fingerprinting to the server a few minutes before the allocation ed62f907 is placed:

Jul 22 17:50:15 k8s101 nomad[9815]: 2024-07-22T17:50:15.521Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=4bedb61d-5713-05b2-9adc-cb5671d1fe62 task=ceph-node type="Plugin became healthy" msg="plugin: ceph-csi" failed=false
Jul 22 17:50:23 k8s101 nomad[9815]: 2024-07-22T17:50:23.799Z [INFO] client: node registration complete
...
Jul 22 17:53:52 k8s101 nomad[9815]: 2024-07-22T17:53:52.277Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=ed62f907-a42f-a6cf-2c22-14af2edb8031 task=reserve-volume type=Received msg="Task received by client" failed=false
Jul 22 17:53:52 k8s101 nomad[9815]: 2024-07-22T17:53:52.278Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=ed62f907-a42f-a6cf-2c22-14af2edb8031 task=release-volume type=Received msg="Task received by client" failed=false
Jul 22 17:53:52 k8s101 nomad[9815]: 2024-07-22T17:53:52.279Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=ed62f907-a42f-a6cf-2c22-14af2edb8031 task=transfer type=Received msg="Task received by client" failed=false

But these are all info-level logs, and they don't go past those initial few minutes. What you'll want to look for are the logs that cover the whole history of this allocation (everything relevant to us here should have the alloc ID field), and provide them at debug (or, even better, trace) level. My guess is that what's happening here is that the CSI plugin is taking a very long time to provision the volume from Ceph, but without logs that's only a guess. If that is the case, you may also want to get the logs for the Ceph node plugin allocation (via nomad alloc logs).
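
As a sketch, these are the kind of commands that would gather that detail, assuming the clients run Nomad under systemd and the client log_level has been raised to debug or trace; the alloc and plugin IDs are the ones already quoted in this thread:

# On the client k8s101, collect everything the agent logged about the stuck allocation:
journalctl -u nomad --since "2024-07-22 17:40" | grep ed62f907-a42f-a6cf-2c22-14af2edb8031 > alloc-ed62f907.log

# List the allocations running the ceph-csi plugin, then pull the node plugin's own logs
# (4bedb61d is the node plugin allocation seen in the client log above):
nomad plugin status ceph-csi
nomad alloc logs -stderr 4bedb61d ceph-node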

Aside:

k8s101

Cheeky! :grin: