hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad job failed decoding driver config #6106

Closed: Dirrk closed this issue 5 years ago

Dirrk commented 5 years ago

Nomad version

Output from nomad version:

Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)

Operating system and Environment details

Linux ip-10-... 4.4.0-1084-aws #94-Ubuntu SMP Fri May 17 13:10:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Output from docker version:

Client:
 Version:           18.09.1
 API version:       1.39
 Go version:        go1.10.6
 Git commit:        4c52b90
 Built:             Wed Jan  9 19:35:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.1
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       4c52b90
  Built:            Wed Jan  9 19:02:44 2019
  OS/Arch:          linux/amd64
  Experimental:     true

Issue

I have a periodic (cron) job whose allocation fails with a driver error every time it runs, on any box. The error shows up on the client side and looks like this:

Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.622Z [DEBUG] client: updated allocations: index=1817064 total=31 pulled=1 filtered=30
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.622Z [DEBUG] client: allocation updates: added=1 removed=0 updated=0 ignored=30
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.624Z [DEBUG] client: allocation updates applied: added=1 removed=0 updated=0 ignored=30 errors=0
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.629Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs path=/bin/nomad args="[/bin/nomad logmon]"
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.629Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs path=/bin/nomad pid=64913
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.629Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs path=/bin/nomad
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.666Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs version=2
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.666Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs address=/tmp/plugin166144997 network=unix @module=logmon timestamp=2019-08-09T16:56:54.666Z
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.667Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs @module=logmon path=/opt/nomad/data/alloc/0549fedd-db2f-e8a0-45f6-1c865b6c3914/alloc/logs/.clean-efs.stdout.fifo timestamp=2019-08-09T16:56:54.667Z
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.667Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs path=/opt/nomad/data/alloc/0549fedd-db2f-e8a0-45f6-1c865b6c3914/alloc/logs/.clean-efs.stderr.fifo @module=logmon timestamp=2019-08-09T16:56:54.667Z
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.675Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs error="failed to decode driver config: [pos 571]: readContainerLen: Unrecognized descriptor byte: hex: d4, decimal: 212"
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.676Z [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs reason="Policy allows no restarts"
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.679Z [INFO ] client.gc: marking allocation for GC: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.853Z [DEBUG] client: updated allocations: index=1817066 total=31 pulled=0 filtered=31
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.853Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=31
Aug 09 16:56:54 ip-10-* nomad[79767]:     2019-08-09T16:56:54.853Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=31 errors=0
Aug 09 16:56:58 ip-10-* nomad[79767]:     2019-08-09T16:56:58.680Z [WARN ] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs @module=logmon timestamp=2019-08-09T16:56:58.680Z
Aug 09 16:56:58 ip-10-* nomad[79767]:     2019-08-09T16:56:58.680Z [WARN ] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs @module=logmon timestamp=2019-08-09T16:56:58.680Z
Aug 09 16:56:58 ip-10-* nomad[79767]:     2019-08-09T16:56:58.682Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs path=/bin/nomad pid=64913
Aug 09 16:56:58 ip-10-* nomad[79767]:     2019-08-09T16:56:58.682Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs
Aug 09 16:56:58 ip-10-* nomad[79767]:     2019-08-09T16:56:58.682Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=0549fedd-db2f-e8a0-45f6-1c865b6c3914 task=clean-efs
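
For reference, the failed allocation and its task events can also be inspected from the standard Nomad CLI; a minimal sketch, using the alloc ID from the log above and the job ID from the job file below:

# Show the task events and failure reason for this allocation
nomad alloc status 0549fedd-db2f-e8a0-45f6-1c865b6c3914

# Show the periodic job and its (failed) child instances
nomad job status cron-solr-efs-cleanup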

However, when I change the job ID by adding a -2 to the end of the job name, the failure does not happen. The -2 version is the exact same job file (shown below) but works without issue. I also tried stopping the original job and then running garbage collection (the job no longer showed up afterwards, and a plan reported it as a new job), but when I run the job again under the same ID it still fails every time. I suspect something is stored incorrectly in the server's state store, and I cannot reproduce the problem except by reusing the same job ID.
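
A rough sketch of that stop-and-garbage-collect workaround, assuming the job is kept in a file named cron-solr-efs-cleanup.nomad (the file name is illustrative):

# Stop the original periodic job
nomad job stop cron-solr-efs-cleanup

# Force a cluster-wide garbage collection so the dead job is cleaned up
nomad system gc

# Plan and resubmit the same job file under the same ID (this is where the failure comes back)
nomad job plan cron-solr-efs-cleanup.nomad
nomad job run cron-solr-efs-cleanup.nomad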

Reproduction steps

This appears to be specific to my cluster, and I am not sure what data I can provide to help you debug it.

Job file (if appropriate)

Failing job:

job "cron-solr-efs-cleanup" {
  region = "dev"

  datacenters = ["ae1"]

  meta {
    terraform     = true
    repo_location = "nomad-jobs"
    image         = "solr-tasks"
    image_tag     = "feature-web-6604-solr-backup-s3"
  }

  reschedule {
    attempts  = 0
    interval  = "1h"
    delay     = "15m"
    unlimited = false
  }

  type = "batch"

  periodic {
    cron             = "@daily"
    prohibit_overlap = true
    time_zone        = "America/New_York"
  }

  constraint {
    attribute = "${meta.class}"
    value     = "solr-compute"
  }

  group "solr-backup" {
    count = 1

    restart {
      attempts = 0
      mode     = "fail"
    }

    task "clean-efs" {
      driver = "docker"
      leader = true

      config {
        image        = "MYHOST/${NOMAD_META_image}:${NOMAD_META_image_tag}"
        force_pull   = true
        network_mode = "host"

        volumes = [
          "/mnt/:/mnt:rw,rshared",
        ]

        extra_hosts = [
          "somehost:10.0.0.0",
          "somehost2:10.0.0.1",
        ]

        logging {
          type = "fluentd"

          config {
            tag                   = "docker.solr-task.cleanup-efs.${NOMAD_META_solr_cluster}.${NOMAD_REGION}"
            fluentd-buffer-limit  = "1g"
            fluentd-max-retries   = 144
            fluentd-retry-wait    = "600s"
            fluentd-async-connect = true
          }
        }
      }

      env {
        DAYS_TO_KEEP      = "1"
        TASK_NAME         = "cleanup_efs"
        SOLR_CLUSTER_NAME = "all"
        SOLR_BACKUP_DIR   = "/mnt/efs/solr-backup/"
        ROOT_LOG_DIR      = "${NOMAD_ALLOC_DIR}/logs/"
      }

      resources {
        cpu    = 500
        memory = 500
      }
    }
  }
}

Working job:

job "cron-solr-efs-cleanup-2" {
  region = "dev"

  datacenters = ["ae1"]

  meta {
    terraform     = true
    repo_location = "nomad-jobs"
    image         = "solr-tasks"
    image_tag     = "feature-web-6604-solr-backup-s3"
  }

  reschedule {
    attempts  = 0
    interval  = "1h"
    delay     = "15m"
    unlimited = false
  }

  type = "batch"

  periodic {
    cron             = "@daily"
    prohibit_overlap = true
    time_zone        = "America/New_York"
  }

  constraint {
    attribute = "${meta.class}"
    value     = "solr-compute"
  }

  group "solr-backup" {
    count = 1

    restart {
      attempts = 0
      mode     = "fail"
    }

    task "clean-efs" {
      driver = "docker"
      leader = true

      config {
        image        = "MYHOST/${NOMAD_META_image}:${NOMAD_META_image_tag}"
        force_pull   = true
        network_mode = "host"

        volumes = [
          "/mnt/:/mnt:rw,rshared",
        ]

        extra_hosts = [
          "somehost:10.0.0.0",
          "somehost2:10.0.0.1",
        ]

        logging {
          type = "fluentd"

          config {
            tag                   = "docker.solr-task.cleanup-efs.${NOMAD_META_solr_cluster}.${NOMAD_REGION}"
            fluentd-buffer-limit  = "1g"
            fluentd-max-retries   = 144
            fluentd-retry-wait    = "600s"
            fluentd-async-connect = true
          }
        }
      }

      env {
        DAYS_TO_KEEP      = "1"
        TASK_NAME         = "cleanup_efs"
        SOLR_CLUSTER_NAME = "all"
        SOLR_BACKUP_DIR   = "/mnt/efs/solr-backup/"
        ROOT_LOG_DIR      = "${NOMAD_ALLOC_DIR}/logs/"
      }

      resources {
        cpu    = 500
        memory = 500
      }
    }
  }
}

Nomad Client logs (if appropriate)

See the client logs included in the Issue section above.

Nomad Server logs (if appropriate)

I have taken a snapshot from the server and copied it locally; I can send it to you if needed, and I will also perform additional testing if given feedback. I can reproduce this 100% of the time, and my current workaround (running the job under a new job ID) works fine.

Dirrk commented 5 years ago

This issue was due to the meta variable "solr_cluster" not existing in the job. When it did work previously, I had that reference commented out and the task set up with json-file logging so I could view the logs in the UI. This was a known problem on my side and not an issue with Nomad.
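
In other words, the failing job interpolates ${NOMAD_META_solr_cluster} in the fluentd logging tag, but the job's meta block never defines solr_cluster. A minimal sketch of the fix, assuming the cluster name is known when the job is submitted (the value "all" below is purely illustrative):

  meta {
    terraform     = true
    repo_location = "nomad-jobs"
    image         = "solr-tasks"
    image_tag     = "feature-web-6604-solr-backup-s3"

    # Define the key referenced by ${NOMAD_META_solr_cluster} in the
    # fluentd logging tag so the interpolation can resolve.
    solr_cluster  = "all"
  }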

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.