hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad restarting task after alloc successful #3587

Closed sunileman closed 6 years ago

sunileman commented 6 years ago

Nomad version

1.0.0

Operating system and Environment details

Amazon linux

Issue

I can run my Docker container on the same host with no issue using: docker run sunileman/nifi1.1.0

When launched with Nomad (Consul client agent co-located), the task keeps restarting with no clear information about the cause.

Reproduction steps

On an AWS p2.xlarge instance running the Nomad and Consul clients, run the job from the Nomad/Consul client node with the supplied job config.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

2017/11/26 23:17:00.778310 [DEBUG] client: driver event for alloc "68b73ef5-069f-6130-22df-b44c6654a57f": Downloading image sunileman/nifi1.1.0:latest
2017/11/26 23:17:00.949954 [DEBUG] driver.docker: docker pull sunileman/nifi1.1.0:latest succeeded
2017-11-26T23:17:00.950Z [DEBUG] plugin: starting plugin: path=/opt/nomad/bin/nomad args="[/opt/nomad/bin/nomad executor {"LogFile":"/opt/nomad/alloc/68b73ef5-069f-6130-22df-b44c6654a57f/try1/executor.out","LogLevel":"DEBUG"}]"
2017-11-26T23:17:00.951Z [DEBUG] plugin: waiting for RPC address: path=/opt/nomad/bin/nomad
2017-11-26T23:17:00.963Z [DEBUG] plugin.nomad: plugin address: timestamp=2017-11-26T23:17:00.963Z address=/tmp/plugin661915386 network=unix
2017/11/26 23:17:00.966408 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin854779409
2017/11/26 23:17:00.966429 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin854779409]}
2017/11/26 23:17:00.966438 [DEBUG] driver.docker: using 12582912 bytes memory for try1
2017/11/26 23:17:00.966446 [DEBUG] driver.docker: using 20 cpu shares for try1
2017/11/26 23:17:00.966466 [DEBUG] driver.docker: binding directories []string{"/opt/nomad/alloc/68b73ef5-069f-6130-22df-b44c6654a57f/alloc:/alloc", "/opt/nomad/alloc/68b73ef5-069f-6130-22df-b44c6654a57f/try1/local:/local", "/opt/nomad/alloc/68b73ef5-069f-6130-22df-b44c6654a57f/try1/secrets:/secrets"} for try1
2017/11/26 23:17:00.966476 [DEBUG] driver.docker: networking mode not specified; defaulting to bridge
2017/11/26 23:17:00.966491 [DEBUG] driver.docker: allocated port 172.30.2.229:24039 -> 8080 (mapped)
2017/11/26 23:17:00.966501 [DEBUG] driver.docker: exposed port 8080
2017/11/26 23:17:00.966523 [DEBUG] driver.docker: setting container name to: try1-68b73ef5-069f-6130-22df-b44c6654a57f
2017/11/26 23:17:00.992357 [DEBUG] client: updated allocations at index 449 (total 2) (pulled 0) (filtered 2)
2017/11/26 23:17:00.992456 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 2)
2017/11/26 23:17:01.878949 [INFO] driver.docker: created container 976cc87adcac880b31a93e8c676ccf36f17edb62ae9296028a90da711ea3f748
2017/11/26 23:17:02.738910 [INFO] driver.docker: started container 976cc87adcac880b31a93e8c676ccf36f17edb62ae9296028a90da711ea3f748
2017/11/26 23:17:02.750654 [WARN] client: error fetching stats of task try1: stats collection hasn't started yet
2017/11/26 23:17:02.762916 [DEBUG] consul.sync: registered 1 services, 1 checks; deregistered 0 services, 0 checks
2017/11/26 23:17:03.680706 [DEBUG] client: updated allocations at index 450 (total 2) (pulled 0) (filtered 2)
2017/11/26 23:17:03.680808 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 2)
2017/11/26 23:17:08.011752 [DEBUG] driver.docker: error collecting stats from container 976cc87adcac880b31a93e8c676ccf36f17edb62ae9296028a90da711ea3f748: io: read/write on closed pipe
2017-11-26T23:17:08.012Z [DEBUG] plugin: plugin process exited: path=/opt/nomad/bin/nomad
2017/11/26 23:17:08.026709 [INFO] client: task "try1" for alloc "68b73ef5-069f-6130-22df-b44c6654a57f" completed successfully
2017/11/26 23:17:08.026730 [INFO] client: Restarting task "try1" for alloc "68b73ef5-069f-6130-22df-b44c6654a57f" in 17.714321425s
2017/11/26 23:17:08.047107 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 1 services, 1 checks
2017/11/26 23:17:08.192378 [DEBUG] client: updated allocations at index 451 (total 2) (pulled 0) (filtered 2)
2017/11/26 23:17:08.192459 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 2)
2017/11/26 23:17:10.469287 [DEBUG] http: Request /v1/agent/health?type=client (144.219µs)
2017/11/26 23:17:20.470497 [DEBUG] http: Request /v1/agent/health?type=client (238.65µs)
2017/11/26 23:17:25.741634 [DEBUG] client: driver event for alloc "68b73ef5-069f-6130-22df-b44c6654a57f": Downloading image sunileman/nifi1.1.0:latest
2017/11/26 23:17:25.825717 [DEBUG] driver.docker: docker pull sunileman/nifi1.1.0:latest succeeded
2017-11-26T23:17:25.826Z [DEBUG] plugin: starting plugin: path=/opt/nomad/bin/nomad args="[/opt/nomad/bin/nomad executor {"LogFile":"/opt/nomad/alloc/68b73ef5-069f-6130-22df-b44c6654a57f/try1/executor.out","LogLevel":"DEBUG"}]"
2017-11-26T23:17:25.826Z [DEBUG] plugin: waiting for RPC address: path=/opt/nomad/bin/nomad
2017-11-26T23:17:25.839Z [DEBUG] plugin.nomad: plugin address: timestamp=2017-11-26T23:17:25.839Z address=/tmp/plugin283545805 network=unix
2017/11/26 23:17:25.841912 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin476816840
2017/11/26 23:17:25.841949 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin476816840]}
2017/11/26 23:17:25.841962 [DEBUG] driver.docker: using 12582912 bytes memory for try1
2017/11/26 23:17:25.841967 [DEBUG] driver.docker: using 20 cpu shares for try1
2017/11/26 23:17:25.841985 [DEBUG] driver.docker: binding directories []string{"/opt/nomad/alloc/68b73ef5-069f-6130-22df-b44c6654a57f/alloc:/alloc", "/opt/nomad/alloc/68b73ef5-069f-6130-22df-b44c6654a57f/try1/local:/local", "/opt/nomad/alloc/68b73ef5-069f-6130-22df-b44c6654a57f/try1/secrets:/secrets"} for try1
2017/11/26 23:17:25.841995 [DEBUG] driver.docker: networking mode not specified; defaulting to bridge
2017/11/26 23:17:25.842012 [DEBUG] driver.docker: allocated port 172.30.2.229:24039 -> 8080 (mapped)
2017/11/26 23:17:25.842021 [DEBUG] driver.docker: exposed port 8080
2017/11/26 23:17:25.842045 [DEBUG] driver.docker: setting container name to: try1-68b73ef5-069f-6130-22df-b44c6654a57f
2017/11/26 23:17:25.899194 [INFO] driver.docker: created container f7314be8f2560b756fcd29baaa27a54669e75ee72cc9d23da659900ed99ccd0d
2017/11/26 23:17:25.996369 [DEBUG] client: updated allocations at index 453 (total 2) (pulled 0) (filtered 2)
2017/11/26 23:17:25.996463 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 2)
2017/11/26 23:17:26.153726 [INFO] driver.docker: started container f7314be8f2560b756fcd29baaa27a54669e75ee72cc9d23da659900ed99ccd0d
2017/11/26 23:17:26.165413 [WARN] client: error fetching stats of task try1: stats collection hasn't started yet
2017/11/26 23:17:26.171058 [DEBUG] consul.sync: registered 1 services, 1 checks; deregistered 0 services, 0 checks
2017/11/26 23:17:26.395392 [DEBUG] client: updated allocations at index 454 (total 2) (pulled 0) (filtered 2)
2017/11/26 23:17:26.395476 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 2)
2017/11/26 23:17:29.663142 [DEBUG] driver.docker: error collecting stats from container f7314be8f2560b756fcd29baaa27a54669e75ee72cc9d23da659900ed99ccd0d: io: read/write on closed pipe
2017-11-26T23:17:29.664Z [DEBUG] plugin: plugin process exited: path=/opt/nomad/bin/nomad
2017/11/26 23:17:29.676336 [INFO] client: task "try1" for alloc "68b73ef5-069f-6130-22df-b44c6654a57f" completed successfully
2017/11/26 23:17:29.676357 [INFO] client: Restarting task "try1" for alloc "68b73ef5-069f-6130-22df-b44c6654a57f" in 15.483850185s
2017/11/26 23:17:29.698980 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 1 services, 1 checks
2017/11/26 23:17:29.792402 [DEBUG] client: updated allocations at index 455 (total 2) (pulled 0) (filtered 2)

Job file (if appropriate)

`# There can only be a single job definition per file.
# Create a job with ID and Name 'example'
job "test1" {
    # Run the job in the global region, which is the default.
    # region = "global"

    # Specify the datacenters within the region this job can run in.
    datacenters = ["aws"]

    # Service type jobs optimize for long-lived services. This is
    # the default but we can change to batch for short-lived tasks.
    type = "service"

    # Priority controls our access to resources and scheduling priority.
    # This can be 1 to 100, inclusively, and defaults to 50.
    # priority = 50

    # Create a 'cache' group. Each task in the group will be
    # scheduled onto the same machine.
    group "hello" {
        # Control the number of instances of this group.
        # Defaults to 1
        count = 1

        # Define a task to run
        task "try1" {
            # Use Docker to run the task.
            driver = "docker"

            # Configure Docker driver with the image
            config {
                image = "sunileman/nifi1.1.0"
                port_map {
                    nifi = 8080
                }
            }

            service {
                name = "${TASKGROUP}-service"
                tags = ["global", "niif"]
                port = "nifi"

                check {
                    type     = "http"
                    port     = "nifi"
                    path     = "/"
                    interval = "60s"
                    timeout  = "60s"
                    initial_status = "passing"
                }
            }

            # We must specify the resources required for
            # this task to ensure it runs on a machine with
            # enough capacity.
            resources {
                cpu = 109 # 500 MHz
                memory = 10 # 128MB
                network {
                    mbits = 1
                    port "nifi" {
                    }
                }
            }

            # Specify configuration related to log rotation
            logs {
                max_files = 10
                max_file_size = 15
            }

            # Controls the timeout between signalling a task it will be killed
            # and killing the task. If not set a default is used.
            kill_timeout = "10s"
        }
    }
}`

sunileman commented 6 years ago

I found that Nomad does not support 0 as a memory limit (i.e. running Docker without a memory flag). Running the image with Docker without a memory flag works fine, but that does not seem to be possible with Nomad, which imposes a hard memory limit.

I say this because if I run the image like this: docker run -it sunileman/nifi1.1.0 all is good and NiFi comes up.

If I run the Docker image like this: docker run -it --rm --memory="15m" --memory-swappiness=-1 --cpu-shares="20" -p 8080:8080 sunileman/nifi1.1.0

the instance fails to start. How can I run a Docker image on Nomad without a memory limit? I assume someone may respond that this goes against the purpose of Nomad as a resource negotiator. I understand that perspective as well.

shantanugadgil commented 6 years ago

Hi @sunileman, I assume "Nomad 1.0.0" is a typo, as the current version is only 0.7.0.

Also, Nomad doesn't allow a "no limit" configuration. There are many open issues discussing whether to allow "over provisioning" of resources.

HTH, Shantanu

sunileman commented 6 years ago

@shantanugadgil my bad, I meant 0.7.0; I was looking at the Consul version. For some reason many containers run well with "over provisioning", while setting a limit seems to send them into a spin. It might have something to do with JVM memory allocation.

chelseakomlo commented 6 years ago

Hi, thanks for reporting the issue.

Have you taken a look at the default restart stanza per job type for Nomad? See here: https://www.nomadproject.io/docs/job-specification/restart.html#restart-parameter-defaults. I believe the issue you are experiencing is that the job exits successfully, but Nomad continues to restart it?
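
For readers following that link, here is a minimal sketch of an explicit restart stanza for the job file above. It assumes the stanza sits inside the group "hello" block, and the values are illustrative rather than taken from this thread. The parameter names (attempts, interval, delay, mode) are the ones documented for the restart stanza.

```hcl
# Sketch only: place inside group "hello" { ... } in the job file above.
restart {
  # Number of restarts allowed within `interval` (illustrative value).
  attempts = 2

  # Window over which `attempts` is counted (illustrative value).
  interval = "1m"

  # Wait before restarting the task; Nomad adds some jitter, which would
  # explain restart delays such as 17.7s and 15.5s in the client log.
  delay = "15s"

  # "delay" waits for the next interval and keeps retrying;
  # "fail" stops restarting once the attempts are used up.
  mode = "fail"
}
```

With mode set to "fail", a task that keeps exiting should end up in a failed state instead of cycling indefinitely, which can make the underlying problem easier to spot.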

sunileman commented 6 years ago

The issue was related to over provisioning, which Nomad does not support. Thank you @shantanugadgil.
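
Since Nomad always applies a memory limit, one possible way forward is to raise the limit in the task's resources stanza rather than remove it. The sketch below assumes a JVM service such as NiFi needs far more than the 10 MB in the job file above. The cpu and memory numbers are hypothetical and should be sized from the container's observed usage.

```hcl
# Sketch only: replaces the resources block of task "try1" above.
resources {
  # CPU in MHz (hypothetical value).
  cpu = 500

  # Memory in MB, enforced as a hard limit by the Docker driver.
  # Hypothetical value: leave room for the JVM heap plus non-heap overhead.
  memory = 1024

  network {
    mbits = 1
    port "nifi" {
    }
  }
}
```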
