hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

nomad running more allocations than `count` set in jobspec #2419

Closed avichalbadaya closed 2 years ago

avichalbadaya commented 7 years ago

BUG

Nomad version

Nomad v0.5.2

Operating system and Environment details

uname -a
Linux manager-0 4.7.3-coreos-r2 #1 SMP Sun Jan 8 00:32:25 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz GenuineIntel GNU/Linux
docker version
Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:      linux/amd64

Issue

Nomad is showing odd behavior, bringing up multiple running allocations (service instances) even though count is set to 1.

Job file (if appropriate)

{
    "Job": {
        "Region": "us-east",
        "ID": "core-engine-excel-worker",
        "ParentID": "",
        "Name": "core-engine-excel-worker",
        "Type": "service",
        "Priority": 50,
        "AllAtOnce": false,
        "Datacenters": [
            "us-east-1a",
            "us-east-1b",
            "us-east-1c",
            "us-east-1d",
            "us-east-1e"
        ],
        "Constraints": null,
        "TaskGroups": [
            {
                "Name": "deploy-blue",
                "Count": 1,
                "Constraints": [
                    {
                        "LTarget": "${node.class}",
                        "RTarget": "blue",
                        "Operand": "="
                    },
                    {
                        "LTarget": "${attr.vault.version}",
                        "RTarget": "\u003e= 0.6.1",
                        "Operand": "version"
                    },
                    {
                        "LTarget": "${attr.os.signals}",
                        "RTarget": "SIGINT",
                        "Operand": "set_contains"
                    }
                ],
                "Tasks": [
                    {
                        "Name": "core-engine-excel-worker",
                        "Driver": "docker",
                        "User": "",
                        "Config": {
                            "args": [
                                "...."
                            ],
                            "command": "bundle",
                            "image": "....",
                            "labels": [
                                {
                                    "...."
                                }
                            ],
                            "logging": [
                                {
                                    "config": [
                                        {
                                            "splunk-token": "....",
                                            "splunk-url": "...."
                                        }
                                    ],
                                    "type": "splunk"
                                }
                            ],
                            "port_map": [
                                {
                                    "http": 8080
                                }
                            ]
                        },
                        "Constraints": null,
                        "Env": {
                            "...."
                        },
                        "Services": [
                            {
                                "Id": "",
                                "Name": "core-engine-excel-worker",
                                "Tags": [
                                    "daemon"
                                ],
                                "PortLabel": "http",
                                "Checks": null
                            }
                        ],
                        "Resources": {
                            "CPU": 1000,
                            "MemoryMB": 2048,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "Networks": [
                                {
                                    "Public": false,
                                    "CIDR": "",
                                    "ReservedPorts": null,
                                    "DynamicPorts": [
                                        {
                                            "Label": "http",
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10
                                }
                            ]
                        },
                        "Meta": null,
                        "KillTimeout": 5000000000,
                        "LogConfig": {
                            "MaxFiles": 10,
                            "MaxFileSizeMB": 10
                        },
                        "Artifacts": null,
                        "Vault": {
                            "Policies": [
                                "..."
                            ],
                            "Env": true,
                            "ChangeMode": "signal",
                            "ChangeSignal": "SIGINT"
                        },
                        "Templates": null
                    }
                ],
                "RestartPolicy": {
                    "Interval": 60000000000,
                    "Attempts": 3,
                    "Delay": 15000000000,
                    "Mode": "fail"
                },
                "EphemeralDisk": {
                    "Sticky": false,
                    "Migrate": false,
                    "SizeMB": 300
                },
                "Meta": null
            },
            {
                "Name": "deploy-green",
                "Count": 1,
                "Constraints": [
                    {
                        "LTarget": "${node.class}",
                        "RTarget": "green",
                        "Operand": "="
                    },
                    {
                        "LTarget": "${attr.vault.version}",
                        "RTarget": "\u003e= 0.6.1",
                        "Operand": "version"
                    },
                    {
                        "LTarget": "${attr.os.signals}",
                        "RTarget": "SIGINT",
                        "Operand": "set_contains"
                    }
                ],
                "Tasks": [
                    {
                        "Name": "core-engine-excel-worker",
                        "Driver": "docker",
                        "User": "",
                        "Config": {
                            "args": [
                                "..."
                            ],
                            "command": "bundle",
                            "image": "....",
                            "labels": [
                                {
                                    "branch": "master",
                                    "repo_url": "....",
                                    "sha": "....",
                                    "version": "..."
                                }
                            ],
                            "logging": [
                                {
                                    "config": [
                                        {
                                            "splunk-token": "...",
                                            "splunk-url": "....."
                                        }
                                    ],
                                    "type": "splunk"
                                }
                            ],
                            "port_map": [
                                {
                                    "http": 8080
                                }
                            ]
                        },
                        "Constraints": null,
                        "Env": {
                            "-": "-",
                            ....
                        },
                        "Services": [
                            {
                                "Id": "",
                                "Name": "core-engine-excel-worker",
                                "Tags": [
                                    "daemon"
                                ],
                                "PortLabel": "http",
                                "Checks": null
                            }
                        ],
                        "Resources": {
                            "CPU": 1000,
                            "MemoryMB": 2048,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "Networks": [
                                {
                                    "Public": false,
                                    "CIDR": "",
                                    "ReservedPorts": null,
                                    "DynamicPorts": [
                                        {
                                            "Label": "http",
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10
                                }
                            ]
                        },
                        "Meta": null,
                        "KillTimeout": 5000000000,
                        "LogConfig": {
                            "MaxFiles": 10,
                            "MaxFileSizeMB": 10
                        },
                        "Artifacts": null,
                        "Vault": {
                            "Policies": [
                                "..."
                            ],
                            "Env": true,
                            "ChangeMode": "signal",
                            "ChangeSignal": "SIGINT"
                        },
                        "Templates": null
                    }
                ],
                "RestartPolicy": {
                    "Interval": 60000000000,
                    "Attempts": 3,
                    "Delay": 15000000000,
                    "Mode": "fail"
                },
                "EphemeralDisk": {
                    "Sticky": false,
                    "Migrate": false,
                    "SizeMB": 300
                },
                "Meta": null
            }
        ],
        "Update": {
            "Stagger": 0,
            "MaxParallel": 0
        },
        "Periodic": null,
        "Meta": {
            "branch": "...."
        },
        "VaultToken": "",
        "Status": "running",
        "StatusDescription": "",
        "CreateIndex": 13665455,
        "ModifyIndex": 15093248,
        "JobModifyIndex": 15093248
    }
}
$ nomad status core-engine-excel-worker
ID          = core-engine-excel-worker
Name        = core-engine-excel-worker
Type        = service
Priority    = 50
Datacenters = us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e
Status      = running
Periodic    = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost
deploy-blue   2       0         2        0       187       1
deploy-green  1       0         0        0       0         0

Placement Failure
Task Group "deploy-green":
  * No nodes are available in datacenter "us-east-1c"
  * No nodes are available in datacenter "us-east-1e"
  * Class "blue" filtered 6 nodes
  * Constraint "${node.class} = green" filtered 6 nodes

Allocations
ID        Eval ID   Node ID   Task Group   Desired  Status    Created At
85e3064b  cdf45bd8  e90e4eda  deploy-blue  run      running   03/09/17 02:51:12 UTC
dcd6af65  cdf45bd8  0d18b0ac  deploy-blue  run      running   03/09/17 02:51:12 UTC
....
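One way to confirm the discrepancy independently of the Summary table is to count running allocations per task group from the allocations list itself (the same data returned by Nomad's `/v1/job/<id>/allocations` endpoint). A minimal sketch, using hypothetical sample data modeled on the listing above:

```python
from collections import Counter

def running_allocs_per_group(allocations):
    """Count allocations whose client status is 'running', grouped by task group."""
    return Counter(
        a["TaskGroup"] for a in allocations if a["ClientStatus"] == "running"
    )

# Hypothetical sample modeled on the allocation listing above; field names
# follow the shape of Nomad's /v1/job/<id>/allocations response.
allocs = [
    {"ID": "85e3064b", "TaskGroup": "deploy-blue", "ClientStatus": "running"},
    {"ID": "dcd6af65", "TaskGroup": "deploy-blue", "ClientStatus": "running"},
]

counts = running_allocs_per_group(allocs)
print(counts["deploy-blue"])  # prints 2 — two running allocs despite Count = 1
```
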
dadgar commented 7 years ago

Hey, do you have the server logs from evaluation cdf45bd8?

dadgar commented 7 years ago

Also, is this reproducible?

sheldonkwok commented 7 years ago

I encountered this bug as well. Running the same job did not lower the count back to 1, but stopping the job and re-running the exact same config returned the count to 1.

This corrupted the state of our application, which was only supposed to have one instance running.

tantra35 commented 7 years ago

I think we have been hitting this bug since Nomad 0.4.x, but couldn't pin it down. For example, today I saw the following (we use Nomad 0.5.4):

root@social:/home/ruslan# nomad status townshipDynamoTeamServer
ID            = townshipDynamoTeamServer
Name          = townshipDynamoTeamServer
Type          = service
Priority      = 50
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group                Queued  Starting  Running  Failed  Complete  Lost
townshipDynamoTeamServer  0       0         2        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group                Desired  Status   Created At
f2beb531  bbe429c9  ec475f0a  townshipDynamoTeamServer  run      running  03/16/17 18:04:19 MSK
e038ed4d  92f3402f  439a2f5a  townshipDynamoTeamServer  stop     running  03/14/17 14:19:13 MSK

As you can see, the second allocation is still in the running state, and Nomad doesn't try to stop it.

For allocation e038ed4d I see the following:

ID                  = e038ed4d
Eval ID             = 92f3402f
Name                = townshipDynamoTeamServer.townshipDynamoTeamServer[0]
Node ID             = 439a2f5a
Job ID              = townshipDynamoTeamServer
Client Status       = running
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc is lost since its node is down
Created At          = 03/14/17 14:19:13 MSK

Task "fluend" is "running"
Task Resources
CPU     Memory   Disk  IOPS  Addresses
50 MHz  300 MiB  0 B   0

Recent Events:
Time                   Type        Description
03/16/17 18:04:06 MSK  Started     Task started by client
03/16/17 18:03:51 MSK  Restarting  Task restarting in 15.210495459s
03/16/17 18:03:51 MSK  Terminated  Exit Code: 0
03/16/17 18:03:51 MSK  Started     Task started by client
03/16/17 18:03:35 MSK  Restarting  Task restarting in 15.252496882s
03/16/17 18:03:35 MSK  Terminated  Exit Code: 0
03/16/17 18:03:35 MSK  Started     Task started by client
03/16/17 18:03:29 MSK  Restarting  Exceeded allowed attempts, applying a delay - Task restarting in 5.752060261s
03/16/17 18:03:29 MSK  Terminated  Exit Code: 0
03/16/17 18:03:29 MSK  Started     Task started by client

Task "townshipDynamoTeamServer" is "pending"
Task Resources
CPU      Memory   Disk  IOPS  Addresses
150 MHz  600 MiB  0 B   0

Recent Events:
Time                   Type            Description
03/16/17 18:03:50 MSK  Restarting      Task restarting in 16.227339s
03/16/17 18:03:50 MSK  Driver Failure  failed to start task "townshipDynamoTeamServer" for alloc "e038ed4d-bff4-c371-a39c-1171a65d376b": Failed to create container: no such image
03/16/17 18:03:41 MSK  Restarting      Exceeded allowed attempts, applying a delay - Task restarting in 9.570411234s
03/16/17 18:03:41 MSK  Driver Failure  failed to start task "townshipDynamoTeamServer" for alloc "e038ed4d-bff4-c371-a39c-1171a65d376b": Failed to create container: no such image
03/16/17 18:03:24 MSK  Restarting      Task restarting in 16.126247999s
03/16/17 18:03:24 MSK  Driver Failure  failed to start task "townshipDynamoTeamServer" for alloc "e038ed4d-bff4-c371-a39c-1171a65d376b": Failed to create container: no such image
03/16/17 18:03:07 MSK  Restarting      Task restarting in 16.569612059s
03/16/17 18:03:07 MSK  Driver Failure  failed to start task "townshipDynamoTeamServer" for alloc "e038ed4d-bff4-c371-a39c-1171a65d376b": Failed to create container: no such image
03/16/17 18:02:50 MSK  Restarting      Task restarting in 16.761964645s
03/16/17 18:02:50 MSK  Driver Failure  failed to start task "townshipDynamoTeamServer" for alloc "e038ed4d-bff4-c371-a39c-1171a65d376b": Failed to create container: no such image

After these manipulations, in the end I see this (the second, bad allocation has disappeared):

ID            = townshipDynamoTeamServer
Name          = townshipDynamoTeamServer
Type          = service
Priority      = 50
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group                Queued  Starting  Running  Failed  Complete  Lost
townshipDynamoTeamServer  0       0         2        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group                Desired  Status   Created At
f2beb531  bbe429c9  ec475f0a  townshipDynamoTeamServer  run      running  03/16/17 18:04:19 MSK

As you can see, the second, bad allocation has disappeared, but Running still shows 2 instead of 1 (the job's task group has count = 1).

dadgar commented 7 years ago

@tantra35 That looks confusing, but I think it is behaving normally and there is a bug in the Summary. The node has lost connection to the servers ("alloc is lost since its node is down"), which is why the server replaced that allocation. Up until 0.5.x, an allocation in that state would have shown as stop/running; now we mark the client status as lost.
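The Summary bug described above can be illustrated by how the running count is derived: counting every allocation whose client reports `running` includes the lost, already-replaced one, while also requiring the server's desired status to be `run` gives the expected figure. A hedged sketch with hypothetical data shaped like the allocation output above:

```python
# Hypothetical allocations modeled on tantra35's listing: one healthy
# replacement, plus the old alloc the server wants stopped but whose
# client still reports "running".
allocs = [
    {"ID": "f2beb531", "DesiredStatus": "run",  "ClientStatus": "running"},
    {"ID": "e038ed4d", "DesiredStatus": "stop", "ClientStatus": "running"},
]

# Naive count: every alloc whose client reports "running".
naive = sum(a["ClientStatus"] == "running" for a in allocs)

# Desired-aware count: only allocs the server still wants running.
effective = sum(
    a["ClientStatus"] == "running" and a["DesiredStatus"] == "run"
    for a in allocs
)

print(naive, effective)  # prints "2 1" — the Summary's 2 vs the intended 1
```
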

tantra35 commented 7 years ago

@dadgar I think you're right about this situation, but I'm confused why Running still shows 2 and not 1.

dadgar commented 7 years ago

@tantra35 Yeah, I believe that is just a bug from the 0.4.x releases, and it should be fixed now. I just ran a test:

nomad status e
ID            = example
Name          = example
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       0         1

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status   Created At
42173a80  e78e5c90  8f6984c9  cache       run      running  03/16/17 18:11:50 UTC
49d20a79  71d4f3ad  564daec4  cache       stop     lost     03/16/17 18:11:21 UTC
tantra35 commented 7 years ago

We use Nomad 0.5.4. Perhaps this happens in our case because the job never gets a chance to launch, due to config errors in the Docker overlay network. On the host where Nomad placed the townshipDynamoTeamServer job, Docker doesn't have a valid overlay network config, so Nomad tries and tries to restart the job without success and ignores the server's decision to place the job on another node. As a result we have two jobs, one of which can't stop due to a bug in the Nomad client code.

Nomon commented 7 years ago

I'm seeing the same issue with Nomad 0.5.5-rc2. Updating the job does update all allocs, but the number of running allocations for the api task group stays at the desired count + 6. So setting the count to 1 results in 7 allocs, and setting the count to 4 (as shown below) results in 10 running allocations.

ubuntu@ip-172-18-3-33:~$ nomad status gameservices-event-processor 
ID            = gameservices-event-processor
Name          = gameservices-event-processor
Type          = service
Priority      = 50
Datacenters   = us-west-2-staging
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
api             0       0         10       0       196       0
redis           0       0         1        0       12        0
scheduled-jobs  0       0         1        0       4         0

Allocations
ID        Eval ID   Node ID   Task Group      Desired  Status   Created At
bc92ab08  298019d9  ec231d1a  api             run      running  03/23/17 13:01:09 UTC
d2ffddaf  4193b1c0  ec2792ea  api             run      running  03/23/17 13:00:39 UTC
b901fafc  bdb6c5f0  ec231d1a  redis           run      running  03/23/17 13:00:09 UTC
57b10239  868ee9b1  ec2bdb6a  api             run      running  03/23/17 12:59:39 UTC
cd700489  902f9648  ec2b7ee4  api             run      running  03/23/17 12:59:09 UTC
b24aafb4  77ac72f4  ec2792ea  api             run      running  03/23/17 12:58:39 UTC
71e32b21  d4b4ee30  ec2443ff  api             run      running  03/23/17 12:58:09 UTC
78e56abd  6b046106  ec2ee3a1  api             run      running  03/23/17 12:57:39 UTC
2ee09f0a  e2e040b6  ec2215b1  api             run      running  03/23/17 12:57:09 UTC
941c40c1  6af53ba9  ec2443ff  api             run      running  03/23/17 12:56:39 UTC
99b368c0  3462985e  ec2215b1  api             run      running  03/23/17 12:56:09 UTC
69d1e0cc  3462985e  ec243a8b  scheduled-jobs  run      running  03/23/17 12:56:09 UTC
ubuntu@ip-172-18-3-33:~$ nomad inspect -t $'{{ range .TaskGroups}}{{.Name}} count: {{.Count}}\n{{end}}' gameservices-event-processor
redis count: 1
api count: 4
scheduled-jobs count: 1
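Drift like this can be detected by comparing each task group's declared `Count` (from `nomad inspect`, as above) with the number of allocations actually running. A minimal sketch, hard-coding the counts and allocation listing shown above rather than querying a live cluster:

```python
from collections import Counter

# Declared counts, taken from the `nomad inspect -t` output above.
declared = {"redis": 1, "api": 4, "scheduled-jobs": 1}

# Task groups of the running allocations in the status output above.
running = Counter(["api"] * 10 + ["redis"] + ["scheduled-jobs"])

# Report every group whose running total differs from its declared count.
drift = {g: running[g] - c for g, c in declared.items() if running[g] != c}
print(drift)  # prints {'api': 6} — six more running allocs than declared
```
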
dadgar commented 7 years ago

@Nomon Do you have any steps to reproduce? I would love to get this fixed but haven't been able to reproduce it!

BSick7 commented 7 years ago

@dadgar There are possible repro steps in issue #2487.

tgross commented 2 years ago

👋 Doing some issue cleanup and it looks like this can be closed according to https://github.com/hashicorp/nomad/issues/2487.

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.