hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Successfully completed batch job is re-run with new allocation. #4299

Closed nugend closed 5 years ago

nugend commented 6 years ago

Nomad version

Nomad v0.8.3 (c85483da3471f4bd3a7c3de112e95f551071769f)

Operating system and Environment details

3.10.0-327.36.3.el7.x86_64

Issue

A batch job executed and completed successfully; several hours later, when the allocation was garbage collected, the job was re-run.

Reproduction steps

Not sure. Seems to be happening frequently on our cluster though.

Nomad logs

    2018/05/14 23:30:50.541581 [DEBUG] worker: dequeued evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
    2018/05/14 23:30:50.541765 [DEBUG] sched: <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
    2018/05/14 23:30:50.546101 [DEBUG] worker: submitted plan at index 355103 for evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa
    2018/05/14 23:30:50.546140 [DEBUG] sched: <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: setting status to complete
    2018/05/14 23:30:50.547618 [DEBUG] worker: updated evaluation <Eval "5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">
    2018/05/14 23:30:50.547683 [DEBUG] worker: ack for evaluation 5e2dfa95-ce49-9e4d-621d-d0e900d6c3aa

    2018/05/14 23:30:52.074437 [DEBUG] client: starting task runners for alloc '5d0016ac-dd71-6626-f929-6398e80ef28e'
    2018/05/14 23:30:52.074769 [DEBUG] client: starting task context for 'REDACTED-task' (alloc '5d0016ac-dd71-6626-f929-6398e80ef28e')
2018-05-14T23:30:52.085-0400 [DEBUG] plugin: starting plugin: path=REDACTED/bin/nomad args="[REDACTED/nomad executor {"LogFile":"REDACTED/alloc/5d0016ac-dd71-6626-f929-6398e80ef28e/REDACTED-task/executor.out","LogLevel":"DEBUG"}]"
    2018/05/14 23:34:32.288406 [INFO] client: task "REDACTED-task" for alloc "5d0016ac-dd71-6626-f929-6398e80ef28e" completed successfully
    2018/05/14 23:34:32.288438 [INFO] client: Not restarting task: REDACTED-task for alloc: 5d0016ac-dd71-6626-f929-6398e80ef28e
    2018/05/14 23:34:32.289213 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
    2018/05/15 01:39:13.888635 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to new allocations and over max (500)
    2018/05/15 01:39:15.389175 [WARN] client: failed to broadcast update to allocation "5d0016ac-dd71-6626-f929-6398e80ef28e"
    2018/05/15 01:39:15.389401 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
    2018/05/15 01:39:15.390656 [DEBUG] client: terminating runner for alloc '5d0016ac-dd71-6626-f929-6398e80ef28e'
    2018/05/15 01:39:15.390714 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
    2018/05/15 04:25:10.541590 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to new allocations and over max (500)
    2018/05/15 04:25:10.541626 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
    2018/05/15 05:46:37.119467 [DEBUG] worker: dequeued evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
    2018/05/15 05:46:37.139904 [DEBUG] sched: <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
    2018/05/15 05:46:37.169051 [INFO] client.gc: marking allocation 5d0016ac-dd71-6626-f929-6398e80ef28e for GC
    2018/05/15 05:46:37.169149 [INFO] client.gc: garbage collecting allocation 5d0016ac-dd71-6626-f929-6398e80ef28e due to forced collection
    2018/05/15 05:46:37.169194 [DEBUG] client.gc: garbage collected "5d0016ac-dd71-6626-f929-6398e80ef28e"
    2018/05/15 05:46:37.177470 [DEBUG] worker: submitted plan at index 373181 for evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
    2018/05/15 05:46:37.177516 [DEBUG] sched: <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">: setting status to complete
    2018/05/15 05:46:37.179391 [DEBUG] worker: updated evaluation <Eval "e15e469e-e4f5-2192-207b-84f6a17fd25f" JobID: "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2" Namespace: "default">
    2018/05/15 05:46:37.179783 [DEBUG] worker: ack for evaluation e15e469e-e4f5-2192-207b-84f6a17fd25f
    2018/05/15 05:46:40.218701 [DEBUG] client: starting task runners for alloc '928b0562-b7ed-a3c7-d989-89519edadee9'
    2018/05/15 05:46:40.218982 [DEBUG] client: starting task context for 'REDACTED-task' (alloc '928b0562-b7ed-a3c7-d989-89519edadee9')
2018-05-15T05:46:40.230-0400 [DEBUG] plugin: starting plugin: path=REDACTED/bin/nomad args="[REDACTED/nomad executor {"LogFile":"REDACTED/alloc/928b0562-b7ed-a3c7-d989-89519edadee9/REDACTED-task/executor.out","LogLevel":"DEBUG"}]"
    2018/05/15 11:50:17.836313 [INFO] client: task "REDACTED-task" for alloc "928b0562-b7ed-a3c7-d989-89519edadee9" completed successfully
    2018/05/15 11:50:17.836336 [INFO] client: Not restarting task: REDACTED-task for alloc: 928b0562-b7ed-a3c7-d989-89519edadee9
    2018/05/15 11:50:17.836698 [INFO] client.gc: marking allocation 928b0562-b7ed-a3c7-d989-89519edadee9 for GC

Job file (if appropriate)

{
    "Job": {
        "AllAtOnce": false,
        "Constraints": [
            {
                "LTarget": "${node.unique.id}",
                "Operand": "=",
                "RTarget": "52c7e5be-a5a0-3a34-1051-5209a91a0197"
            }
        ],
        "CreateIndex": 393646,
        "Datacenters": [
            "dc1"
        ],
        "ID": "REDACTED-9d565598-63f2-4c2d-b506-5ae64d4397a2",
        "JobModifyIndex": 393646,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 393673,
        "Name": "REDACTED",
        "Namespace": "default",
        "ParameterizedJob": null,
        "ParentID": "REDACTED/dispatch-1526033570-3cdd72d9",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "global",
        "Reschedule": null,
        "Stable": false,
        "Status": "dead",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1526403442162340993,
        "TaskGroups": [
            {
                "Constraints": [
                    {
                        "LTarget": "${attr.os.signals}",
                        "Operand": "set_contains",
                        "RTarget": "SIGTERM"
                    }
                ],
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": null,
                "Name": "REDACTED",
                "ReschedulePolicy": {
                    "Attempts": 1,
                    "Delay": 5000000000,
                    "DelayFunction": "constant",
                    "Interval": 86400000000000,
                    "MaxDelay": 0,
                    "Unlimited": false
                },
                "RestartPolicy": {
                    "Attempts": 1,
                    "Delay": 15000000000,
                    "Interval": 86400000000000,
                    "Mode": "fail"
                },
                "Tasks": [
                    {
                        "Artifacts": null,
                        "Config": {
                            "command": "REDACTED",
                            "args": [REDACTED]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "raw_exec",
                        "Env": {REDACTED},
                        "KillSignal": "SIGTERM",
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "REDACTED",
                        "Resources": {
                            "CPU": 100,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 256,
                            "Networks": null
                        },
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null
                    }
                ],
                "Update": null
            }
        ],
        "Type": "batch",
        "Update": {
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 0,
            "MinHealthyTime": 0,
            "Stagger": 0
        },
        "VaultToken": "",
        "Version": 0
    }
}

What I can tell you for sure is that the allocation ran to completion and exited successfully.

We're going to try turning off the reschedule and restart policies to see if that has any effect since we're taking care of re-running these on any sort of job failure anyway.

Lamboona commented 6 years ago

We have also been seeing this behaviour since the reschedule policies were introduced in 0.8. We use dispatch to schedule parameterized jobs, and we noticed the same behaviour: at some point towards the end of the GC cycle, it decides that the job never ran and runs it again with a new allocation.

We currently have a partial workaround, but would really appreciate a fix.

We run 3 servers tied into an ASG of clients, if that helps at all.

chelseakomlo commented 6 years ago

Thanks for reporting this issue- we will investigate this further on our end.

nugend commented 6 years ago

No effect. We will be rolling back to a previous version of Nomad.

@Lamboona Could you let me know what your workaround was?

Lamboona commented 6 years ago

@nugend We run a lot of ephemeral jobs with small units of work, so we just discounted the work that was done out of order. It only partially works, but the extra work it causes is negligible enough that we can live with it, for now.

The benefit we get from having the reschedule policy in the first place far outweighs the cost of the extra work.

dadgar commented 6 years ago

@nugend or @Lamboona Can you show the allocation output from the API of the original and retried allocations, and the evaluations attached to the job?

nugend commented 6 years ago

@dadgar At what point? The original allocation is garbage collected when the logs say it's garbage collected. Do you mean I need to guess which allocation is going to experience this issue and capture it?

Also, we're not running this version right now. I can request that we deploy this to our QA environment again if this is vital to your debugging efforts, but I'm disinclined to return to the affected version.

preetapan commented 6 years ago

@nugend - I've been looking at this and trying to reproduce. If you have debug logging turned on and the garbage collector collected this job, you should see log lines like the following if the server tried to force garbage collect the completed allocation:

    2018/05/21 23:09:49.869355 [DEBUG] sched.core: eval GC: scanning before index 27 (5m0s)
    2018/05/21 23:09:49.869680 [DEBUG] worker: ack for evaluation ed3bc611-108d-d651-3ee5-5f747e9ca239
    2018/05/21 23:09:49.869941 [DEBUG] worker: dequeued evaluation eb86564b-eddb-38e2-7a38-27d7344632a2
    2018/05/21 23:09:49.870068 [DEBUG] sched.core: job GC: scanning before index 27 (5m0s)
    2018/05/21 23:09:49.870170 [DEBUG] sched.core: job GC: 1 jobs, 1 evaluations, 1 allocs eligible

You haven't included server logs; would you be able to attach server logs from around the time you think it tried to restart?
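If the agents run under systemd (the syslog-style lines later in this thread suggest they do), something like the following would capture that window. This is only a sketch: the unit name nomad is an assumption, and the timestamps are placeholders taken from the logs above, so adjust both to your setup.

# Pull server-side logs around the window in which the re-run happened
# (assumes a systemd unit named "nomad" and debug logging already enabled
# in the agent configuration).
journalctl -u nomad \
  --since "2018-05-15 05:40:00" \
  --until "2018-05-15 05:50:00" > nomad-server-rerun.log

# Look for the GC / scheduler lines mentioned above.
grep -E 'sched.core|job GC|eval GC' nomad-server-rerun.log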

Another thing I noticed from the job file pasted is that the ID contains dispatch-, so it looks like a parameterized job. Could you confirm that? I tried to create and dispatch a parameterized job with nomad job dispatch, and after it garbage collected, I ran nomad job status on it and saw output like the following. It did not create any additional allocations until nomad job dispatch was run again.

$nomad job status dispatch_test
ID            = dispatch_test
Name          = dispatch_test
Submit Date   = 2018-05-21T22:51:56-05:00
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = true

Parameterized Job
Payload           = required
Required Metadata = test_key
Optional Metadata = <none>

Parameterized Job Summary
Pending  Running  Dead
0        0        2

No dispatched instances of parameterized job found

And we'd still like the info @dadgar requested - you don't have to guess which allocs will get restarted. You should be able to capture the alloc status output when you first dispatch the job (before the 24 hour window that it uses to garbage collect), and the output of nomad job status -evals REDACTED/dispatch-***. If you can do that before, and after you see it restart with another new allocation, that would help narrow down what's going on. In other words, the steps/API output to collect are:

1) nomad job status -evals redacted-*** right after you first dispatch the job
2) nomad eval status -json of any eval IDs returned in step 1
3) nomad alloc status -json of any allocation IDs returned in step 2
4) steps 1-3 again after you see the problem described here, where new allocations were created for a job that had already completed; a rough shell sketch of these steps follows below
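Something like the following, where the IDs are placeholders; substitute the real job/eval/alloc IDs from step 1:

# 1) right after dispatching the job
nomad job status -evals "REDACTED/dispatch-***" | tee job-status-before.txt

# 2) and 3) for each eval and alloc ID shown in the output of step 1
nomad eval status -json <eval-id>   > eval-before.json
nomad alloc status -json <alloc-id> > alloc-before.json

# 4) repeat the same commands after the unexpected re-run shows up,
#    writing to *-after.* files, and attach all of the output here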

Thanks for the help in narrowing this down; I realize this is a lot to ask, but details on the above are greatly appreciated.

nugend commented 6 years ago

@preetapan We were running three nodes in combined server/client mode. I may have failed to configure logging correctly to produce logs from both the server and the client, or I may simply not have included those specific log lines. I'll check to see if we still have those particular log statements in this case.

The job itself was not parameterized, but its parent was. The parent was not re-executed at any point (it runs for a very long time since those jobs act as dynamic schedulers of sub-tasks which they monitor for work conditions).

I'll see if we can automate that collection. The problem is that these one-off batch jobs are created dynamically as the children of the aforementioned parameterized job (or rather, of the many parameterized jobs we've engineered to work in this way). That is why I'd have to speculatively collect them.

At any rate, it may be a bit before I can justify working on this.

preetapan commented 6 years ago

The job itself was not parameterized, but its parent was.

Yes that is expected - each child job has to be invoked explicitly via nomad job dispatch though. When trying to repro, I only see one allocation per nomad job dispatch of the parent, and then it eventually garbage collects those.

Can you share the parent job's spec while we wait for more info about the job/allocations?

preetapan commented 6 years ago

Sharing job spec of job I used to try to repro:

 $ cat parameterized.nomad
job "dispatch_test" {
  datacenters = ["dc1"]

  type = "batch"

  parameterized {
    payload       = "required"
    meta_required = ["test_key"]
  }

  group "t1" {

    task "t1" {
      driver = "raw_exec"

      config {
        command = "echo"

        # When dispatched, the payload is written to a file that is then read by
        # the created task upon startup
        args = ["world"]
      }

      dispatch_payload {
        file = "config.json"
      }
    }
  }
}
 $ cat config.json
{
"hey":"ho"
}

$ nomad job dispatch -meta test_key=foo dispatch_test config.json

nugend commented 6 years ago

No, you are misunderstanding: the job which is rescheduled is not a parameterized job. It is a one-off batch job that has been customized to run a particular batch load. The parent of that one-off job is a dispatched instance of a parameterized job. I manually specified the dispatched instance as the parent when creating the one-off batch job.

endocrimes commented 6 years ago

FWIW - we've seen some similar issues with one-shot batch jobs being rescheduled during our 0.8.3 testing. I hadn't made the connection to GC yet, but I can take a look for that in the logs tomorrow.

preetapan commented 6 years ago

@DanToml thanks, that would help. Having the output of v1/allocation/<id> of one of these batch jobs that were rescheduled would be useful as well.

@nugend I understand now, thanks for clarifying.

endocrimes commented 6 years ago

@preetapan Sadly, we had to put our 0.8.3 testing on hold for a couple of weeks after running into this, but I'll see if I can get a repro case or do more digging early next week.

preetapan commented 6 years ago

@Lamboona will you be able to provide any debug info here? I need the output of v1/allocation/<id> for both the completed and the rescheduled allocation. Unfortunately, without that, it is not possible to piece together exactly what happened, since I can't reproduce this issue with any test batch jobs I've written.

dadgar commented 6 years ago

To add to what Preetha has asked for, what would be ideal is the following:

nugend commented 6 years ago

I don't know about the other users, but in our configuration, Nomad is running in both client and server mode, so if there's a way to get the allocation information from the server and not the client, I would love to know how.

dadgar commented 6 years ago

@nugend The endpoint I gave would work in that mode: /v1/allocation/<alloc_id>
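For example, assuming the default HTTP address of localhost:4646 (add an X-Nomad-Token header if ACLs are enabled):

# Fetch the allocation record from any agent, server or client mode.
curl -s http://localhost:4646/v1/allocation/<alloc_id> > alloc.json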

Lamboona commented 6 years ago

@preetapan We run the GC more frequently, so I will have to make some environment changes to actually be able to inspect both of the allocations at once; we only discovered them because of the effect that they were having. If I can get a chance tomorrow, I will spin up an environment with debug and the normal GC settings to try and get some of these stats.

Lamboona commented 6 years ago

Here are some debug logs that I have managed to scrape together; this is the original allocation. I have copied them out of Kibana, so they may not be perfect. These are just the logs pertaining to the allocation, as there are too many surrounding entries for me to include more.

May 25 12:10:14 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:10:14 [INFO] (runner) rendered "(dynamic)" => "/mnt/nomad/alloc/d5df6b11-b221-a600-3050-ddd238616736/tools-graph-insertion/local/tools.env"
May 25 12:10:53 ip-10-160-67-191 nomad[9620]:     2018-05-25T12:10:53.209Z [DEBUG] plugin: starting plugin: path=/usr/local/bin/nomad args="[/usr/local/bin/nomad executor {"LogFile":"/mnt/nomad/alloc/d5df6b11-b221-a600-3050-ddd238616736/tools-graph-insertion/executor.out","LogLevel":"INFO"}]"
May 25 12:11:33 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:33.218924 [INFO] client: task "tools-graph-insertion" for alloc "d5df6b11-b221-a600-3050-ddd238616736" completed successfully
May 25 12:11:33 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:33.218950 [INFO] client: Not restarting task: tools-graph-insertion for alloc: d5df6b11-b221-a600-3050-ddd238616736
May 25 12:11:33 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:33.237517 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:11:33 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:33.237517 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:11:34 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:34.553372 [INFO] client.gc: garbage collecting allocation d5df6b11-b221-a600-3050-ddd238616736 due to number of allocations (53) is over the limit (50)
May 25 12:11:36 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:36.065482 [WARN] client: failed to broadcast update to allocation "d5df6b11-b221-a600-3050-ddd238616736"
May 25 12:11:36 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:36.089629 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:11:37 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:11:37.634305 [INFO] client.gc: garbage collecting allocation d5df6b11-b221-a600-3050-ddd238616736 due to number of allocations (51) is over the limit (50)
May 25 12:15:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:15:09.806199 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:15:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:15:09.811470 [INFO] client.gc: garbage collecting allocation d5df6b11-b221-a600-3050-ddd238616736 due to forced collection
May 25 12:20:24 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:20:24.224329 [INFO] client.gc: marking allocation d5df6b11-b221-a600-3050-ddd238616736 for GC
May 25 12:20:25 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:20:25.101524 [INFO] client.gc: garbage collecting allocation d5df6b11-b221-a600-3050-ddd238616736 due to number of allocations (52) is over the limit (50)

The following allocation 0350af5e-4203-7976-f90f-cfddae3b573f was running again by 12.12.11

May 25 12:10:10 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:10:10 [INFO] (runner) rendered "(dynamic)" => "/mnt/nomad/alloc/0350af5e-4203-7976-f90f-cfddae3b573f/tools-graph-insertion/local/tools.env"
May 25 12:10:53 ip-10-160-67-191 nomad[9620]:     2018-05-25T12:10:53.211Z [DEBUG] plugin: starting plugin: path=/usr/local/bin/nomad args="[/usr/local/bin/nomad executor {"LogFile":"/mnt/nomad/alloc/0350af5e-4203-7976-f90f-cfddae3b573f/tools-graph-insertion/executor.out","LogLevel":"INFO"}]"
May 25 12:12:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:12:09.341873 [INFO] client: task "tools-graph-insertion" for alloc "0350af5e-4203-7976-f90f-cfddae3b573f" completed successfully
May 25 12:12:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:12:09.341901 [INFO] client: Not restarting task: tools-graph-insertion for alloc: 0350af5e-4203-7976-f90f-cfddae3b573f
May 25 12:12:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:12:09.361617 [INFO] client.gc: marking allocation 0350af5e-4203-7976-f90f-cfddae3b573f for GC
May 25 12:15:09 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:15:09.811342 [INFO] client.gc: garbage collecting allocation 0350af5e-4203-7976-f90f-cfddae3b573f due to forced collection
May 25 12:15:17 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:15:17.372589 [WARN] client: failed to broadcast update to allocation "0350af5e-4203-7976-f90f-cfddae3b573f"
May 25 12:20:23 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:20:23.110146 [INFO] client.gc: marking allocation 0350af5e-4203-7976-f90f-cfddae3b573f for GC
                                                  2018-05-25T12:20:23.116Z [DEBUG] plugin.nomad: plugin address: timestamp=2018-05-25T12:20:23.114Z address=/tmp/plugin814132855 network=unix
May 25 12:20:24 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:20:24.758808 [INFO] client.gc: garbage collecting allocation 0350af5e-4203-7976-f90f-cfddae3b573f due to number of allocations (52) is over the limit (50)
May 25 12:51:18 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:51:18.535147 [INFO] client.gc: marking allocation 0350af5e-4203-7976-f90f-cfddae3b573f for GC
May 25 12:51:18 ip-10-160-67-191 nomad[9620]:     2018/05/25 12:51:18.540148 [INFO] client.gc: garbage collecting allocation 0350af5e-4203-7976-f90f-cfddae3b573f due to forced collection

I haven't been able to get any of the API responses for these out yet; having debugging turned on seems to cause the servers to bottleneck, and then they don't re-elect a leader. I will try forcing this again but with debugging turned off.

One thing I have noticed looking at these logs is that the runner log messages seem to have occurred for both of the allocations before either of the tasks actually ran. For a little more information, the job itself would have been blocked for some time before it was able to be allocated. The jobs are posted to the cluster using /v1/job/:job_id/dispatch.

TBC

Lamboona commented 6 years ago

@preetapan I have managed to get some of the information out of the API.

It would seem that our problem doesn't actually stem from the garbage collection but happens far earlier in the process, as the batch scheduler seems to be scheduling a single dispatched job multiple times. I can attach the alloc/eval output if you want, or should I open a separate issue? I feel anything else I add might detract from the issue that others are having.

preetapan commented 6 years ago

@Lamboona noted about the GC logs. I am pretty sure that's a red herring. There has to be another eval for a previously completed batch job to reschedule. So the details of the evals and allocs of the dispatched job would help here.

Lamboona commented 6 years ago

@preetapan That's the problem: the repeated jobs we are seeing have separate job IDs. Previously, with the GC being so frequent, we hadn't actually seen that this was the case, which is why I think this might be a separate issue.

If you want me to post the output from the allocs/evals, they aren't exactly small; do you want them pasted directly?

preetapan commented 6 years ago

That's the problem: the repeated jobs we are seeing have separate job IDs. Previously, with the GC being so frequent, we hadn't actually seen that this was the case.

That's interesting that they have different job IDs; that usually means that something else external is calling nomad job dispatch on the parameterized job.

You can add them as attachments here rather than posting them directly as comments if they are huge. Please attach a job spec as well.

I can go ahead and create another issue as necessary but right now this ticket is fine.

Lamboona commented 6 years ago

allocation_output.zip

I wasn't able to attach just JSON files, so I have zipped them together. This shows two separate allocations that go through separate evaluations. The jobs themselves would have been created by a single POST to the HTTP API dispatch endpoint; the TOOL_RUN_ID is the way I can link them together, as these are generated at dispatch as UUIDv4s.

preetapan commented 6 years ago

@Lamboona Having looked through all the allocs/evals, I can rule out that this has anything to do with rescheduling. All the evals have a TriggeredBy reason of "job-register", so the second set of dispatched jobs and the allocations for them were not triggered by the scheduler.

One thing I noticed from the SubmitTime field of the jobs in the two allocs was that they were close to 3 seconds apart.

$grep SubmitTime alloc2.json 
        "SubmitTime": 1527261125866757635,
$grep SubmitTime alloc1.json 
        "SubmitTime": 1527261128492329485,

That tells me that those two job IDs were created by two different calls to the job dispatch server endpoint, which in turn are triggered by calls to the HTTP dispatch endpoint.

Do you use any HTTP client libraries when wrapping calls to Nomad? Perhaps that layer is doing a retry.

Lamboona commented 6 years ago

@preetapan No, we don't use any client libraries, specifically for this reason. The only route that can cause a job to be run a second time is if we receive a non-200 response code, in which case this ID could possibly be reused. Is there any possibility of a job being created with a non-200 response from the HTTP API?

preetapan commented 6 years ago

@Lamboona sorry for the late response. You might get a 500 response if there was an unspecified server-side error, but in that case the job would not have been dispatched at all. You can also get permission errors, but those are safe to retry. I will have to investigate this some more to see if there's some other edge case we missed.

Do you log anything on your end before you call the Nomad dispatch endpoint? Adding some logging there with the job ID and the TOOL_RUN_ID, and recording the HTTP response code from Nomad, might give you more insight into how the second run is being triggered.
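A hypothetical sketch of such a wrapper, using curl against the dispatch endpoint mentioned earlier; the job ID, file names, and request body are placeholders for whatever you already send today:

JOB_ID="REDACTED"            # the parameterized job being dispatched
TOOL_RUN_ID="$(uuidgen)"     # the per-run ID you already generate

# Dispatch and capture both the response body and the HTTP status code.
# dispatch-body.json: the same request body you send today (which carries TOOL_RUN_ID).
STATUS=$(curl -s -o /tmp/dispatch-resp.json -w '%{http_code}' \
  -X POST "http://localhost:4646/v1/job/${JOB_ID}/dispatch" \
  -d @dispatch-body.json)

# Record job ID, TOOL_RUN_ID, and status so a duplicate dispatch shows up later.
echo "$(date -u +%FT%TZ) job=${JOB_ID} tool_run_id=${TOOL_RUN_ID} status=${STATUS}" >> dispatch-audit.log
cat /tmp/dispatch-resp.json >> dispatch-audit.log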

camerondavison commented 6 years ago

I saw this happen where the eval status said that it was triggered by job-deregister.

➜  Code nomad eval status ea4780f0
ID                 = ea4780f0
Status             = complete
Status Description = complete
Type               = batch
TriggeredBy        = job-deregister
Job ID             = ___-pipeline-job/dispatch-1529524791-5ec718f5
Priority           = 50
Placement Failures = false

When I run nomad alloc status, the job itself has been garbage collected, but the alloc is still running.

Killing the node killed the alloc, and it did not get retried.

preetapan commented 6 years ago

@camerondavison it's probably too late this time, but next time, if you happen to see this, can you capture the JSON output for v1/allocation/<id> and v1/evaluation/<id> as well?

wyattanderson commented 5 years ago

I've been noticing this more and more recently as well, and unfortunately I usually only notice it well after the allocations have been garbage collected, which happens extremely frequently in our environment due to our job volume. When I've seen it, I've seen the exact same dispatched job ID in the new allocation (REDACTED/dispatch-1534540234-a09077f0/periodic-1536645600). The parameterized part of the job is optional, and we don't have any other systems that would have redispatched the job other than the periodic scheduler. The retries usually occur several hours apart, and the first allocation has always succeeded.

Here's what I think are the relevant bits of the job configuration; I can add more detail if necessary later on. We don't currently have a reschedule stanza on this job.

job "REDACTED" {
    type = "batch"

    periodic {
        cron = "0 2 * * * *"
        time_zone = "America/New_York"
    }

    group "REDACTED" {
        restart {
            attempts = 1
            mode = "fail"
        }
    }
}

I'll see what I can do to collect allocation and evaluation logs when the job completes, but I figured I'd jump on this issue in the meantime.

wyattanderson commented 5 years ago

A hunch: our servers (for historical reasons) have their time zone set to EST5EDT, which is UTC -4. I haven't confirmed this yet, but I've noticed that for a few duplicate batch jobs we've seen, they've come in almost exactly 4 hours after the job was originally scheduled. Is it possible that there's a faulty comparison somewhere between tz-aware and tz-naive datetimes?

Can other folks maybe chime in with their system clock settings?

shantanugadgil commented 5 years ago

@wyattanderson We run our entire infrastructure on UTC time (though things could be in any timezone)

Would it be possible to switch your setups to UTC and see if the problem goes away (or occurs anyway)?
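On systemd-based hosts, one way to do that (run on every server and client, then restart the Nomad agents) would be something like:

timedatectl set-timezone UTC
timedatectl status    # confirm the host now reports UTC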

notnoopci commented 5 years ago

In a test environment, I was able to hit this issue 1-5% of the time, depending on various configurations and potentially the position of the moon. I added my notes, reproduction steps, and logs in https://github.com/notnoopci/nomad-0.8.3-job-rerun-bug.

It does seem that the problem is related to GC, and it looks like job deregistering is somehow resulting in new allocations and re-runs. This matches @camerondavison's observation of the eval being triggered by job-deregister. I suspect it's a race between purging the job and allocation from state and evaluating the job-deregister eval.

We are digging into this further and we'll be happy to assist in debugging this or testing any hypothesis/patches, as this is blocking our adoption of Nomad 0.8.

wyattanderson commented 5 years ago

Finally able to capture this.

The redispatched allocation:

{
    "CreateIndex": 77003310, 
    "JobVersion": 0, 
    "ClientStatus": "running", 
    "DesiredTransition": {
        "Migrate": null, 
        "ForceReschedule": null, 
        "Reschedule": null
    }, 
    "DeploymentStatus": null, 
    "ModifyIndex": 77003764, 
    "DesiredDescription": "", 
    "JobID": "REDACTED/dispatch-1538582681-b197aba4", 
    "ClientDescription": "", 
    "TaskGroup": "REDACTED", 
    "CreateTime": 1538599086273802244, 
    "NodeID": "bd770e44-1610-7479-10fe-4d2b8e9c398d", 
    "ModifyTime": 1538599088473645875, 
    "DesiredStatus": "run", 
    "Name": "REDACTED/dispatch-1538582681-b197aba4.REDACTED[0]", 
    "EvalID": "1be2b1c5-0d98-41e0-4846-eb7898185bf9", 
    "FollowupEvalID": "", 
    "ID": "3a5945f4-b606-86c8-be60-e29d7b77a7e8", 
    "RescheduleTracker": null, 
    "TaskStates": {
        "task1": {
            "LastRestart": "0001-01-01T00:00:00Z", 
            "Restarts": 0, 
            "Failed": false, 
            "State": "running", 
            "FinishedAt": "0001-01-01T00:00:00Z", 
            "StartedAt": "2018-10-03T20:38:08.363318471Z", 
            "Events": [
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Task received by client", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {}, 
                    "TaskSignalReason": "", 
                    "Type": "Received", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599087847401470, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "", 
                    "ExitCode": 0
                }, 
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Building Task Directory", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {
                        "message": "Building Task Directory"
                    }, 
                    "TaskSignalReason": "", 
                    "Type": "Task Setup", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599087847468088, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "Building Task Directory", 
                    "ExitCode": 0
                }, 
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Task started by client", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {}, 
                    "TaskSignalReason": "", 
                    "Type": "Started", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599088351381536, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "", 
                    "ExitCode": 0
                }
            ]
        }, 
        "task2": {
            "LastRestart": "0001-01-01T00:00:00Z", 
            "Restarts": 0, 
            "Failed": false, 
            "State": "running", 
            "FinishedAt": "0001-01-01T00:00:00Z", 
            "StartedAt": "2018-10-03T20:38:08.316265838Z", 
            "Events": [
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Task received by client", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {}, 
                    "TaskSignalReason": "", 
                    "Type": "Received", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599087847317486, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "", 
                    "ExitCode": 0
                }, 
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Building Task Directory", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {
                        "message": "Building Task Directory"
                    }, 
                    "TaskSignalReason": "", 
                    "Type": "Task Setup", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599087847457864, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "Building Task Directory", 
                    "ExitCode": 0
                }, 
                {
                    "DiskLimit": 0, 
                    "DisplayMessage": "Task started by client", 
                    "TaskSignal": "", 
                    "KillTimeout": 0, 
                    "VaultError": "", 
                    "Details": {}, 
                    "TaskSignalReason": "", 
                    "Type": "Started", 
                    "FailsTask": false, 
                    "DriverError": "", 
                    "KillReason": "", 
                    "GenericSource": "", 
                    "FailedSibling": "", 
                    "ValidationError": "", 
                    "Signal": 0, 
                    "KillError": "", 
                    "DownloadError": "", 
                    "Time": 1538599088304795991, 
                    "RestartReason": "", 
                    "DriverMessage": "", 
                    "StartDelay": 0, 
                    "SetupError": "", 
                    "Message": "", 
                    "ExitCode": 0
                }
            ]
        }
    }
}

and the evaluation:

{
    "PreviousEval": "", 
    "Namespace": "default", 
    "ModifyIndex": 77003323, 
    "JobID": "REDACTED/dispatch-1538582681-b197aba4", 
    "Priority": 30, 
    "WaitUntil": "0001-01-01T00:00:00Z", 
    "CreateIndex": 77003266, 
    "Type": "batch", 
    "TriggeredBy": "job-deregister", 
    "Status": "complete", 
    "AnnotatePlan": false, 
    "QueuedAllocations": {
        "REDACTED": 0
    }, 
    "NextEval": "", 
    "JobModifyIndex": 0, 
    "DeploymentID": "", 
    "ID": "1be2b1c5-0d98-41e0-4846-eb7898185bf9", 
    "EscapedComputedClass": false, 
    "Wait": 0, 
    "ClassEligibility": null, 
    "FailedTGAllocs": null, 
    "StatusDescription": "", 
    "QuotaLimitReached": "", 
    "NodeID": "", 
    "SnapshotIndex": 77003266, 
    "BlockedEval": "", 
    "LeaderACL": "", 
    "NodeModifyIndex": 0
}

notnoopci commented 5 years ago

@wyattanderson I think that matches my observation as well, with job-deregister evals and GC being the trigger.

One workaround is to run a separate GC process that watches for completed Nomad jobs and immediately calls Deregister on them directly, before the core scheduler's job GC processes them.
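A rough sketch of that workaround against the HTTP API; it assumes the default address, jq installed, and that you only want to touch finished batch jobs. Note that dispatched job IDs contain a slash, so verify the URL handling in your environment first.

NOMAD_ADDR="http://localhost:4646"

# Find batch jobs that have finished ("dead") and deregister each one
# before the server-side job GC gets to it.
curl -s "${NOMAD_ADDR}/v1/jobs" \
  | jq -r '.[] | select(.Type == "batch" and .Status == "dead") | .ID' \
  | while read -r job_id; do
      curl -s -X DELETE "${NOMAD_ADDR}/v1/job/${job_id}"
    done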

nugend commented 5 years ago

Has anyone seen any of this behavior cease after upgrading to 0.8.6 or anything?

notnoopci commented 5 years ago

@nugend I was able to reproduce it with the latest Nomad master in my tests. In a simulated environment, two jobs were re-run, as mentioned in https://github.com/notnoopci/nomad-0.8.3-job-rerun-bug/blob/master/run-1538770428/output.txt#L47-L48

nugend commented 5 years ago

Holy moly, I didn't notice how much work you've put into debugging this already!

Since you have a script to reproduce and detect the issue, have you tried running it through git bisect yet? I guess it'd take a while.

I know for a fact you can put the start at 0.7.1 (we're using that in Production and haven't seen the issue).

nugend commented 5 years ago

We've noticed that some jobs which failed, and were explicitly configured not to restart or reschedule, just get scheduled again on 0.7.1.

I can only conclude that this has always been an issue and it's just certain garbage collection policy changes that cause it to be more or less prevalent.

notnoop commented 5 years ago

@nugend Interesting. You are right that it's a GC change; I believe the relevant change is https://github.com/hashicorp/nomad/pull/3982. I have a WIP fix in https://github.com/hashicorp/nomad/pull/4744, but we are researching other ways to mitigate the condition before committing that PR.

I'm not so sure about pre-0.8: in my previous job, prior to upgrading Nomad to 0.8, we saw a job retry rate of ~0.05%, which we attributed to noise and infrastructure-related failures; but the rate jumped to 2-3% when we upgraded to 0.8. It's possible that there is another triggering condition that I missed in the noise.

When possible, can you provide some logs and API results for the re-run jobs and the associated evaluations/allocations, as suggested above? That would help us tremendously.

FYI - I just joined the Nomad team, and I'll be working on this issue very soon again.

notnoop commented 5 years ago

@nugend Thanks for your patience! We have resolved this issue and intend to ship a point release with the fix soon. Would you be able to vet a test build in your environment soon? I'll update the ticket as soon as we cut a release.

nugend commented 5 years ago

I've already built a release from the master branch. We'll be testing it in a few days.

preetapan commented 5 years ago

@nugend master has a lot of churn due to ongoing work on driver plugins, so I would definitely not recommend running a build based on master in production.

camerondavison commented 5 years ago

@notnoop would this problem also manifest itself with batch allocations that fail? I am seeing a batch job run too many times (I set the restart policy to only 1 attempt), and once it failed, it still restarted.

shantanugadgil commented 5 years ago

I set the restart parameter to 0 (zero) for my AMI build jobs which I want to run only once.

camerondavison commented 5 years ago

Sorry, I meant to say that it has restarted at least 4 times and is still in the process of restarting.

nugend commented 5 years ago

Just want to say this does seem to be fixed in 0.8.7, and I would suggest that anyone experiencing it on an earlier build try upgrading.

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.