hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

logmon "Unrecognized remote plugin" errors #6715

Open zonnie opened 4 years ago

zonnie commented 4 years ago

A job task is stuck in Starting state while it seems to be running fine

Nomad version

Nomad v0.9.5-gc-1

Operating system and Environment details

Linux machine-1 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Issue

Looking at the status of c-1, I see the following:

root@machine-1:~# nomad status
c-1                                     service  100       running  2019-11-17T20:45:13Z

This appears to be running according to Nomad

But when I dig deeper I see that, for some reason, the single task that exists is still in Starting:

root@machine-1:~# nomad job status c-1
ID            = c-1
Name          = c-1
Submit Date   = 2019-11-17T20:45:13Z
Type          = service
Priority      = 100
Datacenters   = gc
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
c-1     0       1         0        0       0         0

Allocations
ID        Node ID   Task Group      Version  Desired  Status   Created     Modified
a24ac3a0  b1dd5a5b c-1  0        run      pending  12h55m ago  10h41m ago

The problem is that the container actually seems to be running fine. This completely breaks a process of mine that waits for containers to reach the Running state (using the job/job-id/summary endpoint).
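
For illustration, a minimal Go sketch of that kind of readiness check (assuming a local agent on Nomad's default HTTP port 4646, not the 3636 used above; this is not the actual tooling): it polls job/:job_id/summary and never exits while the group sits in Starting.

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "time"
    )

    // taskGroupSummary mirrors the per-group counters returned by the summary endpoint.
    type taskGroupSummary struct {
        Queued, Starting, Running, Failed, Complete, Lost int
    }

    type jobSummary struct {
        Summary map[string]taskGroupSummary
    }

    func main() {
        for {
            // Default agent address is assumed here; the cluster above listens on 3636.
            resp, err := http.Get("http://127.0.0.1:4646/v1/job/c-1/summary")
            if err != nil {
                fmt.Println("summary request failed:", err)
                time.Sleep(5 * time.Second)
                continue
            }
            var s jobSummary
            err = json.NewDecoder(resp.Body).Decode(&s)
            resp.Body.Close()
            if err == nil && s.Summary["c-1"].Running > 0 {
                fmt.Println("task group c-1 is Running")
                return
            }
            // When the group is stuck in Starting (as reported above), this loop never exits.
            time.Sleep(5 * time.Second)
        }
    }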

Reproduction steps

Job file (if appropriate)

{
    "Stop": false,
    "Region": "moo",
    "Namespace": "default",
    "ID": "c-1",
    "ParentID": "",
    "Name": "c-1",
    "Type": "service",
    "Priority": 100,
    "AllAtOnce": false,
    "Datacenters": [
        "gc"
    ],
    "Constraints": null,
    "Affinities": null,
    "Spreads": null,
    "TaskGroups": [
        {
            "Name": "c-1",
            "Count": 1,
            "Update": null,
            "Migrate": {
                "MaxParallel": 1,
                "HealthCheck": "checks",
                "MinHealthyTime": 10000000000,
                "HealthyDeadline": 300000000000
            },
            "Constraints": [
                {
                    "LTarget": "${meta.manager}",
                    "RTarget": "true",
                    "Operand": "=="
                },
                {
                    "LTarget": "${attr.os.signals}",
                    "RTarget": "SIGINT",
                    "Operand": "set_contains"
                }
            ],
            "RestartPolicy": {
                "Attempts": 5,
                "Interval": 300000000000,
                "Delay": 15000000000,
                "Mode": "fail"
            },
            "Tasks": [
                {
                    "Name": "c-1",
                    "Driver": "docker",
                    "User": "",
                    "Config": {
                        "args": [],
                        "logging": [
                            {
                                "type": "json-file"
                            }
                        ],
                        "memory_soft_limit": true,
                        "image": "172.16.0.1:5000/c-1:v31_2",
                        "hostname": null,
                        "volumes": [
                            "/etc/moo:/etc/moo"
                        ],
                        "privileged": false,
                        "labels": [
                            {
                                "image.name": "c-1"
                            },
                            {
                                "service.name": "c-1"
                            }
                        ],
                        "ulimit": [],
                        "cap_add": [
                            "SYS_PTRACE"
                        ],
                        "network_mode": "host",
                        "entrypoint": null,
                        "command": null
                    },
                    "Env": {
                        "service_name": "c-1"
                    },
                    "Services": [
                        {
                            "Name": "c-1",
                            "PortLabel": "",
                            "AddressMode": "auto",
                            "Tags": [
                                "c-1"
                            ],
                            "CanaryTags": null,
                            "Checks": null
                        }
                    ],
                    "Vault": null,
                    "Templates": null,
                    "Constraints": null,
                    "Affinities": null,
                    "Resources": {
                        "CPU": 20,
                        "MemoryMB": 120,
                        "DiskMB": 0,
                        "IOPS": 0,
                        "Networks": null,
                        "Devices": null
                    },
                    "DispatchPayload": null,
                    "Meta": null,
                    "KillTimeout": 15000000000,
                    "LogConfig": {
                        "MaxFiles": 10,
                        "MaxFileSizeMB": 10
                    },
                    "Artifacts": null,
                    "Leader": false,
                    "ShutdownDelay": 0,
                    "KillSignal": "SIGINT"
                }
            ],
            "EphemeralDisk": {
                "Sticky": false,
                "SizeMB": 300,
                "Migrate": false
            },
            "Meta": null,
            "ReschedulePolicy": {
                "Attempts": 0,
                "Interval": 0,
                "Delay": 10000000000,
                "DelayFunction": "exponential",
                "MaxDelay": 300000000000,
                "Unlimited": true
            },
            "Affinities": null,
            "Spreads": null
        }
    ],
    "Update": {
        "Stagger": 0,
        "MaxParallel": 0,
        "HealthCheck": "",
        "MinHealthyTime": 0,
        "HealthyDeadline": 0,
        "ProgressDeadline": 0,
        "AutoRevert": false,
        "AutoPromote": false,
        "Canary": 0
    },
    "Periodic": null,
    "ParameterizedJob": null,
    "Dispatched": false,
    "Payload": null,
    "Meta": null,
    "VaultToken": "",
    "Status": "running",
    "StatusDescription": "",
    "Stable": false,
    "Version": 0,
    "SubmitTime": 1574023513552310002,
    "CreateIndex": 54,
    "ModifyIndex": 56,
    "JobModifyIndex": 54
}

Nomad Client logs (if appropriate)

/var/lib/nomad/ansible/setup_nomad/setup_nomad_client.yml:66 ------------------
+ exec nomad agent '--config=/etc/nomad/client.conf'
==> Loaded configuration from /etc/nomad/client.conf
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 172.16.0.1:3636
            Bind Addrs: HTTP: 0.0.0.0:3636
                Client: true
             Log Level: DEBUG
                Region: global (DC: gc)
                Server: false
               Version: 0.9.5-gc-1

==> Nomad agent started! Log data will stream in below:

    2019-11-17T22:58:17.605Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/storage/nomad/plugins
    2019-11-17T22:58:17.606Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/storage/nomad/plugins
    2019-11-17T22:58:17.606Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/storage/nomad/plugins
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0

...

    2019-11-17T22:58:21.858Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using client connection initialized from environment: driver=docker @module=docker_logger timestamp=2019-11-17T22:58:21.857Z
    2019-11-17T22:58:21.860Z [WARN ] client.driver_mgr.docker: failed to reattach to docker logger process: driver=docker error="failed to reattach to docker logger process: Reattachment process not found"
    2019-11-17T22:58:21.860Z [DEBUG] client.driver_mgr.docker.docker_logger: starting plugin: driver=docker path=/usr/local/bin/nomad args="[/usr/local/bin/nomad docker_logger]"
    2019-11-17T22:58:21.860Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin started: driver=docker path=/usr/local/bin/nomad pid=199
    2019-11-17T22:58:21.860Z [DEBUG] client.driver_mgr.docker.docker_logger: waiting for RPC address: driver=docker path=/usr/local/bin/nomad
    2019-11-17T22:58:21.870Z [DEBUG] client.driver_mgr.docker.docker_logger: using plugin: driver=docker version=2
    2019-11-17T22:58:21.870Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: plugin address: driver=docker @module=docker_logger address=/tmp/plugin251495446 network=unix timestamp=2019-11-17T22:58:21.870Z
    2019-11-17T22:58:21.871Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using client connection initialized from environment: driver=docker @module=docker_logger timestamp=2019-11-17T22:58:21.870Z
    2019-11-17T22:58:21.873Z [WARN ] client.driver_mgr.docker: failed to reattach to docker logger process: driver=docker error="failed to reattach to docker logger process: Reattachment process not found"
    2019-11-17T22:58:22.477Z [WARN ] client.alloc_runner.task_runner.task_hook: failed to reattach to logmon process: alloc_id=a24ac3a0-4dd0-2112-3f28-00fcda222c24 task=c-1 error="Reattachment process not found"
    2019-11-17T22:58:22.477Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=a24ac3a0-4dd0-2112-3f28-00fcda222c24 task=c-1 path=/usr/local/bin/nomad args="[/usr/local/bin/nomad logmon]"
    2019-11-17T22:58:21.873Z [DEBUG] client.driver_mgr.docker.docker_logger: starting plugin: driver=docker path=/usr/local/bin/nomad args="[/usr/local/bin/nomad docker_logger]"
    2019-11-17T22:58:21.873Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin started: driver=docker path=/usr/local/bin/nomad pid=212
    2019-11-17T22:58:21.873Z [DEBUG] client.driver_mgr.docker.docker_logger: waiting for RPC address: driver=docker path=/usr/local/bin/nomad
    2019-11-17T22:58:21.886Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: plugin address: driver=docker address=/tmp/plugin933158382 network=unix @module=docker_logger timestamp=2019-11-17T22:58:21.886Z
    2019-11-17T22:58:21.886Z [DEBUG] client.driver_mgr.docker.docker_logger: using plugin: driver=docker version=2

Nomad Server logs (if appropriate)

Don't have those

notnoop commented 4 years ago

Thanks @zonnie for reaching out. That's indeed a bug we must address.

For the example above, would you mind posting the output of nomad alloc status <alloc_id> as well? Were there more logs associated with that task? I was expecting log lines like the following:

    2019-11-18T08:40:41.397-0500 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 version=2
    2019-11-18T08:40:41.397-0500 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 @module=logmon address=/var/folders/r6/346cfqyn76b_lx1nrcl5278c0000gp/T/plugin090993310 network=unix timestamp=2019-11-18T08:40:41.397-0500
    2019-11-18T08:40:41.398-0500 [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 @module=logmon path=/tmp/nomad-1/alloc/fba6564e-1076-6bc4-dda8-b124d8c8deea/alloc/logs/.c-1.stdout.fifo timestamp=2019-11-18T08:40:41.398-0500
    2019-11-18T08:40:41.398-0500 [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 @module=logmon path=/tmp/nomad-1/alloc/fba6564e-1076-6bc4-dda8-b124d8c8deea/alloc/logs/.c-1.stderr.fifo timestamp=2019-11-18T08:40:41.398-0500

One issue I note from the logs is that Nomad's child processes seem to have died, possibly killed by Ansible in the process of restarting the task. When spinning up a task, Nomad spawns some auxiliary processes (docker_logger, logmon, executor) that are expected to keep running beyond the main Nomad process. If these processes die while the Nomad client is down, recovery may be hindered. That doesn't explain the bug, I'm afraid, but I suspect it's a contributing cause.
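
As an illustration (a standalone sketch, not Nomad's actual recovery code; the PID and socket path are taken from the logs above as examples), this is roughly why reattachment fails once those helper processes are gone: the persisted reattach state points at a process and unix socket that no longer exist.

    package main

    import (
        "errors"
        "fmt"
        "net"
        "os"
        "syscall"
    )

    // reattach holds the kind of state a client persists for a helper plugin.
    type reattach struct {
        Pid  int
        Addr string
    }

    func canReattach(r reattach) error {
        // Signal 0 probes for process existence without actually signalling it.
        proc, err := os.FindProcess(r.Pid)
        if err != nil {
            return err
        }
        if err := proc.Signal(syscall.Signal(0)); err != nil {
            return errors.New("Reattachment process not found")
        }
        // The plugin's unix socket must also still accept connections.
        conn, err := net.Dial("unix", r.Addr)
        if err != nil {
            return fmt.Errorf("plugin socket gone: %w", err)
        }
        conn.Close()
        return nil
    }

    func main() {
        fmt.Println(canReattach(reattach{Pid: 199, Addr: "/tmp/plugin251495446"}))
    }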

zonnie commented 4 years ago

@notnoop - I will try and get the logs. Can you suggest some way of avoiding this? Also, to my understanding, one of the main causes is that some peripheral processes don't stick around while the client is down? Isn't that similar to a complete VM reboot? Don't you recover from that?

zonnie commented 4 years ago

@notnoop - I was unable to get the logs, and the machine on which this happened no longer exists 😢 Sorry...

zonnie commented 4 years ago

Hi @notnoop - does this suffice?

root@gc-machine1:~# gc-nomad alloc status 88d520b9
ID                  = 88d520b9
Eval ID             = 798b0dc4
Name                = c1.c1[0]
Node ID             = fcf73cfe
Node Name           = nomad-client-172.16.0.1
Job ID              = elasticsearch
Job Version         = 824638472240
Client Status       = pending
Client Description  = No tasks have started
Desired Status      = run
Desired Description = <none>
Created             = 12h27m ago
Modified            = 10h23m ago

Task "c1" is "pending"
Task Resources
CPU        Memory           Disk     Addresses
14/20 MHz  1.3 GiB/1.5 GiB  300 MiB  service_port: 172.16.0.1:9200

Task Events:
Started At     = 2019-11-24T20:33:21Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2019-11-24T22:37:15Z

Recent Events:
Time                  Type              Description
2019-11-24T22:37:15Z  Restarting        Task restarting in 18.375889653s
2019-11-24T22:37:15Z  Task hook failed  logmon: Unrecognized remote plugin message:

This usually means that the plugin is either invalid or simply
needs to be recompiled to support the latest protocol.
2019-11-24T20:33:21Z  Started           Task started by client
2019-11-24T20:33:20Z  Task Setup        Building Task Directory
2019-11-24T20:33:19Z  Received          Task received by client

The job output is: (it's a reproduction of the above in a new environment)

root@machine-1:~# gc-nomad job status c1
ID            = c1
Name          = c1
Submit Date   = 2019-11-24T20:33:19Z
Type          = service
Priority      = 90
Datacenters   = gc
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost
c1  0       1         0        0       0         0

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created     Modified
88d520b9  fcf73cfe  c1  0        run      pending  12h31m ago  10h27m ago

notnoop commented 4 years ago

@zonnie that's very useful indeed - I'm very surprised by the "Unrecognized remote plugin message" event. Let me do some further research and reach out if I have any questions. Thanks a lot.

draoncc commented 4 years ago

I've encountered similar behavior around restarting client nodes. I'm uncertain as to the steps I took to trigger it, as it happened some time during a larger Docker version migration, and I've been unable to reproduce it.

The behavior is similar in that a task seems to be stuck in "starting", while having only allocations that are status "running" with desired "run".

> nomad job status -all-allocs -evals -verbose recommendation-svc-release-7
ID            = recommendation-svc-release-7
Name          = recommendation-svc-release-7
Submit Date   = 2020-04-14T12:00:08+02:00
Type          = service
Priority      = 50
Datacenters   = dev-eu-central-1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
service     0       1         1        0       7         2

Evaluations
ID                                    Priority  Triggered By   Status    Placement Failures
a2d4f3d3-dab9-d959-2446-cb5f92427b6b  50        node-update    complete  false
999a2d88-10b3-7e68-c247-faec85620f84  50        queued-allocs  canceled  false
43e7c710-df17-fa5f-a399-78efd8fe397e  50        node-drain     complete  true
a82f73d9-9b05-f5b6-e1a8-fffd9c964559  50        node-drain     complete  false

Allocations
ID                                    Eval ID                               Node ID                               Node Name            Task Group  Version  Desired  Status    Created                    Modified
4a842a44-5302-0061-736d-9c6a9ba45327  a2d4f3d3-dab9-d959-2446-cb5f92427b6b  cc066f7d-f91c-fc99-ca02-3c2d521e84b5  i-0fec4e4281032614f  service     3        run      running   2020-04-27T11:00:37+02:00  2020-04-27T11:00:59+02:00
391e4ed9-bab1-b6a2-0d7e-87a164b4a382  a82f73d9-9b05-f5b6-e1a8-fffd9c964559  f6423ad0-1de8-509f-3a68-a6beaaff7f54  i-035a6f13756ba563e  service     3        stop     complete  2020-04-24T11:45:57+02:00  2020-04-27T10:58:42+02:00

In the above output, the job has been restarted twice, with one queued alloc failing due to resource exhaustion after I drained some nodes experimentally. The behavior - one "starting" in the task group - began before that and persisted after. The job is also configured to have only one allocation running.

Due to the difference of all my allocations "running" instead of "pending" I'm not sure if my issue is related to the above at all.

Shackelford-Arden commented 3 years ago

We've found that we see this issue particularly on nodes with a high allocation count or on nodes that are otherwise fairly busy. Allocation-count-wise, a teammate of mine was able to replicate the issue when there were more than 24 allocations on a given node. Marking the node as ineligible and stopping the affected allocation so that it moves to a node with fewer allocations has been our way of getting the allocation into a better place.

NicholasABriganti commented 3 years ago

@notnoop Any luck looking into the logmon: Unrecognized remote plugin message error? I have been running into this for the past ~6 weeks as well.

tgross commented 3 years ago

Hi folks! This issue has been open a while, so I wanted to drop by and add some context.

We've seen similar reports about the loggers from customers on Nomad 0.10.x and 0.11.x clusters. The reports we've received are all with the Docker driver, and there's some correlation with both client restarts and the number of allocations running on the client. When we're getting logs for Docker tasks, we're hitting the Docker API. That involves a whole lot of JSON serialization/deserialization, and I've experienced production issues at previous jobs when the Docker API gets hammered by logs, so that's a possible reproduction clue. I don't see any reason to suspect we wouldn't see this problem on the current version of Nomad as well.
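
As a rough illustration of that log path (a stdlib sketch against the Docker Engine HTTP API over its default unix socket; the container ID is a placeholder, and this is not Nomad's docker_logger code):

    package main

    import (
        "context"
        "fmt"
        "io"
        "net"
        "net/http"
        "os"
    )

    func main() {
        // Talk to the Docker Engine API over its unix socket rather than TCP.
        client := &http.Client{
            Transport: &http.Transport{
                DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                    return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
                },
            },
        }

        // GET /containers/{id}/logs streams the container's stdout/stderr;
        // "abc123" is a placeholder container ID.
        resp, err := client.Get("http://docker/containers/abc123/logs?stdout=1&stderr=1&follow=0")
        if err != nil {
            fmt.Fprintln(os.Stderr, "request failed:", err)
            return
        }
        defer resp.Body.Close()

        // On a busy node this work is multiplied across one docker_logger per task,
        // which is why hammering the Docker API with logs is a plausible factor.
        io.Copy(os.Stdout, resp.Body)
    }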

The error is bubbling up from go-plugin/client.go:

    case line := <-linesCh:
        // Trim the line and split by "|" in order to get the parts of
        // the output.
        line = strings.TrimSpace(line)
        parts := strings.SplitN(line, "|", 6)
        if len(parts) < 4 {
            err = fmt.Errorf(
                "Unrecognized remote plugin message: %s\n\n"+
                    "This usually means that the plugin is either invalid or simply\n"+
                    "needs to be recompiled to support the latest protocol.", line)
            return
        }

Our colleague @davemay99 has pointed out that this may be the result of spurious empty lines, but we don't know why we'd be seeing those either. The plugin may be writing something else to stdout, or dying before it gets to write the socket/port info. But the error message isn't all that useful either and could probably be reworded to something like:

Empty remote plugin message received. This usually indicates a bug in the plugin.

That wouldn't help Nomad users much, but it would make it much clearer to us what could be going wrong.
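
For context, the handshake line go-plugin expects on the child's stdout is pipe-delimited (core version, app protocol version, network type, address, protocol). The standalone sketch below reuses the same SplitN check quoted above against a well-formed line and an empty one, showing how a blank line produces exactly this error:

    package main

    import (
        "fmt"
        "strings"
    )

    // parse mimics the handshake check above: a valid line has at least four
    // pipe-delimited fields (core version | app version | network | address | protocol).
    func parse(line string) error {
        line = strings.TrimSpace(line)
        parts := strings.SplitN(line, "|", 6)
        if len(parts) < 4 {
            return fmt.Errorf("Unrecognized remote plugin message: %s", line)
        }
        return nil
    }

    func main() {
        // A healthy logmon handshake looks roughly like this (example values).
        fmt.Println(parse("1|2|unix|/tmp/plugin251495446|grpc"))
        // A blank line - a crash before the handshake, or a stray write to stdout -
        // produces exactly the error reported in this issue.
        fmt.Println(parse(""))
    }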

I'm going to rename and re-tag this issue a bit so that we have a different focus on it. If folks who've come by have stuck "starting" tasks without seeing the logger errors, let's open up a new issue for that.

hilayc commented 2 years ago

@tgross Hi, I would like to add to your last message. I have encountered this issue too with the raw_exec driver on a Windows server with Nomad 1.0.8.

After about 50-60 allocations, we receive this error message and no more allocations can be scheduled on that specific client. You mentioned that this could be linked to the number of allocations running on the client, so I tried manually removing some allocations, and the failed ones eventually managed to run. When I brought back the allocations I had stopped, they failed with the same error message. So I'm able to confirm that the number of allocations is the root cause of this error.

Please let me know if and how I can help you to investigate this.

tgross commented 2 years ago

Thanks @Nickleby9. Learning that it's not Docker-specific is actually a huge help because that eliminates a broad set of possible sources.

hilayc commented 2 years ago

@tgross Any news on this one? We are experiencing it in production environments too often. As a workaround, we use the 'reschedule' stanza so that a failed task is tried again on another server. Please keep us updated. Thanks

tgross commented 2 years ago

Hi @Nickleby9! Sorry, I don't have any updates on this issue. Let me brainstorm a bit with some of my colleagues on this and see what I can come back with.

If you have debug-level Nomad client logs for the clients that are impacted, that might help us out.

hilayc commented 2 years ago

@tgross I sent a debug-level monitor capture from a client that is experiencing this issue to the email address on your GitHub profile.

tgross commented 2 years ago

Thanks @Nickleby9. For future reference, there's a "one-way" mailing list, nomad-oss-debug@hashicorp.com, that can receive logs so that they go to HashiCorp engineers only, without going through our personal email addresses.

Looking at one of the log entries:

2022-05-09T08:34:57.221+0300 [ERROR] client.alloc_runner.task_runner.task_hook: failed to launch logmon process: alloc_id=aebc3479-04d6-6c3b-0246-1f1c6934cc30 task=redacted-name error= Unrecognized remote plugin message:
This usually means that the plugin is either invalid or simply
needs to be recompiled to support the latest protocol.

This definitely looks like we're getting a spurious empty line from go-plugin. Recent versions of Nomad have kept up pretty well with the library version. Which version of Nomad are you on, @Nickleby9? That might help narrow things down.

hilayc commented 2 years ago

@tgross Thank you, I didn't know that, sorry for using your personal email. I'm using 1.2.6.

tgross commented 2 years ago

Ok so for Nomad 1.2.6 we're using go-plugin 1.4.3, and the error is coming from client.go#L688-L699. After a little more digging thru the go-plugin code I realized that this is the section of code that's parsing the go-plugin handshake.

So instead of getting a handshake from logmon, we're getting a blank line. Interestingly I see a closed issue https://github.com/hashicorp/go-plugin/issues/164 that dismisses this kind of thing as obviously a plugin author error. On Linux clients I could probably give you a bpftrace program that could find out where the spurious write to stdout is coming from. But this at least gives me a clue where to look next. I'll report back once I know more.

tgross commented 2 years ago

Ok, I wasn't able to find anywhere in either the Nomad logmon "server" code or the go-plugin code where we could be emitting a spurious unprintable character. So then I started digging into how we set up the logmon process and noticed that we have a different fifo pipe being set up for plugin communication on Windows (ref logmon_hook.go#L63-L70). Windows requires a special file path for named pipes like this.

Given that you're seeing the problem only once there are a large number of allocations running, I was worried we might have collisions between tasks when creating those files, but we could create ~100M of them before hitting the birthday paradox, so unless your cluster is very busy 😀 I think we're ok there. But as it turns out, go-plugin doesn't actually use those pipes.

If we take a look at the go-plugin code, we see that on Windows we use a TCP listener (ref server.go#L494-L497). To set up that listener, we look for the environment variables PLUGIN_MIN_PORT and PLUGIN_MAX_PORT (ref server.go#L503-L504), and set them to 0 if unset.

But if they're both set to 0, the loop where we could potentially retry (ref server.go#L533-L539) ends up only trying one port, the weird "port 0" that tells the OS to provision a dynamic port. So far I haven't been able to uncover how Windows handles this specifically, other than that it's "mostly similar" to Unix (helpful!).
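
As a quick illustration of that "port 0" behavior (a stdlib sketch, not the go-plugin listener code):

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // Requesting port 0 asks the OS for an ephemeral port, which is what
        // happens when PLUGIN_MIN_PORT/PLUGIN_MAX_PORT are unset (both default to 0).
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            // If this fails (e.g. the ephemeral range is exhausted), the plugin
            // never gets to print its handshake address.
            fmt.Println("listen failed:", err)
            return
        }
        defer ln.Close()
        fmt.Println("OS assigned:", ln.Addr()) // e.g. 127.0.0.1:49762
    }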

I don't yet have a firm mechanism in mind as to what could be going wrong, but there are two tests that might give us more data:

notnoop commented 2 years ago

@tgross It has been a while since I looked into this issue. I had a hunch that the plugin crashes, such that we either get corrupted output or stdout closes before the port info is emitted, which Nomad interprets as an empty line. I wasn't able to reproduce it, so no proof.

Netlims commented 2 years ago

@tgross Ok, I still had this issue this morning, so first I saved the output of netstat -ab -p tcp to a file. Then I set PLUGIN_MIN_PORT to 40000 and PLUGIN_MAX_PORT to 50000 (we use the default values for max_dynamic_port). I restarted Nomad so it could read the new variables, and after the restart I no longer had this issue. That looked like a good sign at first, but then I deleted these variables and restarted Nomad again, and the issue was still gone. I ran the netstat command again and saved its output to a file; I sent both files to the nomad-oss-debug email address.

I have increased the min_heartbeat_ttl to 30s so restarting Nomad won't affect any running jobs and will give us a better understanding of the influence of the variables.

I will keep you updated once the issue happens again, and I will perform more tests.

tgross commented 2 years ago

Thanks for those logs @Netlims! Ok, there's a lot of ports in use on this machine, so I did a little data cleaning and then slurped the whole thing into sqlite for analysis.

In both logs, there are a large number of Nomad ports with a client named kubernetes (?! :grinning:), all in a connected state, but another client named SAASQABE02 is entirely in the listening state (waiting to establish a connection). All the connections from SAASQABE02 are from port 0, so that's the client itself?

select "Foreign Address", "State", count(*) from netstat where "Local Address" == "127.0.0.1" and "Application" == "[nomad.exe]" group by "Foreign Address", "State";
SAASQABE02,LISTENING,107
kubernetes,ESTABLISHED,214

select "Foreign Address", "Foreign Port", count(*) from netstat where "Foreign Address" == "SAASQABE02";
SAASQABE02,0,170

All the ports where both applications are bound are within 10000-10059 and 14000-14046, and the high ports in the 49000+ range have only the kubernetes as the client. The counts are roughly similar when the problem isn't happening, including the port ranges:

select "Foreign Address", "State", count(*) from netstat where "Local Address" == "127.0.0.1" and "Application" == "[nomad.exe]" group by "Foreign Address", "State";
SAASQABE02,LISTENING,121
kubernetes,CLOSE_WAIT,3
kubernetes,ESTABLISHED,244

There are a few "blocks" of port ranges being used with very few gaps, but there's plenty of room between those blocks, so that suggests a port collision is unlikely.

(As an aside, it looks like you're running Vault and Consul on the same host. They're both bound to the private IP and not localhost. If you're running Vault agent or Consul client on your Nomad clients, you might want to bind to localhost instead. Depending on your environment, we probably wouldn't recommend running the Vault/Consul server on the same host as a Nomad client either, although I suppose you could be running them as Nomad jobs?)

> I will keep you updated once the issue happens again and I will perform more tests.

If you were to watch/capture Process Monitor process activity, it might help verify whether the logmon process is exiting with a known exit code that never makes its way back to the Nomad agent.

hilayc commented 2 years ago

@tgross Indeed, SAASQABE02 is the client itself, where the logs were taken from. Thank you for the tip regarding Consul and Vault. The issue hasn't reappeared yet; I'll capture Nomad and logmon with Process Monitor once it happens again.

By the way, kubernetes is not something we are using. I thought it was some internal naming done by Nomad; if you are not familiar with it, I have no idea what it is :S

hilayc commented 2 years ago

@tgross We have this issue again. I sent a CSV file with the output of Process Monitor to the nomad-oss-debug email address. Thanks

tgross commented 2 years ago

Thanks @Nickleby9! The smoking gun is here:

08:16.7 nomad.exe 26400 Process Create C:\Windows\system32\conhost.exe SUCCESS "PID: 28964, Command line: \??\C:\Windows\system32\conhost.exe 0xffffffff -ForceV1"
08:16.7 nomad.exe 26400 CloseFile C:\Windows\System32\conhost.exe SUCCESS
08:16.7 nomad.exe 26400 Thread Exit SUCCESS "Thread ID: 34664, User Time: 0.0000000, Kernel Time: 0.0156250"
08:16.7 nomad.exe 26400 Process Exit SUCCESS "Exit Status: -1073741502, User Time: 0.0000000 seconds, Kernel Time: 0.0156250 seconds, Private Bytes: 2,756,608, Peak Private Bytes: 2,863,104, Working Set: 1,032,192, Peak Working Set: 1,335,296"

Logmon is trying to spawn a conhost process, and I want to dig into why (I think there's some library operation doing that). But according to the NTSTATUS Values docs, exit code -1073741502 (or 0xC0000142 in hex) is STATUS_DLL_INIT_FAILED:

{DLL Initialization Failed} Initialization of the dynamic link library %hs failed. The process is terminating abnormally.

Which honestly doesn't make a whole lot of sense to me: if there were a linking error in the build, I'd assume it would never work, instead of working for a while and then failing. Good ol' Stackoverflow has this question, but that also reads to me as the sort of problem I'd expect to see all the time, not only intermittently as we're seeing it here. I found a truly ancient Windows 2008 hotfix that fixes a problem similar to the symptoms you're describing, but obviously that doesn't apply here! 😬

So I don't quite have an answer yet, but we've definitely got a clue here for investigating. I'm going to look into why the conhost process is getting spawned and get back to you here once I have more info. Thanks for your patience!

hilayc commented 2 years ago

Hi @tgross! How is the investigation going? Anything more I can do to help? We are still having this issue every few days so if you need more logs/info please let me know.

Netlims commented 2 years ago

Hi @tgross, I noticed the other issue you mentioned. I can see that it refers to a Linux server with the Docker driver, which is an entirely different case from the one here. Based on the recent discovery, is there more information you need me to provide? This is happening in our QA environment almost twice a week, and I am intentionally not trying any workarounds (i.e. increasing the server count) so I will have more data to provide when you need it.

tgross commented 2 years ago

Hi folks, sorry, I don't have any new information on this one. We'll update when we've had a chance to dig in further.

moshemarciano commented 2 years ago

@tgross Any news on this? We've been tracking it for a while.

We are seeing this cripple workloads whenever the number of running processes on a Windows box grows beyond a few dozen.

This is seriously causing us pain, as we must fiddle with process allocation and nodes to get things back to normal every single time, sometimes with embarrassing results.

Not sure how widespread this problem is but based on this thread alone, it seems to be a very serious, possibly critical problem with the ability to scale using Nomad as a scheduler. Any chance this can get more focused attention from HC?

tgross commented 2 years ago

@moshemarciano as noted above, we'll update when we have more information. We've sunk a bunch of effort into trying to track it down and while there have been some interesting discoveries about specific OS behaviors, it doesn't seem to be isolated to a particular OS and so it's likely a problem in logmon proper and/or the go-plugin architecture. While we understand this is important to you, please understand we're always juggling a lot of priorities at once.

a-zagaevskiy commented 2 years ago

Hi everyone! We're also encountering this issue from time to time on Windows (Nomad v0.10.5). The parent nomad.exe process is run with the CREATE_NO_WINDOW flag, and Windows creates a child conhost.exe for it. It's not fully clear why, but sometimes that conhost.exe turns out to have been killed, which leads to the issue: no more allocations can be launched on the client (nor can any be stopped), and the logs fill up with logmon "Unrecognized remote plugin" errors.
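
For reference, a Windows-only sketch (not Nomad's actual process-launch code; the command line is hypothetical) of launching a console child with CREATE_NO_WINDOW, which is what causes Windows to back it with a hidden conhost.exe:

    //go:build windows

    package main

    import (
        "fmt"
        "os/exec"
        "syscall"
    )

    const createNoWindow = 0x08000000 // CREATE_NO_WINDOW

    func main() {
        // Hypothetical helper invocation; Nomad's real launch path differs.
        cmd := exec.Command("nomad.exe", "logmon")
        cmd.SysProcAttr = &syscall.SysProcAttr{CreationFlags: createNoWindow}
        if err := cmd.Start(); err != nil {
            fmt.Println("start failed:", err)
            return
        }
        // The hidden conhost.exe backs this child's console; if it dies, the
        // child's stdout is cut off and the go-plugin handshake never arrives.
        fmt.Println("started pid", cmd.Process.Pid)
    }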

Netlims commented 1 year ago

Hi @tgross, any news? Is there anything else you need from us? We are experiencing this issue on a daily basis, so we can give you any information you need. If it would help, I can offer a remote session where I'll share my screen and you can investigate it live.

ivasylenko commented 1 year ago

I see this problem on Linux. Around the time the Nomad allocations fail, I have a gap (09:17:01-10:00:35) in the system journal:

May 04 09:17:01 gc-management CRON[35912]: (root) CMD (   cd / && r......
...
May 04 10:00:35 gc-management sshd[38976]: Accepted publickey for root from 172.16.....

Errors (09:10:04-10:00:25)

2023-05-04 09:10:04,920 [1:MainThread:ERROR] cluster_interface._are_services_running.548: Service "remote-ssl-proxy" failed to run after 500000ms
...
2023-05-04 10:00:25,571 [1:MainThread:ERROR] cluster_interface._are_services_running.552: "influxdb" failed to run

Also, I see lots of these on the machine:

May  4 10:07:07 gc-management containerd[1253]: time="2023-05-04T10:07:07.439204884Z" level=info msg="shim disconnected" id=6497536a9f1fd3726d5d85a03de5f8572d80bf019dca1cf372cc8264f80fb422
May  4 10:07:07 gc-management containerd[1253]: time="2023-05-04T10:07:07.439433517Z" level=error msg="copy shim log" error="read /proc/self/fd/184: file already closed"

suikast42 commented 2 months ago

Had the same issue today in my VM with nomad 1.8

[screenshot attached]

hilayc commented 2 months ago

The solution to our problem was to increase the amount of memory allocated for non-interactive processes on Windows. You can do that by modifying the following registry key: HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems\Windows

In the value, after "SharedSection", there are three numbers separated by commas; the third number controls the non-interactive memory space. Using 4096 solved this issue completely for us. This is how the value looks after the change: C:\Windows\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,4096 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=sxssrv,4 ProfileControl=Off MaxRequestThreads=16
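
For anyone wanting to check the setting, a Windows-only Go sketch (using golang.org/x/sys/windows/registry; it only reads the value, it does not modify anything) that prints the current SharedSection field:

    //go:build windows

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/sys/windows/registry"
    )

    func main() {
        k, err := registry.OpenKey(registry.LOCAL_MACHINE,
            `SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems`,
            registry.QUERY_VALUE)
        if err != nil {
            fmt.Println("open key failed:", err)
            return
        }
        defer k.Close()

        val, _, err := k.GetStringValue("Windows")
        if err != nil {
            fmt.Println("read value failed:", err)
            return
        }
        // The third SharedSection number is the non-interactive (session 0)
        // desktop heap size in KB; 4096 is the value reported to fix this issue.
        for _, field := range strings.Fields(val) {
            if strings.HasPrefix(field, "SharedSection=") {
                fmt.Println(field)
            }
        }
    }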

tgross commented 2 months ago

@Nickleby9 do you know or have a link to docs on what that value is? We could add that to the install guide for Windows.

hilayc commented 2 months ago

@tgross Sure, you can read about it here: https://learn.microsoft.com/en-us/troubleshoot/windows-server/performance/desktop-heap-limitation-out-of-memory

Nomad uses the non-interactive space (session 0), so that's the only value that needs to be modified.

tgross commented 2 months ago

Thanks @Nickleby9. There's an internal discussion around recommendations for Windows setups and general Windows performance issues, in which I'll raise this. There are obviously also some reports in this issue that this problem impacts Linux users (although maybe less often?). But that's helpful new information.