hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

exec driver leaks executor process after `StartTask` error #11958

Open tantra35 opened 2 years ago

tantra35 commented 2 years ago

Nomad version

Output from Nomad v1.1.10 (2f08fe230da05e1b179710ebe0e2582249599a4b+CHANGES)

Operating system and Environment details

Ubuntu 20.04

Issue

If a task requests capabilities that the exec driver does not allow, the task fails (as expected), but the nomad executor process is leaked.

Reproduction steps

For example, request the net_raw capability, which the exec driver does not allow by default:

job testnetworknamespace
{
    region = "global"
    datacenters = ["test"]

    update
    {
        stagger = "1m"
        min_healthy_time = "1m"
        max_parallel = 1
        health_check="checks"
        healthy_deadline = "3m"
        progress_deadline = "6m"
        auto_revert = true
    }

    group testservicecheck
    {
        restart {
            attempts = 2
            delay    = "15s"
        }

        task testservicecheck
        {
            driver = "exec"
            leader=true

            config
            {
                cap_add = ["net_raw"]

                command = "sleep"
                args = ["6000"]
            }

            logs
            {
                max_files = 3
                max_file_size = 10
            }

            resources
            {
                memory = 300
                cpu = 100
            }
        }
    }
} 

After the allocation is placed on a node, it fails with the following task events (which is the expected behavior):

Recent Events:
Time                       Type            Description
2022-01-28T20:22:47+03:00  Killing         Sent interrupt. Waiting 5s before force killing
2022-01-28T20:22:47+03:00  Not Restarting  Error was unrecoverable
2022-01-28T20:22:47+03:00  Driver Failure  driver does not allow the following capabilities: net_raw
2022-01-28T20:22:45+03:00  Task Setup      Building Task Directory
2022-01-28T20:22:40+03:00  Received        Task received by client

On the client node, the nomad executor processes are leaked (excerpt of `ps axuf` output):

dnsmasq    33659  0.0  0.2  13932  2088 ?        S    19:16   0:00 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,20326,8,2,e0
root       33756  0.6  5.6 1363452 56400 ?       Ssl  19:16   0:25 /opt/nomad/nomad agent -config=/etc/nomad/
root       34470  0.0  3.0 1287848 30340 ?       Ssl  19:23   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/0d35c0b9-5a61-adca-d070-413a1ee7ede6/testservicecheck/executor.out"
root       34893  0.0  3.0 1287848 30184 ?       Ssl  19:26   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/ca50587b-fa49-e422-2a7e-84f582147343/testservicecheck/executor.out"
root       38194  0.0  2.9 1509044 29924 ?       Ssl  20:05   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/006bd711-10c8-c230-9da1-b4182f826f8a/testservicecheck/executor.out"
root       38460  0.0  3.0 1287848 30892 ?       Ssl  20:07   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/6586763d-2fe8-9a89-a9e0-591d26461739/testservicecheck/executor.out"
root       38764  0.0  3.0 1287848 31008 ?       Ssl  20:09   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/52ccc789-89d0-23b4-d3ac-1408e6254ded/testservicecheck/executor.out"
root       40194  0.0  3.0 1361580 30492 ?       Ssl  20:22   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/c0d99d3f-3d47-dbb7-833c-054a4ef25721/testservicecheck/executor.out"
root       33760  0.2  2.6 175836 27048 ?        Ssl  19:16   0:12 /opt/consul/consul agent -config-dir=/etc/consul -advertise=192.168.102.22

lgfa29 commented 2 years ago

Thanks for raising this @tantra35. From a quick look at the information you provided (thanks for all the details!), I suspect we're missing some clean-up in an error code path.
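To illustrate the kind of clean-up that may be missing, here is a minimal Go sketch of the general pattern. It is not Nomad's actual code: `startTask` and `validateCaps` are hypothetical names, and a plain `sleep` process stands in for the executor. The assumption is that the helper process is launched before a later step (here, capability validation) can fail, so the error path has to kill and reap the helper itself.

```go
package main

import (
	"fmt"
	"os/exec"
)

// validateCaps is a hypothetical stand-in for a check that runs after the
// helper process has already been started.
func validateCaps(requested, allowed []string) error {
	ok := make(map[string]bool, len(allowed))
	for _, c := range allowed {
		ok[c] = true
	}
	for _, c := range requested {
		if !ok[c] {
			return fmt.Errorf("driver does not allow the following capabilities: %s", c)
		}
	}
	return nil
}

// startTask (hypothetical) launches a helper process, then runs a step that
// may fail. On failure the deferred function kills and reaps the helper so
// it is not leaked. The cmd is returned even on error only so that this
// sketch can show the helper was reaped.
func startTask(requested, allowed []string) (cmd *exec.Cmd, err error) {
	cmd = exec.Command("sleep", "60") // stand-in for the executor process
	if err = cmd.Start(); err != nil {
		return nil, err
	}

	// Clean up on any error that happens after the helper has started.
	defer func() {
		if err != nil {
			_ = cmd.Process.Kill()
			_ = cmd.Wait() // reap the process; sets cmd.ProcessState
		}
	}()

	if err = validateCaps(requested, allowed); err != nil {
		return cmd, err
	}
	return cmd, nil
}

func main() {
	cmd, err := startTask([]string{"net_raw"}, []string{"chown", "kill"})
	fmt.Println("error:", err)
	fmt.Println("helper reaped:", cmd != nil && cmd.ProcessState != nil)
}
```

Without the deferred block, the early return on the validation error would leave the `sleep` helper running under the parent, which resembles the leaked `nomad executor` processes in the `ps` output above.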

tantra35 commented 2 years ago

@lgfa29 could you please tell us whether we can expect a fix soon?

lgfa29 commented 2 years ago

We don't have a date for a fix. I placed this into our backlog for further triaging.

tgross commented 2 months ago

Doing some issue cleanup and wanted to confirm that this is still the case even after some improvements we've made recently to the exec driver's process cleanup. Using the following jobspec:

minimal jobspec

```hcl
job "example" {
  group "sleep" {
    task "sleep" {
      driver = "exec"
      user   = "ubuntu"

      config {
        command = "sleep"
        args    = ["300"]
        cap_add = ["net_raw"]
      }
    }
  }
}
```

We get task events like the following (as expected):

Recent Events:
Time                       Type            Description
2024-06-24T14:40:00-04:00  Not Restarting  Error was unrecoverable
2024-06-24T14:40:00-04:00  Driver Failure  driver does not allow the following capabilities: net_raw
2024-06-24T14:40:00-04:00  Task Setup      Building Task Directory
2024-06-24T14:40:00-04:00  Received        Task received by client

But after a couple of restarts we get leaked executor processes as reported above:

$ ps afx
...
   1997 ?        Ssl    0:01 /usr/local/bin/nomad agent -config /etc/nomad.d
   2131 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/91bdfcf2-9972-5985-8cd7-62a5d566e193/sleep/executor.out
   2166 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/7599c82e-831f-7699-33f4-c6ab8da2655f/sleep/executor.out

I'm going to re-title this slightly and mark it for roadmapping. I'll also note from a quick look at the code that it almost certainly impacts the java driver and possibly the raw_exec driver as well, but haven't tested that.