lf-edge / eve

EVE is Edge Virtualization Engine
https://www.lfedge.org/projects/eve/
Apache License 2.0
470 stars 159 forks source link

Fix Pkill confusion with proper pkill usage and zboot retry #4005

Closed christoph-zededa closed 3 months ago

christoph-zededa commented 3 months ago

Under certain cases (https://github.com/lf-edge/eve/pull/4005#issuecomment-2196792194 ) busybox pkill/pgrep reports a newly forked subprocess (before it had the change to exec) of zedbox. This leads that the subprocess gets killed with USR2 which leads to termination of the process if there is no USR2 handler - and there is none for some of the subprocesses of zedbox. In order to fix this:

  1. only pkill zedbox by using the pidfile of zedbox
  2. for zboot: restart the process if it got killed with USR{1,2}
christoph-zededa commented 3 months ago

@OhmSpectator can you have a look if it is okay the way I send USR2 to the memory handler?

OhmSpectator commented 3 months ago

@OhmSpectator can you have a look if it is okay the way I send USR2 to the memory handler?

Yeah, I think it should be fine. I'll test in now as well.

OhmSpectator commented 3 months ago

Unfortunately, the busybox version of pkill does not support the -o option, so this will not work until we add the propcps package to the base image.

rouming commented 3 months ago

Can this be solved by the zedbox pidfile? Pkill supports this by -F option

christoph-zededa commented 3 months ago

Can this be solved by the zedbox pidfile? Pkill supports this by -F option

unfortunately not all of the pkill implementations on the system support it:

linuxkit-525400123456:~# pkill -h
pkill: unrecognized option: h
BusyBox v1.35.0 (2022-08-01 15:14:44 UTC) multi-call binary.

Usage: pkill [-l|-SIGNAL] [-xfvno] [-s SID|-P PPID|PATTERN]

Send signal to processes selected by regex PATTERN

    -l  List all signals
    -x  Match whole name (not substring)
    -f  Match against entire command line
    -s SID  Match session ID (0 for current)
    -P PPID Match parent process ID
    -v  Negate the match
    -n  Signal the newest process only
    -o  Signal the oldest process only
    -u EUID Match against effective UID
    -U UID  Match against UID
christoph-zededa commented 3 months ago

Unfortunately, the busybox version of pkill does not support the -o option, so this will not work until we add the propcps package to the base image.

should be fixed now - it was a confusion as the busybox pkill does not support this order of parameters ...

@OhmSpectator it should work now

eriknordmark commented 3 months ago

Would it be more robust to do the golang equivalent of kill -SIGX $(cat /run/zedbox.pid) as Roman was suggesting?

OhmSpectator commented 3 months ago

Ok, at least the latest version of the code works fine. Could we check this PR with the ztest run that triggered #4002 ?

christoph-zededa commented 3 months ago

Would it be more robust to do the golang equivalent of kill -SIGX $(cat /run/zedbox.pid) as Roman was suggesting?

I don't think we would need the golang equivalent, bash would be fine, wouldn't it?

/run/zedbox.pid is everywhere available where we want to run eve http-debug? I checked inital, debug and memory-monitor containers and it is available there.

christoph-zededa commented 3 months ago

Would it be more robust to do the golang equivalent of kill -SIGX $(cat /run/zedbox.pid) as Roman was suggesting?

I don't think we would need the golang equivalent, bash would be fine, wouldn't it?

/run/zedbox.pid is everywhere available where we want to run eve http-debug? I checked inital, debug and memory-monitor containers and it is available there.

I changed it ( memory-handler does not have a PID file).

christoph-zededa commented 3 months ago

Build failed because of:

2024-06-26T16:05:22.4560789Z docker: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit.
milan-zededa commented 3 months ago

Build failed because of:

2024-06-26T16:05:22.4560789Z docker: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit.

We have new pulls available today. Take them before they run out :)

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 17.51%. Comparing base (3a02e3f) to head (d13f7fa). Report is 13 commits behind head on master.

:exclamation: Current head d13f7fa differs from pull request most recent head 8c30bf9

Please upload reports for the commit 8c30bf9 to get more accurate results.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #4005 +/- ## ======================================= Coverage 17.51% 17.51% ======================================= Files 3 3 Lines 805 805 ======================================= Hits 141 141 Misses 629 629 Partials 35 35 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

eriknordmark commented 3 months ago

FWIW I wrote a golang program which runs exec.Run while the sigusr2 handler is being invoked, and I don't ever see a failure (it has gone through tens of thousands of iterations today). So I wonder if there is something unique with iinvoking the zboot shell script??

christoph-zededa commented 3 months ago

FWIW I wrote a golang program which runs exec.Run while the sigusr2 handler is being invoked, and I don't ever see a failure (it has gone through tens of thousands of iterations today). So I wonder if there is something unique with iinvoking the zboot shell script??

I think I can somehow reproduce it:

package main

import (
    "os/exec"
    "runtime"
)

func main() {
    for i := 0; i < runtime.NumCPU(); i++ {
        go useCPU()
    }

    runLs()
}

func useCPU() {
    var i int
    for {
        i++
    }
}

func runLs() {
    for {
        cmd := exec.Cmd{
            Path: "/bin/ls",
        }

        _, err := cmd.Output()
        if err != nil {
            panic(err)
        }

        //fmt.Printf("%s\n", out)
        cmd.Run()

        cmd.Wait()
    }

}

run this as exec-ls. And in a different terminal run: while :; do busybox pgrep -l exec-tester; done | tee pids.

Then I get:

$ cat pids | sort | uniq
303320 ./exec-tester
316617 ./exec-tester
316626 ./exec-tester
316635 ./exec-tester
316638 ./exec-tester
316647 ./exec-tester
316824 ./exec-tester
316831 ./exec-tester
316852 ./exec-tester
316862 ./exec-tester
316894 ./exec-tester
316919 ./exec-tester
316928 ./exec-tester

It did not work with normal pgrep.

christoph-zededa commented 3 months ago

DCO is missing on the ubuntu commit, and that needs to be fixed if we are to build and run eden tests etc on the PR

Oh, I forgot to remove this one. That's just my environment when I try to run the go tests locally.

christoph-zededa commented 3 months ago

LGTM but please update the description and title to mention the retry of zboot commands.

Done

FWIW I've run this with the kernel setting to trigger on lower pressure without any issues since yesterday. I should check the logs if I got the retry messages from those runs.

I think you will not see those as this PR also changes how we invoke pkill; unless this failure does not come from /bin/eve.

eriknordmark commented 3 months ago

I think you will not see those as this PR also changes how we invoke pkill; unless this failure does not come from /bin/eve.

Yes, if the issue was the window between fork+change signal hander in child process+exec zboot, then I shouldn't see them. FWIW I don't see any "because of signal" in kibana. Will look again in a week and see if any test triggered it.