Closed: eriknordmark closed this issue 4 days ago
pillar has a handler for USR2, and eve http-debug can trigger it.
@OhmSpectator is planning to use it regularly for the memory tracer
@christoph-zededa do we have something currently which can trigger it? (Apart from the watchdog-repair script which doesn't get called until we are already in trouble and are going down?)
The memory monitor can trigger USR2 by calling eve http-debug. However, this should not happen until the system is low on memory. Nevertheless, can we check whether the /persist/memory-monitor/output directory contains the events.log file?
Also, the memory monitor was merged only recently. @eriknordmark, are you running the tests on a version that already contains the tool? And does the issue not appear on a version without it?
@christoph-zededa do we have something currently which can trigger it? (Apart from the watchdog-repair script which doesn't get called until we are already in trouble and are going down?)
I don't know anything that is "realistic"; in the code I find:
But I'm not quite sure what exactly happened there. Do we send SIGUSR2 too often, so that zedbox fails to handle it?
I'll try to run ztest on the PR that introduces the tool, and will also try a version before it.
I've never seen this before. So, two possibilities: I am running ztests in a loop with the current master and will continue to do that, but it might not reproduce that easily. So it also makes sense to look into a theory of why the exec Run function would return such an error. Alternatively, hack up EVE to generate a SIGUSR2 every 100 milliseconds from the moment zedbox starts and see whether that causes this failure.
I have two more crashes since last night using recent master. So I suspect something changed in when/how often we issue SIGUSR2 with the memory monitor code. Is there a periodic SIGUSR2 with the memory monitor in place? And if so, can we make sure it doesn't fire early in the life of the zedbox process?
SIGUSR2 is only used in the memory monitor as part of the handler for memory pressure events.
So, if it is caused by the tool, it means it's triggered too early. I can either adjust the thresholds or add a timeout so that zedbox passes the "early" stage. Right now it just waits until the zedbox.pid file appears.
Can we catch the system in this state? I want to check the contents of the /persist/memory-monitor/output directory. That is, when we see the bug, we should not continue with the ztests on that node.
I see the Jenkins runs fail for all the PRs, but the error is unclear to me =/
The Jenkins failures are not new (unless they have the same string about zboot partstate).
But @OhmSpectator can you check if the memory monitor can generate SIGUSR2 early in the boot of the device i.e., during the first few seconds/minutes of running pillar/zedbox?
I've tried to reproduce the scenario by triggering the memory monitor handler early (setting the threshold to a pretty low memory value), so the handler fires almost immediately after the tool starts. But I still do not see this behaviour... I'll try again.
@christoph-zededa, when I trigger the handler early, I see this in the logs:
2024-06-26T11:20:25.292818958Z;pillar;{"file":"/pillar/agentlog/agentlog.go:176","func":"github.com/lf-edge/eve/pkg/pillar/agentlog.dumpMemoryInfo","level":"error","msg":"handleSignals: Error opening file /persist/agentdebug/zedbox//sigusr2 with: open /persist/agentdebug/zedbox//sigusr2: no such file or directory","pid":1502,"source":"zedbox","time":"2024-06-26T11:20:25.292702455Z"}
Interestingly, /persist/agentdebug/zedbox indeed does not exist.
It's with your fix for pkill. Let me try without.
Ah... It looks like this message appears regularly, even when the memory monitor handler is not run that early.
I found some related logs in Kibana, yahoo. I will try to check them now.
Okaaay... The memory monitor tool is definitely part of the play here. I found some other related logs:
Jun 25, 2024 @ 08:53:15.577014570 Exiting the procfs monitor thread
Jun 25, 2024 @ 08:53:15.576925793 Failed to get the RSS of the process 1586
Jun 25, 2024 @ 08:53:15.576755514 Failed to open status file for process 1586
From what I see in the logs, it was an EVE pressure event. It does not look like it happened at an early stage. I also see a lot of memory-monitor errors like
Jun 25, 2024 @ 07:01:57.873448338 memory-monitor Invalid value: 822042624
Jun 25, 2024 @ 07:01:57.868862117 memory-monitor strtoul: No error information
Jun 25, 2024 @ 07:01:57.868846587 memory-monitor Invalid value: 362553344
Jun 25, 2024 @ 07:01:57.868830347 memory-monitor strtoul: No error information
Jun 25, 2024 @ 07:01:57.868777718 memory-monitor Invalid value: 822329344
I'll check it
The strtoul errors are fixed, but the signal problem is still there... I've attached the logs of the problem: logs.json
Actually, I see the memory pressure event during the download process... Maybe we really have several "zedbox" processes at this moment (one for the download worker?)...
@OhmSpectator I haven't seen any crashes due to this for the last 24 hours. But there is probably still some latent issue there which can hit should we trigger the SIGUSR2. Do you know if you can reproduce this crash if you lower the kernel threshold? Or it needs to also run on the test machines I used?
@eriknordmark, the original issue I also saw on the other machines used by ztests started from Jenkins. So, to reproduce it, I can rebuild the tool with a lower pressure threshold (the latest version uses "critical", and we used "medium" at the beginning) and run the ztests. I also hope that #4005 fixes the issue.
Running some ztests using ztest I saw this failure once:
NEW lastRebootReason: Reboot from agent zedbox[1553] in partition IMGA at EVE version 0.0.0-ni-mtu-79e3b1c6-dirty-2024-06-24.20.42-kvm-amd64 at 2024-06-25T08:22:24.248724161Z: fatal: agent zedbox[1553]: zboot partstate IMGA: err signal: user defined signal 2
Watchdog report for IMGA EVE version 0.0.0-ni-mtu-79e3b1c6-dirty-2024-06-24.20.42-kvm-amd64 at 2024-06-25T08:23:26+00:00: 3 /run/zedbox.pid
lastRebootReason: NORMAL: baseos-update(IMGA) to EVE version 0.0.0-ni-mtu-79e3b1c6-dirty-2024-06-24.20.42-kvm-amd64 reboot at EVE version 0.0.0-master-d59b6121-kvm-amd64 at 2024-06-25T07:02:03.514776941Z
, NEW bootTime (1719303878.0), bootTime (1719298992.0), NEW bootTimeStr (2024-06-25 01:24:38.000000), bootTimeStr (2024-06-25 00:03:12.000000), cfgRebootCntr: 3344, statusRebootCntr: 4451
Stack Trace:
... gentlog.go:337 +0x4a
github.com/lf-edge/eve/pkg/pillar/agentlog.initImpl.func1()
	/pillar/agentlog/agentlog.go:702 +0x29
github.com/sirupsen/logrus.runHandler(0x10?)
	/pillar/vendor/github.com/sirupsen/logrus/alt_exit.go:39 +0x3e
github.com/sirupsen/logrus.runHandlers(...)
	/pillar/vendor/github.com/sirupsen/logrus/alt_exit.go:44
github.com/sirupsen/logrus.(*Logger).Exit(0x3fbe620, 0xc000000001?)
	/pillar/vendor/github.com/sirupsen/logrus/logger.go:343 +0x4c
github.com/sirupsen/logrus.(*Logger).Fatalf(0xc0019e37b8?, {0x2b655e3?, 0x5?}, {0xc00163fd88?, 0x2?, 0x2?})
	/pillar/vendor/github.com/sirupsen/logrus/logger.go:191 +0x4b
github.com/sirupsen/logrus.Fatalf(...)
	/pillar/vendor/github.com/sirupsen/logrus/exported.go:224
github.com/lf-edge/eve/pkg/pillar/zboot.GetPartitionState({0xc0019e37b8, 0x4})
	/pillar/zboot/zboot.go:201 +0x145
github.com/lf-edge/eve/pkg/pillar/zboot.IsPartitionState({0xc0019e37b8, 0x4}, {0x2b2af24, 0xa})
	/pillar/zboot/zboot.go:214 +0x4b
github.com/lf-edge/eve/pkg/pillar/zboot.IsCurrentPartitionStateInProgress()
	/pillar/zboot/zboot.go:303 +0x2a
github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.traceNextReq(0xc0004df200, {0xc001ca9f04?, 0xe?, 0x4086a40?})
	/pillar/cmd/zedagent/handleconfig.go:1302 +0x11c
github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.traceNextConfigReq(...)
	/pillar/cmd/zedagent/handleconfig.go:1311
github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.configTimerTask(0xc00059e000, 0x0?)
	/pillar/cmd/zedagent/handleconfig.go:447 +0x4dd
created by github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.Run
	/pillar/cmd/zedagent/zedagent.go:510 +0x169b
Have we added some periodic SIGUSR2 calls to track usage? Seems like we need to block USR2 (and USR1 if we use that periodically) across calls which can be interrupted.