lf-edge / eve

EVE is Edge Virtualization Engine
https://www.lfedge.org/projects/eve/
Apache License 2.0

bug: fatal: agent zedbox[1553]: zboot partstate IMGA: err signal: user defined signal 2 #4002

Closed by eriknordmark 4 days ago

eriknordmark commented 1 week ago

Running some ztests using ztest, I saw this failure once:

NEW lastRebootReason: Reboot from agent zedbox[1553] in partition IMGA at EVE version 0.0.0-ni-mtu-79e3b1c6-dirty-2024-06-24.20.42-kvm-amd64 at 2024-06-25T08:22:24.248724161Z: fatal: agent zedbox[1553]: zboot partstate IMGA: err signal: user defined signal 2

Watchdog report for IMGA EVE version 0.0.0-ni-mtu-79e3b1c6-dirty-2024-06-24.20.42-kvm-amd64 at 2024-06-25T08:23:26+00:00: 3 /run/zedbox.pid

lastRebootReason: NORMAL: baseos-update(IMGA) to EVE version 0.0.0-ni-mtu-79e3b1c6-dirty-2024-06-24.20.42-kvm-amd64 reboot at EVE version 0.0.0-master-d59b6121-kvm-amd64 at 2024-06-25T07:02:03.514776941Z

, NEW bootTime (1719303878.0), bootTime (1719298992.0), NEW bootTimeStr (2024-06-25 01:24:38.000000), bootTimeStr (2024-06-25 00:03:12.000000), cfgRebootCntr: 3344, statusRebootCntr: 4451

Stack Trace:
... gentlog.go:337 +0x4a
github.com/lf-edge/eve/pkg/pillar/agentlog.initImpl.func1()
        /pillar/agentlog/agentlog.go:702 +0x29
github.com/sirupsen/logrus.runHandler(0x10?)
        /pillar/vendor/github.com/sirupsen/logrus/alt_exit.go:39 +0x3e
github.com/sirupsen/logrus.runHandlers(...)
        /pillar/vendor/github.com/sirupsen/logrus/alt_exit.go:44
github.com/sirupsen/logrus.(*Logger).Exit(0x3fbe620, 0xc000000001?)
        /pillar/vendor/github.com/sirupsen/logrus/logger.go:343 +0x4c
github.com/sirupsen/logrus.(*Logger).Fatalf(0xc0019e37b8?, {0x2b655e3?, 0x5?}, {0xc00163fd88?, 0x2?, 0x2?})
        /pillar/vendor/github.com/sirupsen/logrus/logger.go:191 +0x4b
github.com/sirupsen/logrus.Fatalf(...)
        /pillar/vendor/github.com/sirupsen/logrus/exported.go:224
github.com/lf-edge/eve/pkg/pillar/zboot.GetPartitionState({0xc0019e37b8, 0x4})
        /pillar/zboot/zboot.go:201 +0x145
github.com/lf-edge/eve/pkg/pillar/zboot.IsPartitionState({0xc0019e37b8, 0x4}, {0x2b2af24, 0xa})
        /pillar/zboot/zboot.go:214 +0x4b
github.com/lf-edge/eve/pkg/pillar/zboot.IsCurrentPartitionStateInProgress()
        /pillar/zboot/zboot.go:303 +0x2a
github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.traceNextReq(0xc0004df200, {0xc001ca9f04?, 0xe?, 0x4086a40?})
        /pillar/cmd/zedagent/handleconfig.go:1302 +0x11c
github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.traceNextConfigReq(...)
        /pillar/cmd/zedagent/handleconfig.go:1311
github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.configTimerTask(0xc00059e000, 0x0?)
        /pillar/cmd/zedagent/handleconfig.go:447 +0x4dd
created by github.com/lf-edge/eve/pkg/pillar/cmd/zedagent.Run
        /pillar/cmd/zedagent/zedagent.go:510 +0x169b

Have we added some periodic SIGUSR2 calls to track usage? Seems like we need to block USR2 (and USR1 if we use that periodically) across calls which can be interrupted.
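For context, a minimal sketch (a hypothetical helper, not the pillar code) of how an exec'ed child such as the zboot partstate call can produce exactly this error: the child starts with the default SIGUSR2 disposition (terminate), so a SIGUSR2 that reaches it makes cmd.Run()/cmd.Wait() return "signal: user defined signal 2".

```go
// sketch.go: reproduce the error string from the fatal above.
// The "sleep 10" child is an illustrative stand-in for "zboot partstate IMGA".
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("sleep", "10")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Simulate a stray SIGUSR2 reaching the child (e.g. one sent to the whole
	// process group). SIGUSR2's default action is to terminate the process.
	time.Sleep(100 * time.Millisecond)
	_ = cmd.Process.Signal(syscall.SIGUSR2)

	// Wait (which Run calls internally) reports the signal as the error.
	err := cmd.Wait()
	fmt.Println(err) // prints: signal: user defined signal 2
}
```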

christoph-zededa commented 1 week ago

pillar has a handler for USR2 and eve http-debug can trigger it.

@OhmSpectator is planning to use it regularly for the memory tracer
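For reference, the general shape of such a handler in Go (a minimal sketch, not the actual agentlog code): once signal.Notify is installed, SIGUSR2 no longer terminates the pillar process itself, but any child started via exec begins with the default disposition again, which may be relevant to the crash above.

```go
// sketch_handler.go: minimal SIGUSR2 handler that dumps memory and stack info.
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR2)
	go func() {
		buf := make([]byte, 1<<20)
		for range sigs {
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			n := runtime.Stack(buf, true) // stacks of all goroutines
			log.Printf("SIGUSR2: HeapAlloc=%d bytes\n%s", m.HeapAlloc, buf[:n])
		}
	}()
	select {} // keep running; trigger with: kill -USR2 <pid>
}
```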

eriknordmark commented 1 week ago

@christoph-zededa do we have something currently which can trigger it? (Apart from the watchdog-repair script which doesn't get called until we are already in trouble and are going down?)

OhmSpectator commented 1 week ago

The memory monitor can trigger USR2 by calling eve http-debug. However, this should not happen unless the system is low on memory. Nevertheless, can we check whether the /persist/memory-monitor/output directory contains the events.log file?

OhmSpectator commented 1 week ago

Also, the memory monitor was merged only recently. @eriknordmark, are you running the tests on a version that already contains the tool? Or do you also see the failure on a version that doesn't have it?

christoph-zededa commented 1 week ago

> @christoph-zededa do we have something currently which can trigger it? (Apart from the watchdog-repair script which doesn't get called until we are already in trouble and are going down?)

I don't know of anything that is "realistic"; in the code I find:

OhmSpectator commented 1 week ago

But I'm not quite sure what exactly happened there. Do we send SIGUSR2 too often, so that zedbox fails to handle it?

OhmSpectator commented 1 week ago

I'll try to run ztest on the PR that introduces the tool, and also on a version before it.

eriknordmark commented 1 week ago

I've never seen this before. So two possibilities:

I will continue running ztests in a loop with the current master, but it might not reproduce that easily.

So it makes sense to look for a theory of why the exec Run function would return such an error. Alternatively, hack up EVE to generate a SIGUSR2 every 100 milliseconds starting when zedbox is started and see whether that causes this failure.
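A sketch of such a hack as a standalone helper (hypothetical, not part of EVE): read the pid from /run/zedbox.pid and send SIGUSR2 every 100 ms.

```go
// usr2storm.go: hypothetical stress helper to provoke the failure.
package main

import (
	"log"
	"os"
	"strconv"
	"strings"
	"syscall"
	"time"
)

func main() {
	data, err := os.ReadFile("/run/zedbox.pid")
	if err != nil {
		log.Fatal(err)
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		log.Fatal(err)
	}
	// Fire SIGUSR2 at zedbox every 100 ms until interrupted.
	for range time.Tick(100 * time.Millisecond) {
		if err := syscall.Kill(pid, syscall.SIGUSR2); err != nil {
			log.Fatalf("kill: %v", err)
		}
	}
}
```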

eriknordmark commented 1 week ago

I have seen two more crashes since last night using recent master, so I suspect something changed in when or how many SIGUSR2 signals we issue with the memory monitor code. Is there a periodic SIGUSR2 with the memory monitor in place? And if so, can we make sure it doesn't fire early in the life of the zedbox process?

OhmSpectator commented 1 week ago

SIGUSR2 is only used in the memory monitor as part of the handler for memory pressure events. So, if the crash is caused by the tool, it means the handler is triggered too early. I can either adjust the thresholds or add some timeout for zedbox to pass the "early" stage. Right now it just waits until the zedbox.pid file appears.
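To illustrate the proposed change (just a sketch of the idea, not the actual memory-monitor code): wait for /run/zedbox.pid to appear and then hold off for a fixed grace period before arming the pressure handler.

```go
// grace.go: illustration of the "early stage" timeout idea (hypothetical).
package main

import (
	"log"
	"os"
	"time"
)

// waitForZedbox blocks until pidFile exists, then waits an extra grace period.
func waitForZedbox(pidFile string, grace time.Duration) {
	for {
		if _, err := os.Stat(pidFile); err == nil {
			break
		}
		time.Sleep(time.Second)
	}
	log.Printf("%s present, waiting %s before arming the pressure handler", pidFile, grace)
	time.Sleep(grace)
}

func main() {
	waitForZedbox("/run/zedbox.pid", 2*time.Minute) // grace period is arbitrary here
	// ... arm the memory-pressure handler only after this point ...
}
```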

OhmSpectator commented 1 week ago

Can we catch the system in this state? I want to check the contents of the /persist/memory-monitor/output directory. I mean, when we see the bug, we should not continue with the ztests on that node.

OhmSpectator commented 1 week ago

I see the Jenkins runs fail for all the PRs, but the error is unclear to me =/

3984: https://jenkins01.devt.zededa.net:8080/view/EVE/job/Eve-PR-SmokeTest-Ring0/52775/

3981: https://jenkins01.devt.zededa.net:8080/view/EVE/job/Eve-PR-SmokeTest-Ring0/52776/

eriknordmark commented 1 week ago

The Jenkins failures are not new (unless they have the same string about zboot partstate).

eriknordmark commented 1 week ago

But @OhmSpectator, can you check whether the memory monitor can generate SIGUSR2 early in the boot of the device, i.e., during the first few seconds/minutes of running pillar/zedbox?

OhmSpectator commented 1 week ago

I've tried to reproduce the scenario by triggering the memory monitor handler early (setting the threshold to a pretty low memory value), so the handler fires almost immediately after the tool starts. But I do not see this behaviour... I'll try again.

OhmSpectator commented 1 week ago

@christoph-zededa, when I trigger the handler early, I see this in the logs:

2024-06-26T11:20:25.292818958Z;pillar;{"file":"/pillar/agentlog/agentlog.go:176","func":"github.com/lf-edge/eve/pkg/pillar/agentlog.dumpMemoryInfo","level":"error","msg":"handleSignals: Error opening file /persist/agentdebug/zedbox//sigusr2 with: open /persist/agentdebug/zedbox//sigusr2: no such file or directory","pid":1502,"source":"zedbox","time":"2024-06-26T11:20:25.292702455Z"}

Interesting: /persist/agentdebug/zedbox indeed does not exist. This is with your fix for pkill. Let me try without it.

OhmSpectator commented 1 week ago

Ah... It looks like this message appears regularly, even when the memory monitor handler is not run that early.

OhmSpectator commented 1 week ago

I found some related logs in Kibana, yahoo. I will try to check them now.

OhmSpectator commented 1 week ago

Okaaay... The memory monitor tool is definitely in play here. I found some other related logs:

Jun 25, 2024 @ 08:53:15.577014570   Exiting the procfs monitor thread
Jun 25, 2024 @ 08:53:15.576925793   Failed to get the RSS of the process 1586
Jun 25, 2024 @ 08:53:15.576755514   Failed to open status file for process 1586

OhmSpectator commented 1 week ago

From what I see in the logs, it was an EVE pressure event. It does not look like it happened at an early stage. I also see a lot of memory-monitor errors like

Jun 25, 2024 @ 07:01:57.873448338   memory-monitor  Invalid value: 822042624
Jun 25, 2024 @ 07:01:57.868862117   memory-monitor  strtoul: No error information
Jun 25, 2024 @ 07:01:57.868846587   memory-monitor  Invalid value: 362553344
Jun 25, 2024 @ 07:01:57.868830347   memory-monitor  strtoul: No error information
Jun 25, 2024 @ 07:01:57.868777718   memory-monitor  Invalid value: 822329344

I'll check it

OhmSpectator commented 1 week ago

The strtoul errors are fixed, but the signal problem is still there... I attach the logs of the problem: logs.json

OhmSpectator commented 1 week ago

Actually, I see the memory pressure event during the download process... Maybe we really have several "zedbox" processes at that moment (one for the download worker?)...

eriknordmark commented 1 week ago

@OhmSpectator I haven't seen any crashes due to this for the last 24 hours. But there is probably still some latent issue there which could hit if we trigger the SIGUSR2. Do you know if you can reproduce this crash if you lower the kernel threshold? Or does it also need to run on the test machines I used?

OhmSpectator commented 4 days ago

@eriknordmark, I also saw the original issue on the other machines used by the ztests started from Jenkins. So, to reproduce it, I can rebuild the tool with a lower pressure threshold (the latest version uses "critical", while we used "medium" at the beginning) and run the ztests. I also hope that #4005 will fix the issue.
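For context, the "medium" and "critical" levels above correspond to cgroup-v1 memory.pressure_level notification levels. A minimal sketch of how such a listener is registered (assuming cgroup v1; the cgroup path and level here are illustrative, not the memory-monitor's actual configuration):

```go
// pressure.go: sketch of a cgroup-v1 memory pressure listener (illustrative).
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const cgroup = "/sys/fs/cgroup/memory/eve" // illustrative cgroup path
	level := "medium"                          // "low", "medium" or "critical"

	efd, err := unix.Eventfd(0, 0)
	if err != nil {
		log.Fatal(err)
	}
	pressure, err := os.Open(cgroup + "/memory.pressure_level")
	if err != nil {
		log.Fatal(err)
	}
	// Registration format: "<eventfd> <fd of memory.pressure_level> <level>".
	ctl := fmt.Sprintf("%d %d %s", efd, pressure.Fd(), level)
	if err := os.WriteFile(cgroup+"/cgroup.event_control", []byte(ctl), 0o644); err != nil {
		log.Fatal(err)
	}

	// Each 8-byte read from the eventfd corresponds to one pressure event.
	buf := make([]byte, 8)
	for {
		if _, err := unix.Read(efd, buf); err != nil {
			log.Fatal(err)
		}
		log.Printf("memory pressure event (%s) received", level)
	}
}
```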