monogon-dev / monogon

The Monogon Monorepo. May contain traces of peanuts and a ✨pure Go Linux userland✨. Work in progress!
https://monogon.tech
Apache License 2.0
378 stars 9 forks source link

node: concurrent map read/write inside hostentry sync #225

Closed fionera closed 1 year ago

fionera commented 1 year ago

After adding nodes fairly quickly we encountered a crash:

           hostsfile I Hosts entry:fatal error: concurrent map iteration and map write

goroutine 276 [running]:
runtime.throw({0x254e1d0?, 0x5?})
    GOROOT/src/runtime/panic.go:992 +0x71 fp=0xc0194ada50 sp=0xc0194ada20 pc=0x4386b1
runtime.mapiternext(0x22112c0?)
    GOROOT/src/runtime/map.go:871 +0x4eb fp=0xc0194adac0 sp=0xc0194ada50 pc=0x41006b
source.monogon.dev/metropolis/node/core/network/hostsfile.(*Service).Run(0xc001736050, {0x28b7fb0, 0xc00073a240})
    metropolis/node/core/network/hostsfile/hostsfile.go:190 +0xb0b fp=0xc0194ade48 sp=0xc0194adac0 pc=0x1cd2a8b
source.monogon.dev/metropolis/node/core/roleserve.(*workerHostsfile).run(0xc000140800, {0x28b7fb0, 0xc00073a240})
    metropolis/node/core/roleserve/worker_hostsfile.go:53 +0x2f1 fp=0xc0194adf40 sp=0xc0194ade48 pc=0x1e78af1
source.monogon.dev/metropolis/node/core/roleserve.(*workerHostsfile).run-fm({0x28b7fb0?, 0xc00073a240?})
    <autogenerated>:1 +0x39 fp=0xc0194adf68 sp=0xc0194adf40 pc=0x1e7df39
source.monogon.dev/metropolis/pkg/supervisor.(*supervisor).processSchedule.func1()
    metropolis/pkg/supervisor/supervisor_processor.go:249 +0x93 fp=0xc0194adfe0 sp=0xc0194adf68 pc=0x987b13
runtime.goexit()
    src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0194adfe8 sp=0xc0194adfe0 pc=0x46b061
created by source.monogon.dev/metropolis/pkg/supervisor.(*supervisor).processSchedule
    metropolis/pkg/supervisor/supervisor_processor.go:235 +0xea

After a reboot the node crashes while printing the hosts entries but without any error message.

Metropolis: this is /dev/ttyS1. Verbose node logs follow.

        panichandler E Failed to open core runtime log file: read-only file system
        panichandler W Continuing without persistent panic storage.
        panichandler I Panic console: /dev/tty0
        panichandler I Panic console: /dev/ttyS0
        panichandler I Panic console: /dev/ttyS1
                init I Starting Metropolis node init
                root I Board name: "[REDACTED]"
                root I No qemu fwcfg params.
                root I Retrieved node parameters from ESP
          k8s worker I Waiting for startup data...
     cplane launcher I Waiting for start data...
      k8s controller I Waiting for startup data...
           hostsfile I Waiting for curator connection...
          clusternet I Waiting for curator connection...
           rolefetch I Waiting for curator connection...
            nodemgmt I Waiting for cluster membership...
           heartbeat I Waiting for curator connection...
          net static I Configured interface "bond0"
          net static I Configured interface "enp65s0f0"
                root I Non-sealed configuration present. attempting to join cluster
                root I Joining an existing cluster.
                root I   Using TPM-secured configuration: false
                root I   Node Join public key: 1d83cf94261775598949149bd5c37b028dc313033680a0571d54a7f370cd9b0b
                root I   Directory:
                root I     Addresses:
q3k commented 1 year ago

I think this is two separate bugs?

One is the concurrent map access (which should've been caught by the race detector, which we're still not running automatically in tests...).

The second is a crash when joining with an empty node directory.

q3k commented 1 year ago

Tentative fix for the first issue: https://review.monogon.dev/c/monogon/+/1817

q3k commented 1 year ago

This implements explicitly failing join if we don't have a cluster directory: https://review.monogon.dev/c/monogon/+/1818

This isn't a fix, but it should make this failure mode clearer to cluster operators.

Not sure how much time we wanna spend investigating the silent aspect of the crash. I expect it might be a quiet panic due to us not catching them so early on in the boot process. And without such a handler, the panics go straight into /dev/stderr, which in our case is likely not /dev/ttyS1.

fionera commented 1 year ago

We found the reason for the Silent fail last night. Its an issue with the BMC not being fast enough. I made a small patch I still have to push that removes printing the whole directory on startup, to reduce the amount of logs written to serial

lorenz commented 1 year ago

Closed in https://review.monogon.dev/c/monogon/+/1817