Closed by fionera 1 year ago
I think this is two separate bugs?
One is the concurrent map access (which should've been caught by the race detector, which we're still not running automatically in tests...).
The second is a crash when joining with an empty node directory.
Tentative fix for the first issue: https://review.monogon.dev/c/monogon/+/1817
This makes join fail explicitly if we don't have a cluster directory: https://review.monogon.dev/c/monogon/+/1818
This isn't a fix, but it should make this failure mode clearer to cluster operators.
Not sure how much time we want to spend investigating why the crash is silent. I suspect it's a panic that goes unreported because we don't catch panics that early in the boot process. Without such a handler, panics go straight to /dev/stderr, which in our case is likely not /dev/ttyS1.
We found the reason for the silent failure last night. It's an issue with the BMC not being fast enough. I have a small patch, which I still have to push, that stops printing the whole directory on startup to reduce the amount of logs written to serial.
After adding nodes fairly quickly we encountered a crash:
After a reboot the node crashes while printing the hosts entries but without any error message.