Closed askfongjojo closed 1 year ago
I spent a few minutes looking at this today. Didn't find anything too useful, but I am writing down the things I noticed and looked into here:
At the beginning of the propolis log, we see that something made a GET /instance request before the instance was ensured:

```
04:32:59.990Z INFO propolis-server: Starting server...
04:32:59.991Z INFO propolis-server: listening
    local_addr = [fd00:1122:3344:108::26]:12400
04:33:00.010Z INFO propolis-server: accepted connection
    local_addr = [fd00:1122:3344:108::26]:12400
    remote_addr = [fd00:1122:3344:108::1]:45646
04:33:00.011Z INFO propolis-server: request completed
    error_message_external = Internal Server Error
    error_message_internal = Server not initialized (no instance)
    local_addr = [fd00:1122:3344:108::26]:12400
    method = GET
    remote_addr = [fd00:1122:3344:108::1]:45646
    req_id = 909f0c8d-78f2-4467-a76d-b8702ee6fe31
    response_code = 500
    uri = /instance
```
The same remote address then did a PUT /instance, so presumably that first request was from sled-agent:

```
04:33:00.012Z INFO propolis-server: Attempt to register [fd00:1122:3344:108::26]:0 with Nexus/Oximeter at [fd00:1122:3344:103::4]:12221
    local_addr = [fd00:1122:3344:108::26]:12400
    method = PUT
    remote_addr = [fd00:1122:3344:108::1]:45646
    req_id = a35d1d54-8057-4779-bb3d-9eb01774b16d
    uri = /instance
```
Looking at the sled-agent source, this indeed matches sled-agent's expected behavior today: it uses GET /instance as a way to check whether propolis-server is up. This doesn't indicate a problem, but since it threw me a bit, I thought I would note it.
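As a hedged sketch of that readiness pattern (the names below are illustrative, not sled-agent's actual code): poll the endpoint until any response comes back, treating even an error response as proof that the server is listening.

```python
import time

def wait_until_listening(probe, attempts=30, delay=0.5):
    # Illustrative stand-in for sled-agent's startup check: keep
    # issuing GET /instance until the server answers at all. Even a
    # 500 "Server not initialized" proves propolis-server is up.
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False

# Fake probe standing in for the HTTP call: no answer twice, then one.
answers = iter([False, False, True])
ready = wait_until_listening(lambda: next(answers), delay=0)
```

This is why the 500 above is harmless: the check only cares that the server responded, not what it said.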
This issue:

```
[  119.356621] systemd-journald[379]: Failed to send WATCHDOG=1 notification message: Connection refused
[  229.356566] systemd-journald[379]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[  116.498129] systemd-journald[376]: Failed to send WATCHDOG=1 notification message: Connection refused
[  226.497975] systemd-journald[376]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
```
Just from the message alone, this suggests a networking issue, perhaps? If we see this issue again, it would be good to grab the output of `systemctl status systemd-timesyncd.service` (or whichever service failed) to see if there's more useful information there.
In the propolis log, I noticed that the generation number used by the upstairs was a lot further ahead than the downstairs:

```
Jun 04 04:33:08.121 INFO Max found gen is 2
Jun 04 04:33:08.121 INFO Generation requested: 26 >= found:2
```
It's not entirely clear to me what this means -- per @leftwo, this could happen if the volume is "checked out" multiple times without being written to. I can't confirm that from the log messages alone, but if this was the same volume used for the other instances in Angela's tests, that could well have happened.
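A hedged sketch of how I understand those two log lines (the function names here are made up for illustration, not crucible's actual API): each checkout bumps the upstairs generation even when nothing is written, so the requested generation can run far ahead of the highest generation the downstairs has recorded, and activation proceeds as long as requested >= found.

```python
def activation_allowed(requested_gen: int, max_found_gen: int) -> bool:
    # Hypothetical sketch: the upstairs may activate only with a
    # generation at least as new as the highest one the downstairs
    # has recorded.
    return requested_gen >= max_found_gen

def checkout(gen: int) -> int:
    # Each checkout bumps the generation even if the volume is never
    # written, so repeated checkouts without writes leave the
    # downstairs' recorded generation far behind.
    return gen + 1

gen = 2                     # matches "Max found gen is 2"
for _ in range(24):         # 24 hypothetical checkouts without writes
    gen = checkout(gen)
# gen is now 26, matching "Generation requested: 26 >= found:2"
```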
A cursory look at the systemd source indicates that the logged send() failure is to an AF_UNIX socket, and given that the failure in #427 is a segfault, my handwavey suggestion of a networking issue is not likely correct. Will keep looking at this later.
I spent some time understanding the error logged here more via the systemd source. My understanding so far is:

- The error (`Failed to send WATCHDOG=1 notification message: ...`) comes from journald-server as it tries to `send(2)` to its notify socket (an AF_UNIX, SOCK_DGRAM socket named by `NOTIFY_SOCKET`), which services can use to send messages to the service manager in PID 1 (see also: sd_notify).
- The message being sent is `WATCHDOG=1`, which is just a ping:
  > Tells the service manager to update the watchdog timestamp. This is the keep-alive ping that services need to issue in regular intervals if WatchdogSec= is enabled for it.
- The failure is `ECONNREFUSED` first, then `ENOTCONN` thereafter.

So, TL;DR: journald is trying to ping systemd via the well-known `NOTIFY_SOCKET` on a regular interval, but it can't because something has gone awry with that socket.
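To make the mechanism concrete, here's a minimal Python sketch of the same pattern using plain AF_UNIX datagram sockets (this is not systemd's actual code; the socket path and variable names are made up):

```python
import os
import socket
import tempfile

# "Service manager" side: bind a well-known notify socket, like
# systemd's NOTIFY_SOCKET.
sock_path = os.path.join(tempfile.mkdtemp(), "notify")
manager = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
manager.bind(sock_path)

# "Service" side: connect and send WATCHDOG=1 as a single datagram,
# roughly what sd_notify(3) does for the keep-alive ping.
service = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
service.connect(sock_path)
service.send(b"WATCHDOG=1")
received = manager.recv(64)

# If the manager's end of the socket goes away, the ping fails with an
# errno like the ones journald logged (ECONNREFUSED on Linux).
manager.close()
os.unlink(sock_path)
failure = None
try:
    service.send(b"WATCHDOG=1")
except OSError as exc:
    failure = exc
```

The interesting question for this bug is which of those failure modes hit the real notify socket inside the guest.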
I was able to reproduce some similar failures on dogfood in a reboot loop. As a next step, I'd like to enable more verbose debugging output for systemd to get some clues as to what happened to the notify socket; hopefully that will also shed light on the broader array of userspace guest issues we've seen with 22.04 here.
This is assumed to be one of the manifestations of the problem outlined in #427.
The issue is sporadic - it only happened on one of the 13 identical instances I created using the same jammy cloud image (several other instances failed at different points of guest initialization; I'll file separate tickets for each of those failure modes).
Here is the console log as seen in the rack2 Console UI (https://recovery.sys.rack2.eng.oxide.computer/projects/try/instances/sysbench-mysql-13/serial-console)
The propolis zone is located in sled BRM44220005 (rack2, cubby 9). I've copied the propolis log file to: catacomb.eng.oxide.computer:/data/staff/dogfood/jun-03/system-illumos-propolis-server_vm-74e29bfc-a137-4274-9fbc-1bbe7468f3ab.log