Closed brendanhoar closed 2 years ago
This is a permission-denied error from xenstore. What was in the log before this?
As I said, the exception may be a red herring (perhaps triggered by a manual shutdown request after I wasn't able to get any upstream vpn VMs to connect to this mirage-firewall VM).
...
[admin@dom0 console]$ grep -B 20 -A 20 -i exception guest-sys-mirage-vpn-to-net.log
2020-06-05 14:13:22 -00:00: INF [qubes.rexec] waiting for client...
2020-06-05 14:13:22 -00:00: INF [qubes.gui] waiting for client...
2020-06-05 14:13:22 -00:00: INF [qubes.db] connecting to server...
2020-06-05 14:13:22 -00:00: INF [qubes.db] connected
2020-06-05 14:13:22 -00:00: INF [qubes.rexec] client connected, using protocol version 2
2020-06-05 14:13:22 -00:00: INF [unikernel] QubesDB and qrexec agents connected in 0.027 s
2020-06-05 14:13:22 -00:00: INF [dao] Got network configuration from QubesDB:
NetVM IP on uplink network: 10.137.0.5
Our IP on uplink network: 10.137.0.50
Our IP on client networks: 10.137.0.50
2020-06-05 14:13:22 -00:00: INF [net-xen frontend] connect 0
2020-06-05 14:13:22 -00:00: INF [net-xen frontend] create: id=0 domid=1
2020-06-05 14:13:22 -00:00: INF [net-xen frontend] sg:true gso_tcpv4:true rx_copy:true rx_flip:false smart_poll:false
2020-06-05 14:13:22 -00:00: INF [net-xen frontend] MAC: 00:16:3e:5e:6c:00
2020-06-05 14:13:23 -00:00: INF [qubes.db] got update: "/qubes-keyboard" = "xkb_keymap {\n\txkb_keycodes { include \"evdev+aliases(qwerty)\"\t};\n\txkb_types { include \"complete\"\t};\n\txkb_compat { include \"complete\"\t};\n\txkb_symbols { include \"pc+us+inet(evdev)\"\t};\n\txkb_geometry { include \"pc(pc105)\"\t};\n};"
2020-06-05 14:13:23 -00:00: INF [qubes.gui] client connected (screen size: 1920x1080 depth: 24 mem: 8101x)
2020-06-05 14:13:23 -00:00: INF [unikernel] GUI agent connected
2020-06-05 14:13:23 -00:00: WRN [command] << Unknown command "QUBESRPC qubes.SetMonitorLayout dom0"
2020-06-05 14:30:10 -00:00: INF [qubes.db] got update: "/qubes-service/qubes-update-check" = "0"
2020-06-05 14:30:12 -00:00: INF [qubes.db] got update: "/qubes-service/qubes-update-check" = "0"
Fatal error: exception Xs_protocol.Error("EACCES")
Raised at file "core/xs_protocol.ml", line 685, characters 13-28
Called from file "client_lwt/xs_client_lwt.ml", line 318, characters 13-50
Called from file "src/core/lwt.ml", line 1867, characters 23-26
Re-raised at file "src/core/lwt.ml", line 3027, characters 20-29
Called from file "lib/main.ml", line 35, characters 10-20
Called from file "main.ml" (inlined), line 5, characters 12-25
Called from file "main.ml", line 80, characters 5-10
Mirage exiting with status 2
Do_exit called!
.[32;1mMirageOS booting....[0m
Initialising timer interface
Initialising console ... done.
gnttab_stubs.c: initialised mini-os gntmap
2020-06-05 14:32:22 -00:00: INF [qubes.rexec] waiting for client...
2020-06-05 14:32:22 -00:00: INF [qubes.gui] waiting for client...
2020-06-05 14:32:22 -00:00: INF [qubes.db] connecting to server...
2020-06-05 14:32:22 -00:00: INF [qubes.db] connected
2020-06-05 14:32:22 -00:00: INF [qubes.rexec] client connected, using protocol version 2
2020-06-05 14:32:22 -00:00: INF [unikernel] QubesDB and qrexec agents connected in 0.030 s
2020-06-05 14:32:22 -00:00: INF [dao] Got network configuration from QubesDB:
...
Normal start-up looks like this:
2020-06-04 10:46:28 -00:00: INF [net-xen frontend] MAC: 00:16:3e:5e:6c:00
2020-06-04 10:46:28 -00:00: INF [qubes.db] got update: "/mapped-ip/10.137.0.10/visible-ip" = "10.137.0.10"
2020-06-04 10:46:28 -00:00: INF [qubes.db] got update: "/mapped-ip/10.137.0.10/visible-gateway" = "10.137.0.16"
2020-06-04 10:46:28 -00:00: INF [ethernet] Connected Ethernet interface 00:16:3e:5e:6c:00
2020-06-04 10:46:28 -00:00: INF [ARP] Sending gratuitous ARP for 10.137.0.16 (00:16:3e:5e:6c:00)
uplink.ml
does:
Netif.connect "0" >>= fun net ->
Eth.connect net >>= fun eth ->
Arp.connect eth >>= fun arp ->
And Eth.connect
prints Connected Ethernet interface
as its first (and only) action (https://github.com/mirage/ethernet/blob/master/src/ethernet.ml#L88). Your log doesn't show that, so it looks like Netif.connect
never returned. The MAC
log entry is printed here:
https://github.com/mirage/mirage-net-xen/blob/master/lib/frontend.ml#L102
After that, it writes the network configuration to Xenstore in dom0 and then waits for sys-net to connect. When it's in this stuck state, you should be able to see the state of Xenstore in dom0 using xenstore ls
. That will tell us whether the problem is with the firewall or with sys-net.
For example, if xl list
shows that sys-net
is dom 1 and mirage-firewall
is dom 8, then the settings can be viewed with these two commands:
xenstore ls /local/domain/1/backend/vif/8
xenstore ls /local/domain/8/device/vif
Thanks.
As it's on my daily driver, I reverted to one of the 0.6 versions for now. Next time I reboot (and have some time*) I'll set the kernels to 0.7, see if I can repeat the issue, and pull the data you're asking for.
Brendan
* Spouse and I WFH, small house w/ toddler TFH (toddling from home), packing for a move/buying a house, federal civil lawsuit, etc...bit busy. :)
Any update on this @brendanhoar? We're as well preparing a next release, so you may want to try whether this fixes your issue.
As I plan to migrate to Qubes R4.1 within the next week or two, I will pull down the new 0.8 mirage-firewall version during that effort and test to determine if the issues have resolved. [If not I will pull some more log info.]
Thanks. :)
B
Thanks for your reply. If you happen to encounter issues with 0.8.2, please open a new report with updated logs. Thanks.
A very similar issue is reported by Viable2381 on https://forum.qubes-os.org/t/libxenlight-failed-unable-to-add-vif-devices/16958/1 but I am not yet able to reproduce it. @brendanhoar would you mind to test again with your 4.1 setup?
I wonder if, like we did in https://github.com/mirage/mirage-net-xen/commit/52d0109fa9117271551b7757d2d2413b57591ebb, it should be safer to wait for backend to be in initialised state (e.g: have something in mirage-net-xen/lib/xenstore.ml
, plus a call in mirage-net-xen/lib/frontend.ml
just before C.write_frontend_configuration
):
let wait_until_backend_initialised conf =
Xs.make () >>= fun xsc ->
Xs.wait xsc (fun h ->
Lwt.catch
(fun () ->
Xs.read h (conf.backend / "state") >>= fun state ->
let open Xen_os.Device_state in
match of_string state with
| Initialised -> return ()
| Closed (* XXX: stop waiting? *)
| Reconfigured (* XXX: stop waiting? *)
| _ -> fail Xs_protocol.Eagain
)
(function
| Xs_protocol.Enoent _ -> fail Xs_protocol.Eagain
| ex -> fail ex
)
)
This one stumped me and I haven't dug deep yet, but pretty sure it's related to something going on in mirage firewall...
After switching my sys-net from fedora-29 to fedora-32, the sys-net may take a bit longer to start up networking. Only after switching to fedora-32 did I start seeing some problems that I traced to mirage firewall.
If I have invoked sys-net (which I do not autostartup) by starting the mirage firewall 0.7 first, sometimes sys-net and the mirage firewall appear to be started OK, but...any VM attempting to use the mirage firewall as a provides-network VM will fail to start with the following types of errors: "libxenlight failed to create new domain"
I think what is happening is that the mirage firewall isn't seeing the upstream network in a timely fashion and ends up broken and/or crashed, which then causes the downstream VM to fail to start due to lack of network attachment?
Experiments so far (not proof) indicate that if I start sys-net/fedora-32, wait a while, then start the mirage firewall, everything operates correctly.
Also, I see some exceptions in the VM console output, but they might be red herrings:
Brendan