mirage / qubes-mirage-firewall

A Mirage firewall VM for QubesOS
BSD 2-Clause "Simplified" License
211 stars 28 forks source link

Firewall may refuse to create upstream VM's VIF if downstream "provides network VM" does not present VIF in time? #107

Closed brendanhoar closed 2 years ago

brendanhoar commented 4 years ago

This one stumped me and I haven't dug deep yet, but pretty sure it's related to something going on in mirage firewall...

After switching my sys-net from fedora-29 to fedora-32, the sys-net may take a bit longer to start up networking. Only after switching to fedora-32 did I start seeing some problems that I traced to mirage firewall.

If I have invoked sys-net (which I do not autostartup) by starting the mirage firewall 0.7 first, sometimes sys-net and the mirage firewall appear to be started OK, but...any VM attempting to use the mirage firewall as a provides-network VM will fail to start with the following types of errors: "libxenlight failed to create new domain "

I think what is happening is that the mirage firewall isn't seeing the upstream network in a timely fashion and ends up broken and/or crashed, which then causes the downstream VM to fail to start due to lack of network attachment?

Experiments so far (not proof) indicate that if I start sys-net/fedora-32, wait a while, then start the mirage firewall, everything operates correctly.

Also, I see some exceptions in the VM console output, but they might be red herrings:

Fatal error: exception Xs_protocol.Error("EACCES")
Raised at file "core/xs_protocol.ml", line 685, characters 13-28
Called from file "client_lwt/xs_client_lwt.ml", line 318, characters 13-50
Called from file "src/core/lwt.ml", line 1867, characters 23-26
Re-raised at file "src/core/lwt.ml", line 3027, characters 20-29
Called from file "lib/main.ml", line 35, characters 10-20
Called from file "main.ml" (inlined), line 5, characters 12-25
Called from file "main.ml", line 80, characters 5-10
Mirage exiting with status 2
Do_exit called!

Brendan

talex5 commented 4 years ago

This is a permission-denied error from xenstore. What was in the log before this?

brendanhoar commented 4 years ago

As I said, the exception may be a red herring (perhaps triggered by a manual shutdown request after I wasn't able to get any upstream vpn VMs to connect to this mirage-firewall VM).

...
[admin@dom0 console]$ grep -B 20 -A 20 -i exception guest-sys-mirage-vpn-to-net.log 
2020-06-05 14:13:22 -00:00: INF [qubes.rexec] waiting for client...
2020-06-05 14:13:22 -00:00: INF [qubes.gui] waiting for client...
2020-06-05 14:13:22 -00:00: INF [qubes.db] connecting to server...
2020-06-05 14:13:22 -00:00: INF [qubes.db] connected
2020-06-05 14:13:22 -00:00: INF [qubes.rexec] client connected, using protocol version 2
2020-06-05 14:13:22 -00:00: INF [unikernel] QubesDB and qrexec agents connected in 0.027 s
2020-06-05 14:13:22 -00:00: INF [dao] Got network configuration from QubesDB:
            NetVM IP on uplink network: 10.137.0.5
            Our IP on uplink network:   10.137.0.50
            Our IP on client networks:  10.137.0.50
2020-06-05 14:13:22 -00:00: INF [net-xen frontend] connect 0
2020-06-05 14:13:22 -00:00: INF [net-xen frontend] create: id=0 domid=1
2020-06-05 14:13:22 -00:00: INF [net-xen frontend]  sg:true gso_tcpv4:true rx_copy:true rx_flip:false smart_poll:false
2020-06-05 14:13:22 -00:00: INF [net-xen frontend] MAC: 00:16:3e:5e:6c:00
2020-06-05 14:13:23 -00:00: INF [qubes.db] got update: "/qubes-keyboard" = "xkb_keymap {\n\txkb_keycodes  { include \"evdev+aliases(qwerty)\"\t};\n\txkb_types     { include \"complete\"\t};\n\txkb_compat    { include \"complete\"\t};\n\txkb_symbols   { include \"pc+us+inet(evdev)\"\t};\n\txkb_geometry  { include \"pc(pc105)\"\t};\n};"
2020-06-05 14:13:23 -00:00: INF [qubes.gui] client connected (screen size: 1920x1080 depth: 24 mem: 8101x)
2020-06-05 14:13:23 -00:00: INF [unikernel] GUI agent connected
2020-06-05 14:13:23 -00:00: WRN [command] << Unknown command "QUBESRPC qubes.SetMonitorLayout dom0"
2020-06-05 14:30:10 -00:00: INF [qubes.db] got update: "/qubes-service/qubes-update-check" = "0"
2020-06-05 14:30:12 -00:00: INF [qubes.db] got update: "/qubes-service/qubes-update-check" = "0"
Fatal error: exception Xs_protocol.Error("EACCES")
Raised at file "core/xs_protocol.ml", line 685, characters 13-28
Called from file "client_lwt/xs_client_lwt.ml", line 318, characters 13-50
Called from file "src/core/lwt.ml", line 1867, characters 23-26
Re-raised at file "src/core/lwt.ml", line 3027, characters 20-29
Called from file "lib/main.ml", line 35, characters 10-20
Called from file "main.ml" (inlined), line 5, characters 12-25
Called from file "main.ml", line 80, characters 5-10
Mirage exiting with status 2
Do_exit called!
.[32;1mMirageOS booting....[0m
Initialising timer interface
Initialising console ... done.
gnttab_stubs.c: initialised mini-os gntmap
2020-06-05 14:32:22 -00:00: INF [qubes.rexec] waiting for client...
2020-06-05 14:32:22 -00:00: INF [qubes.gui] waiting for client...
2020-06-05 14:32:22 -00:00: INF [qubes.db] connecting to server...
2020-06-05 14:32:22 -00:00: INF [qubes.db] connected
2020-06-05 14:32:22 -00:00: INF [qubes.rexec] client connected, using protocol version 2
2020-06-05 14:32:22 -00:00: INF [unikernel] QubesDB and qrexec agents connected in 0.030 s
2020-06-05 14:32:22 -00:00: INF [dao] Got network configuration from QubesDB:
...
talex5 commented 4 years ago

Normal start-up looks like this:

2020-06-04 10:46:28 -00:00: INF [net-xen frontend] MAC: 00:16:3e:5e:6c:00
2020-06-04 10:46:28 -00:00: INF [qubes.db] got update: "/mapped-ip/10.137.0.10/visible-ip" = "10.137.0.10"
2020-06-04 10:46:28 -00:00: INF [qubes.db] got update: "/mapped-ip/10.137.0.10/visible-gateway" = "10.137.0.16"
2020-06-04 10:46:28 -00:00: INF [ethernet] Connected Ethernet interface 00:16:3e:5e:6c:00
2020-06-04 10:46:28 -00:00: INF [ARP] Sending gratuitous ARP for 10.137.0.16 (00:16:3e:5e:6c:00)

uplink.ml does:

  Netif.connect "0" >>= fun net ->
  Eth.connect net >>= fun eth ->
  Arp.connect eth >>= fun arp ->

And Eth.connect prints Connected Ethernet interface as its first (and only) action (https://github.com/mirage/ethernet/blob/master/src/ethernet.ml#L88). Your log doesn't show that, so it looks like Netif.connect never returned. The MAC log entry is printed here:

https://github.com/mirage/mirage-net-xen/blob/master/lib/frontend.ml#L102

After that, it writes the network configuration to Xenstore in dom0 and then waits for sys-net to connect. When it's in this stuck state, you should be able to see the state of Xenstore in dom0 using xenstore ls. That will tell us whether the problem is with the firewall or with sys-net.

For example, if xl list shows that sys-net is dom 1 and mirage-firewall is dom 8, then the settings can be viewed with these two commands:

xenstore ls /local/domain/1/backend/vif/8
xenstore ls /local/domain/8/device/vif
brendanhoar commented 4 years ago

Thanks.

As it's on my daily driver, I reverted to one of the 0.6 versions for now. Next time I reboot (and have some time*) I'll set the kernels to 0.7, see if I can repeat the issue, and pull the data you're asking for.

Brendan

* Spouse and I WFH, small house w/ toddler TFH (toddling from home), packing for a move/buying a house, federal civil lawsuit, etc...bit busy. :)

hannesm commented 2 years ago

Any update on this @brendanhoar? We're as well preparing a next release, so you may want to try whether this fixes your issue.

brendanhoar commented 2 years ago

As I plan to migrate to Qubes R4.1 within the next week or two, I will pull down the new 0.8 mirage-firewall version during that effort and test to determine if the issues have resolved. [If not I will pull some more log info.]

Thanks. :)

B

hannesm commented 2 years ago

Thanks for your reply. If you happen to encounter issues with 0.8.2, please open a new report with updated logs. Thanks.

palainp commented 1 year ago

A very similar issue is reported by Viable2381 on https://forum.qubes-os.org/t/libxenlight-failed-unable-to-add-vif-devices/16958/1 but I am not yet able to reproduce it. @brendanhoar would you mind to test again with your 4.1 setup?

palainp commented 1 year ago

I wonder if, like we did in https://github.com/mirage/mirage-net-xen/commit/52d0109fa9117271551b7757d2d2413b57591ebb, it should be safer to wait for backend to be in initialised state (e.g: have something in mirage-net-xen/lib/xenstore.ml, plus a call in mirage-net-xen/lib/frontend.ml just before C.write_frontend_configuration):

  let wait_until_backend_initialised conf =
    Xs.make () >>= fun xsc ->
    Xs.wait xsc (fun h ->
      Lwt.catch
        (fun () ->
          Xs.read h (conf.backend / "state") >>= fun state ->
          let open Xen_os.Device_state in
          match of_string state with
          | Initialised -> return ()
          | Closed        (* XXX: stop waiting? *)
          | Reconfigured  (* XXX: stop waiting? *)
          | _  -> fail Xs_protocol.Eagain
        )
        (function
          | Xs_protocol.Enoent _ -> fail Xs_protocol.Eagain
          | ex -> fail ex
        )
    )