qubesdb sync blocking dom0+domU processes

xaki23 commented 6 years ago

create/clone a mirage firewall
attach 5+ downstream vms to it
start all these vms
qvm-kill mirage-fw
optional: start tail -f on the mfw console log
optional: start strace -f on qubesdb
qvm-start mirage-fw
try to do anything in qubes-manager
note how qubes manager doesnt react at all, not even local stuff like being able to change highlighted row
looking at mfw-log this is the initial sync of the qubesdb
the client infos arrive with precisely 10 seconds of delay in between
looking at qdb-strace it gets caught in its mainloop pselect, which has a 10sec timeout
so total blockage time is 10 sec * number of downstream vms
replace mirage-fw by a sys-fw clone in this setup, kill+start again, no blockage

both mfw and sysfw seem to be doing a multiread to sync on startup. i was unable to find any significant difference from skimming the code.

but no matter what mfw is doing different there, it should not block dom0 that hard.

filing this on both qubes and mirage side since i see room for improvement on either. (though this might turn purely into a qubes-side thing, depending on closer analysis)

qubes side issue here: https://github.com/QubesOS/qubes-issues/issues/4498

talex5 commented 6 years ago

The firewall doesn't keep any state between runs, so if starting it a second time is causing trouble, then dom0 must be doing something different. What happens if you kill mirage-firewall and start sys-firewall, or the other way around? Maybe there is some QubesDB shutdown protocol you're supposed to do, but mirage-firewall doesn't bother. It does sound like a bug in dom0 though.

xaki23 commented 6 years ago

the difference between the runs is that the "second time" is with a handful of downstream VMs already running. this means there is a lot more qubesdb rows to transfer in the initial qubesdb sync. and the sending process in dom0 seems to take a pselect-nap between sending each appvm tuple.

if this was comms by tcp, my first guess would be "the client is not flushing after ack-write" or so. might be a timing issue as well, with mirage being "too fast".

the blocking itself (and it having global impact) is surely a qubes/dom0 fault. (DoS potential) but mirage must be doing something different (compared to the qubes-C implementation) to trigger it. and i very much would prefer my mfw not to take several minutes on startup, more so considering https://github.com/talex5/qubes-mirage-firewall/issues/35 is hitting me "daily".

were you able to repro it? "worked" here on first try on a clean 4.0.1-rc1 install. (was making sure its not some problem with my 4.0 desktop)

marmarek commented 6 years ago

I'd guess missing evtchn trigger after reading/writing data in vchan. Equivalent of this and this in vchan's C implementation.

talex5 commented 5 years ago

Do you known if there a spec for vchan somewhere, or is the documentation just the C code?

marmarek commented 5 years ago

I'm afraid the latter :/

talex5 commented 5 years ago

I assume it's this pselect: https://github.com/QubesOS/qubes-core-qubesdb/blob/master/daemon/db-daemon.c#L405

It doesn't look to me as though this pselect can hold up a transfer.

If it's part of the the firewall's initial sync, we go to handle_vchan_data, then handle_multiread. That writes all the entries without a pselect, so it should either be fast or block forever. No 10s timeout.
If it's forwarding an update from a local client, we go to handle_client_data, then handle_write, write_vchan_or_client, libvchan_send. This appears to be a blocking send, so again there will be no timeout.

The only thing the pselect could be waiting for from the firewall is a QDB_RESP_OK. But it just throws such messages away and doesn't trigger anything else to happen.

I managed to reproduce the problem myself by restarting the firewall while it still had client VMs attached.

The multiread sync worked fine, but then QubesDB starts pushing updates, two at a time, with 10s between them.

Here's a bit from the strace:

...
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 0 (Timeout)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
ioctl(6, _IOC(0, 0x50, 0, 0x30), 0x7fff510385c0) = 0
rt_sigprocmask(SIG_BLOCK, [TERM], [], 8) = 0
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 1 (in [4], left {tv_sec=9, tv_nsec=919724312})
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
ioctl(6, _IOC(0, 0x50, 0, 0x30), 0x7fff510385c0) = 0
read(4, "\1/mapped-ip/10.137.0.14/visible-"..., 72) = 72
read(4, "10.137.0.14", 11)              = 11
ioctl(5, IOCTL_EVTCHN_NOTIFY, 0x7fff5103796c) = 0
ioctl(5, IOCTL_EVTCHN_NOTIFY, 0x7fff5103796c) = 0
write(4, "\7/mapped-ip/10.137.0.14/visible-"..., 72) = 72
rt_sigprocmask(SIG_BLOCK, [TERM], [], 8) = 0
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 1 (in [4], left {tv_sec=9, tv_nsec=999817609})
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
ioctl(6, _IOC(0, 0x50, 0, 0x30), 0x7fff510385c0) = 0
read(4, "\1/mapped-ip/10.137.0.14/visible-"..., 72) = 72
read(4, "10.137.0.19", 11)              = 11
write(4, "\7/mapped-ip/10.137.0.14/visible-"..., 72) = 72
rt_sigprocmask(SIG_BLOCK, [TERM], [], 8) = 0
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 1 (in [5], left {tv_sec=9, tv_nsec=996950903})
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
read(5, "{\0\0\0", 4)                   = 4
write(5, "{\0\0\0", 4)                  = 4
ioctl(6, _IOC(0, 0x50, 0, 0x30), 0x7fff510385c0) = 0
rt_sigprocmask(SIG_BLOCK, [TERM], [], 8) = 0
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 0 (Timeout)
...

So it seems that it was waiting for FD 4, which is some kind of local socket connection, not an event channel. So I guess the problem is that whatever is calling QubesDB is taking its time with the updates.

While I was looking at the code, there does also appear to be a race in the vchan path:

QubesDB sends data to the firewall but the buffer is full. libvchan_send blocks, consuming all notifications until space is available.
The firewall sends a message (e.g. QDB_RESP_OK) and signals to say the data is ready.
QubesDB ignores it because it's waiting for space in the other direction, not data.
The firewall confirms that it read the message and signals dom0.
The send completes and dom0 starts blocking on the event channel. It doesn't notice that data is now ready.

However, I do not believe this is the cause of the bug reported, because:

The rings are mostly empty during the pselect delays.
I don't see why a lack of an incoming request from the firewall would prevent QubesDB from sending more updates.

Here's a snapshot of the state of the vchan half way though the 10s timeout, taken by the firewall:

((role
  (Client
   ((shr_map ((mapping <buffer>) (grants ((0 209)))))
    (read_map (((mapping <buffer>) (grants ((0 76))))))
    (write_map (((mapping <buffer>) (grants ((0 214)))))))))
 (cli_notify 1) (srv_notify 1) (left_order ((External 0)))
 (right_order ((External 0))) (remote_domid 0) (remote_port 111)
 (client_state (Connected)) (server_state (Connected))
 (read_producer 4805) (read_consumer 4805) (read "(omitted)")
 (write_producer 216) (write_consumer 216) (write "(omitted)")
 (ack_up_to 4805) (closed false))

So, both sides want to be notified of writes (notify bit 1) but neither cares about reads, and both rings are empty (consumer=producer). The last time the firewall signalled an event, the state was:

send_notify: bit=1 prev=1 (idx=22) notify=true
  state:
  ((role
    (Client
     ((shr_map ((mapping <buffer>) (grants ((0 209)))))
      (read_map (((mapping <buffer>) (grants ((0 76))))))
      (write_map (((mapping <buffer>) (grants ((0 214)))))))))
   (cli_notify 0) (srv_notify 0) (left_order ((External 0)))
   (right_order ((External 0))) (remote_domid 0) (remote_port 111)
   (client_state (Connected)) (server_state (Connected))
   (read_producer 4805) (read_consumer 4639) (read "(omitted)")
   (write_producer 144) (write_consumer 72) (write "(omitted)")
   (ack_up_to 4805) (closed false))

So even if this isn't completely up-to-date, QubesDB must have seen that there was plenty of free space in the rings.

talex5 commented 5 years ago

@marmarek : the above is based on my understanding of the vchan code from reading the source. I've documented what I learnt here: https://github.com/talex5/spec-vchan/blob/master/vchan.tla I see from the Xen logs that you worked on this code - if you spot any inaccuracies in my spec let me know!

marmarek commented 5 years ago

I'll have time to look into it probably only after New Year. But if you by a chance will be at 35C3, we can look into this there. There will be Qubes OS assembly.

talex5 commented 5 years ago

Doing sudo journalctl -f -u qubesd in dom0, I see a lot of this while it's starting the firewall:

Dec 29 18:27:01 dom0 qubesd[8324]: libvirt.libvirtError: internal error: libxenlight failed to attach network device
Dec 29 18:27:01 dom0 qubesd[8324]: Attaching network
Dec 29 18:27:01 dom0 qubesd[8324]: Cannot detach old network
Dec 29 18:27:01 dom0 qubesd[8324]: Traceback (most recent call last):
Dec 29 18:27:01 dom0 qubesd[8324]:   File "/usr/lib/python3.5/site-packages/qubes/vm/mix/net.py", line 275, in on_domain_started
Dec 29 18:27:01 dom0 qubesd[8324]:     vm.detach_network()
Dec 29 18:27:01 dom0 qubesd[8324]:   File "/usr/lib/python3.5/site-packages/qubes/vm/mix/net.py", line 340, in detach_network
Dec 29 18:27:01 dom0 qubesd[8324]:     vm=self))
Dec 29 18:27:01 dom0 qubesd[8324]:   File "/usr/lib/python3.5/site-packages/qubes/app.py", line 93, in wrapper
Dec 29 18:27:01 dom0 qubesd[8324]:     return attr(*args, **kwargs)
Dec 29 18:27:01 dom0 qubesd[8324]:   File "/usr/lib64/python3.5/site-packages/libvirt.py", line 1170, in detachDevice
Dec 29 18:27:01 dom0 qubesd[8324]:     if ret == -1: raise libvirtError ('virDomainDetachDevice() failed', dom=self)
Dec 29 18:27:01 dom0 qubesd[8324]: libvirt.libvirtError: operation failed: no device matching mac address 00:16:3e:5e:6c:00 found
Dec 29 18:27:11 dom0 qubesd[8324]: Cannot attach network
Dec 29 18:27:11 dom0 qubesd[8324]: Traceback (most recent call last):
Dec 29 18:27:11 dom0 qubesd[8324]:   File "/usr/lib/python3.5/site-packages/qubes/vm/mix/net.py", line 280, in on_domain_started
Dec 29 18:27:11 dom0 qubesd[8324]:     vm.attach_network()
Dec 29 18:27:11 dom0 qubesd[8324]:   File "/usr/lib/python3.5/site-packages/qubes/vm/mix/net.py", line 327, in attach_network
Dec 29 18:27:11 dom0 qubesd[8324]:     vm=self))
Dec 29 18:27:11 dom0 qubesd[8324]:   File "/usr/lib/python3.5/site-packages/qubes/app.py", line 93, in wrapper
Dec 29 18:27:11 dom0 qubesd[8324]:     return attr(*args, **kwargs)
Dec 29 18:27:11 dom0 qubesd[8324]:   File "/usr/lib64/python3.5/site-packages/libvirt.py", line 563, in attachDevice
Dec 29 18:27:11 dom0 qubesd[8324]:     if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
Dec 29 18:27:11 dom0 qubesd[8324]: libvirt.libvirtError: internal error: libxenlight failed to attach network device
Dec 29 18:27:11 dom0 qubesd[8324]: Attaching network

Note the 10s delay before the Cannot attach network error. Oddly, though, the domains do seem to get reattached just fine, and they can still access the Internet after the firewall is restarted.

marmarek commented 5 years ago

@talex5

so I read the Xen network device attachment specification to check that the firewall implemented that correctly.

I’m kidding, of course. There isn’t any such specification.

There is state diagram for block devices, which is the same for all xenbus devices, including network one: https://github.com/xen-project/xen/blob/master/xen/include/public/io/blkif.h#L452-L523

I recently debugged similar issue with misbehaving backend and 10s timeout looks familiar. It's about waiting for the backend to set XenbusStateInitWait (2) state. You can get more details in /var/log/libvirt/libxl/libxl-driver.log. If that isn't enough, increase verbosity in /etc/libvirt/libvirtd.conf:

# the lower the more verbose, only this (libvirt-global) value is used to configure libxl internal log level
log_level = 1
# now try to avoid libvirt filling all the disk with it's own messages
log_filters = "3:event 3:object 3:net"

With this, you should get full state transitions log, including what exactly was expected and failed.

talex5 commented 5 years ago

Ah, that explained it! When you restart with client VMs attached, qubesd waits until the clients are reconnected to the network before starting the GUI daemon. But mirage-firewall waits for the GUI daemon before accepting new clients. I've removed the wait for the GUI.

talex5 commented 5 years ago

I did, however, end up with a whole stack of UP ethN devices in the app VMs, which might not be good. e.g.

[user@com ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.0.15  netmask 255.255.255.255  broadcast 10.255.255.255
        inet6 fe80::216:3eff:fe5e:6c00  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:5e:6c:00  txqueuelen 1000  (Ethernet)
        RX packets 334  bytes 283264 (276.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 351  bytes 47764 (46.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.0.15  netmask 255.255.255.255  broadcast 10.255.255.255
        inet6 fe80::216:3eff:fe5e:6c00  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:5e:6c:00  txqueuelen 1000  (Ethernet)
        RX packets 26  bytes 5629 (5.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 71  bytes 10663 (10.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.0.15  netmask 255.255.255.255  broadcast 10.255.255.255
        inet6 fe80::216:3eff:fe5e:6c00  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:5e:6c:00  txqueuelen 1000  (Ethernet)
        RX packets 154  bytes 31221 (30.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 388  bytes 46070 (44.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.0.15  netmask 255.255.255.255  broadcast 10.255.255.255
        inet6 fe80::216:3eff:fe5e:6c00  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:5e:6c:00  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 587  bytes 25218 (24.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.0.15  netmask 255.255.255.255  broadcast 10.255.255.255
        inet6 fe80::216:3eff:fe5e:6c00  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:5e:6c:00  txqueuelen 1000  (Ethernet)
        RX packets 210  bytes 59434 (58.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 182  bytes 39496 (38.5 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.0.15  netmask 255.255.255.255  broadcast 10.255.255.255
        inet6 fe80::216:3eff:fe5e:6c00  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:5e:6c:00  txqueuelen 1000  (Ethernet)
        RX packets 68  bytes 36027 (35.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 102  bytes 18314 (17.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.0.15  netmask 255.255.255.255  broadcast 10.255.255.255
        inet6 fe80::216:3eff:fe5e:6c00  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:5e:6c:00  txqueuelen 1000  (Ethernet)
        RX packets 109  bytes 36661 (35.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 140  bytes 24896 (24.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.0.15  netmask 255.255.255.255  broadcast 10.255.255.255
        inet6 fe80::216:3eff:fe5e:6c00  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:5e:6c:00  txqueuelen 1000  (Ethernet)
        RX packets 9  bytes 1308 (1.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 53  bytes 4432 (4.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.137.0.15  netmask 255.255.255.255  broadcast 10.255.255.255
        inet6 fe80::216:3eff:fe5e:6c00  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:5e:6c:00  txqueuelen 1000  (Ethernet)
        RX packets 485  bytes 307869 (300.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 503  bytes 86885 (84.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1435  bytes 127253 (124.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1435  bytes 127253 (124.2 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

andyhhp commented 5 years ago

@marmarek

There is state diagram for block devices, which is the same for all xenbus devices, including network one:

Sadly not true. The connection setup is subtly different between block and net. Block has an extra step in the backend where it tries to open the block device. This is infuriating because it prevents writing a single reference implementation of the state transition logic.

To @talex5 's point in his TLA/vchan blog, yes it would be lovely if any of this was written down.

mirage / mirage-qubes

qubesdb sync blocking dom0+domU processes #25