The firewall doesn't keep any state between runs, so if starting it a second time is causing trouble, then dom0 must be doing something different. What happens if you kill mirage-firewall and start sys-firewall, or the other way around? Maybe there is some QubesDB shutdown protocol you're supposed to do, but mirage-firewall doesn't bother. It does sound like a bug in dom0 though.
the difference between the runs is that the "second time" is with a handful of downstream VMs already running. this means there are a lot more qubesdb rows to transfer in the initial qubesdb sync. and the sending process in dom0 seems to take a pselect-nap between sending each appvm tuple.
if this was comms by tcp, my first guess would be "the client is not flushing after ack-write" or so. might be a timing issue as well, with mirage being "too fast".
the blocking itself (and it having global impact) is surely a qubes/dom0 fault. (DoS potential) but mirage must be doing something different (compared to the qubes-C implementation) to trigger it. and i very much would prefer my mfw not to take several minutes on startup, more so considering https://github.com/talex5/qubes-mirage-firewall/issues/35 is hitting me "daily".
were you able to repro it? "worked" here on first try on a clean 4.0.1-rc1 install. (was making sure it's not some problem with my 4.0 desktop)
Do you know if there is a spec for vchan somewhere, or is the documentation just the C code?
I'm afraid the latter :/
I assume it's this pselect: https://github.com/QubesOS/qubes-core-qubesdb/blob/master/daemon/db-daemon.c#L405
It doesn't look to me as though this pselect can hold up a transfer.
If it's part of the firewall's initial sync, we go to handle_vchan_data, then handle_multiread. That writes all the entries without a pselect, so it should either be fast or block forever. No 10s timeout.
If it's forwarding an update from a local client, we go to handle_client_data, then handle_write, write_vchan_or_client, libvchan_send. This appears to be a blocking send, so again there will be no timeout.
The only thing the pselect could be waiting for from the firewall is a QDB_RESP_OK. But it just throws such messages away and doesn't trigger anything else to happen.
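To make the loop structure above easier to follow, here is a minimal sketch of a pselect-based main loop of the shape being described: a 10-second timeout over the vchan event fd and a local client socket, dispatching to the two handlers named above. This is an illustration only, not the actual qubes-core-qubesdb code; the handler bodies and fd handling are stand-ins.

/* Simplified sketch of the daemon main-loop shape discussed above; NOT the
 * actual qubes-core-qubesdb source. Handler bodies are empty stand-ins. */
#include <signal.h>
#include <sys/select.h>
#include <time.h>

static void handle_vchan_data(void)    { /* e.g. the multiread sync: writes entries back to back */ }
static void handle_client_data(int fd) { (void)fd; /* forward one local update over the vchan */ }

static void main_loop(int vchan_fd, int client_fd)
{
    sigset_t empty;
    sigemptyset(&empty);   /* as in the strace: SIGTERM is only unblocked while waiting */

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(vchan_fd, &rfds);
        FD_SET(client_fd, &rfds);
        int nfds = (vchan_fd > client_fd ? vchan_fd : client_fd) + 1;

        struct timespec timeout = { .tv_sec = 10, .tv_nsec = 0 };   /* the 10s timeout seen above */
        int ready = pselect(nfds, &rfds, NULL, NULL, &timeout, &empty);

        if (ready <= 0)
            continue;                        /* timeout or signal: nothing arrived for 10 seconds */
        if (FD_ISSET(vchan_fd, &rfds))
            handle_vchan_data();             /* message from the firewall over the vchan */
        if (FD_ISSET(client_fd, &rfds))
            handle_client_data(client_fd);   /* a local client (e.g. qubesd) pushing an update */
    }
}

The point of the sketch is that the loop only makes progress when one of the watched fds becomes readable, so if whatever is feeding the local socket is slow to send, the daemon simply sits in pselect until data arrives or the 10s expire.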
I managed to reproduce the problem myself by restarting the firewall while it still had client VMs attached.
The multiread sync worked fine, but then QubesDB starts pushing updates, two at a time, with 10s between them.
Here's a bit from the strace:
...
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 0 (Timeout)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
ioctl(6, _IOC(0, 0x50, 0, 0x30), 0x7fff510385c0) = 0
rt_sigprocmask(SIG_BLOCK, [TERM], [], 8) = 0
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 1 (in [4], left {tv_sec=9, tv_nsec=919724312})
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
ioctl(6, _IOC(0, 0x50, 0, 0x30), 0x7fff510385c0) = 0
read(4, "\1/mapped-ip/10.137.0.14/visible-"..., 72) = 72
read(4, "10.137.0.14", 11) = 11
ioctl(5, IOCTL_EVTCHN_NOTIFY, 0x7fff5103796c) = 0
ioctl(5, IOCTL_EVTCHN_NOTIFY, 0x7fff5103796c) = 0
write(4, "\7/mapped-ip/10.137.0.14/visible-"..., 72) = 72
rt_sigprocmask(SIG_BLOCK, [TERM], [], 8) = 0
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 1 (in [4], left {tv_sec=9, tv_nsec=999817609})
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
ioctl(6, _IOC(0, 0x50, 0, 0x30), 0x7fff510385c0) = 0
read(4, "\1/mapped-ip/10.137.0.14/visible-"..., 72) = 72
read(4, "10.137.0.19", 11) = 11
write(4, "\7/mapped-ip/10.137.0.14/visible-"..., 72) = 72
rt_sigprocmask(SIG_BLOCK, [TERM], [], 8) = 0
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 1 (in [5], left {tv_sec=9, tv_nsec=996950903})
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
read(5, "{\0\0\0", 4) = 4
write(5, "{\0\0\0", 4) = 4
ioctl(6, _IOC(0, 0x50, 0, 0x30), 0x7fff510385c0) = 0
rt_sigprocmask(SIG_BLOCK, [TERM], [], 8) = 0
pselect6(9, [0 4 5 8], [], NULL, {tv_sec=10, tv_nsec=0}, {[], 8}) = 0 (Timeout)
...
So it seems that it was waiting for FD 4, which is some kind of local socket connection, not an event channel. So I guess the problem is that whatever is calling QubesDB is taking its time with the updates.
While I was looking at the code, there does also appear to be a race in the vchan path: libvchan_send blocks, consuming all notifications until space is available, while the firewall sends its response (QDB_RESP_OK) and signals to say the data is ready. However, I do not believe this is the cause of the bug reported, because:
Here's a snapshot of the state of the vchan half way through the 10s timeout, taken by the firewall:
((role
(Client
((shr_map ((mapping <buffer>) (grants ((0 209)))))
(read_map (((mapping <buffer>) (grants ((0 76))))))
(write_map (((mapping <buffer>) (grants ((0 214)))))))))
(cli_notify 1) (srv_notify 1) (left_order ((External 0)))
(right_order ((External 0))) (remote_domid 0) (remote_port 111)
(client_state (Connected)) (server_state (Connected))
(read_producer 4805) (read_consumer 4805) (read "(omitted)")
(write_producer 216) (write_consumer 216) (write "(omitted)")
(ack_up_to 4805) (closed false))
So, both sides want to be notified of writes (notify bit 1) but neither cares about reads, and both rings are empty (consumer=producer). The last time the firewall signalled an event, the state was:
send_notify: bit=1 prev=1 (idx=22) notify=true
state:
((role
(Client
((shr_map ((mapping <buffer>) (grants ((0 209)))))
(read_map (((mapping <buffer>) (grants ((0 76))))))
(write_map (((mapping <buffer>) (grants ((0 214)))))))))
(cli_notify 0) (srv_notify 0) (left_order ((External 0)))
(right_order ((External 0))) (remote_domid 0) (remote_port 111)
(client_state (Connected)) (server_state (Connected))
(read_producer 4805) (read_consumer 4639) (read "(omitted)")
(write_producer 144) (write_consumer 72) (write "(omitted)")
(ack_up_to 4805) (closed false))
So even if this isn't completely up-to-date, QubesDB must have seen that there was plenty of free space in the rings.
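To spell out the arithmetic behind "plenty of free space": in a single-producer/single-consumer ring the used space is producer minus consumer, so producer == consumer means the ring is empty regardless of its size. Below is a tiny self-contained check using the numbers from the snapshot; the RING_SIZE value is an assumption for the example only.

#include <stdint.h>
#include <stdio.h>

/* Ring-occupancy arithmetic for a single-producer/single-consumer ring.
 * RING_SIZE is an assumed example value; the "ring is empty" conclusion
 * follows from producer == consumer alone. */
#define RING_SIZE 4096u

static uint32_t ring_used(uint32_t producer, uint32_t consumer)
{
    return producer - consumer;   /* unsigned wrap-around is intended */
}

int main(void)
{
    /* From the snapshot above: write_producer 216, write_consumer 216 */
    uint32_t used = ring_used(216, 216);
    printf("used=%u free=%u\n", used, RING_SIZE - used);   /* prints: used=0 free=4096 */
    return 0;
}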
@marmarek : the above is based on my understanding of the vchan code from reading the source. I've documented what I learnt here: https://github.com/talex5/spec-vchan/blob/master/vchan.tla I see from the Xen logs that you worked on this code - if you spot any inaccuracies in my spec let me know!
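As a side note on the race sketched above: the pattern is the classic lost-notification one, where a single event channel carries both "I freed ring space" and "I wrote data" signals. The snippet below is a generic, hypothetical illustration of that pattern; the names and helpers are invented and it is not libvchan's code.

/* Generic illustration of a lost-notification race; NOT libvchan. One event
 * channel carries both "space freed" and "data written" notifications, so a
 * blocking send that drains events while waiting for space can swallow a
 * "data ready" signal (e.g. a QDB_RESP_OK reply) intended for the reader. */
struct ring { unsigned producer, consumer, size; };

static void wait_for_event(void) { /* stub: block until the peer signals the shared channel */ }
static void notify_peer(void)    { /* stub: signal the peer over the same channel */ }

static unsigned ring_free(const struct ring *r)
{
    return r->size - (r->producer - r->consumer);
}

static void blocking_send(struct ring *tx, unsigned len)
{
    while (ring_free(tx) < len) {
        wait_for_event();    /* may consume a notification that actually meant
                              * "the peer has written data for you to read",
                              * leaving whoever should read it unwoken */
    }
    tx->producer += len;     /* (copying the payload itself is omitted here) */
    notify_peer();
}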
I'll probably only have time to look into it after New Year. But if by any chance you will be at 35C3, we can look into it there. There will be a Qubes OS assembly.
Doing sudo journalctl -f -u qubesd in dom0, I see a lot of this while it's starting the firewall:
Dec 29 18:27:01 dom0 qubesd[8324]: libvirt.libvirtError: internal error: libxenlight failed to attach network device
Dec 29 18:27:01 dom0 qubesd[8324]: Attaching network
Dec 29 18:27:01 dom0 qubesd[8324]: Cannot detach old network
Dec 29 18:27:01 dom0 qubesd[8324]: Traceback (most recent call last):
Dec 29 18:27:01 dom0 qubesd[8324]: File "/usr/lib/python3.5/site-packages/qubes/vm/mix/net.py", line 275, in on_domain_started
Dec 29 18:27:01 dom0 qubesd[8324]: vm.detach_network()
Dec 29 18:27:01 dom0 qubesd[8324]: File "/usr/lib/python3.5/site-packages/qubes/vm/mix/net.py", line 340, in detach_network
Dec 29 18:27:01 dom0 qubesd[8324]: vm=self))
Dec 29 18:27:01 dom0 qubesd[8324]: File "/usr/lib/python3.5/site-packages/qubes/app.py", line 93, in wrapper
Dec 29 18:27:01 dom0 qubesd[8324]: return attr(*args, **kwargs)
Dec 29 18:27:01 dom0 qubesd[8324]: File "/usr/lib64/python3.5/site-packages/libvirt.py", line 1170, in detachDevice
Dec 29 18:27:01 dom0 qubesd[8324]: if ret == -1: raise libvirtError ('virDomainDetachDevice() failed', dom=self)
Dec 29 18:27:01 dom0 qubesd[8324]: libvirt.libvirtError: operation failed: no device matching mac address 00:16:3e:5e:6c:00 found
Dec 29 18:27:11 dom0 qubesd[8324]: Cannot attach network
Dec 29 18:27:11 dom0 qubesd[8324]: Traceback (most recent call last):
Dec 29 18:27:11 dom0 qubesd[8324]: File "/usr/lib/python3.5/site-packages/qubes/vm/mix/net.py", line 280, in on_domain_started
Dec 29 18:27:11 dom0 qubesd[8324]: vm.attach_network()
Dec 29 18:27:11 dom0 qubesd[8324]: File "/usr/lib/python3.5/site-packages/qubes/vm/mix/net.py", line 327, in attach_network
Dec 29 18:27:11 dom0 qubesd[8324]: vm=self))
Dec 29 18:27:11 dom0 qubesd[8324]: File "/usr/lib/python3.5/site-packages/qubes/app.py", line 93, in wrapper
Dec 29 18:27:11 dom0 qubesd[8324]: return attr(*args, **kwargs)
Dec 29 18:27:11 dom0 qubesd[8324]: File "/usr/lib64/python3.5/site-packages/libvirt.py", line 563, in attachDevice
Dec 29 18:27:11 dom0 qubesd[8324]: if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
Dec 29 18:27:11 dom0 qubesd[8324]: libvirt.libvirtError: internal error: libxenlight failed to attach network device
Dec 29 18:27:11 dom0 qubesd[8324]: Attaching network
Note the 10s delay before the Cannot attach network error. Oddly, though, the domains do seem to get reattached just fine, and they can still access the Internet after the firewall is restarted.
@talex5
so I read the Xen network device attachment specification to check that the firewall implemented that correctly.
I’m kidding, of course. There isn’t any such specification.
There is a state diagram for block devices, which is the same for all xenbus devices, including network ones: https://github.com/xen-project/xen/blob/master/xen/include/public/io/blkif.h#L452-L523
I recently debugged a similar issue with a misbehaving backend, and the 10s timeout looks familiar. It's about waiting for the backend to set the XenbusStateInitWait (2) state. You can get more details in /var/log/libvirt/libxl/libxl-driver.log. If that isn't enough, increase verbosity in /etc/libvirt/libvirtd.conf:
# the lower the more verbose, only this (libvirt-global) value is used to configure libxl internal log level
log_level = 1
# now try to avoid libvirt filling all the disk with its own messages
log_filters = "3:event 3:object 3:net"
With this, you should get a full log of the state transitions, including what exactly was expected and what failed.
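For reference, these are the xenbus states being discussed; the enum values come from Xen's public io/xenbus.h, so "XenbusStateInitWait (2)" is the backend state libxl waits for. The wait loop underneath is only a hypothetical sketch of that kind of wait, with the 10-second deadline seen in this issue; it is not libxl's actual implementation (which uses xenstore watches rather than polling).

#include <stdbool.h>
#include <time.h>
#include <unistd.h>

/* State values from Xen's public io/xenbus.h. */
enum xenbus_state {
    XenbusStateUnknown       = 0,
    XenbusStateInitialising  = 1,
    XenbusStateInitWait      = 2,   /* early init done, waiting for the peer's details */
    XenbusStateInitialised   = 3,
    XenbusStateConnected     = 4,
    XenbusStateClosing       = 5,
    XenbusStateClosed        = 6,
    XenbusStateReconfiguring = 7,
    XenbusStateReconfigured  = 8,
};

/* Hypothetical helper: read the backend's "state" node from xenstore (stubbed). */
static enum xenbus_state read_backend_state(const char *backend_path)
{
    (void)backend_path;
    return XenbusStateInitialising;
}

/* Hypothetical sketch: wait up to timeout_s for the backend to reach InitWait.
 * Returning false after ~10 s would correspond to the kind of failure logged above. */
static bool wait_for_backend_initwait(const char *backend_path, int timeout_s)
{
    time_t deadline = time(NULL) + timeout_s;
    while (time(NULL) < deadline) {
        if (read_backend_state(backend_path) == XenbusStateInitWait)
            return true;
        sleep(1);   /* a real implementation would block on a xenstore watch */
    }
    return false;
}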
Ah, that explained it! When you restart with client VMs attached, qubesd waits until the clients are reconnected to the network before starting the GUI daemon. But mirage-firewall waits for the GUI daemon before accepting new clients. I've removed the wait for the GUI.
I did, however, end up with a whole stack of UP ethN devices in the app VMs, which might not be good. e.g.
[user@com ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.0.15 netmask 255.255.255.255 broadcast 10.255.255.255
inet6 fe80::216:3eff:fe5e:6c00 prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:5e:6c:00 txqueuelen 1000 (Ethernet)
RX packets 334 bytes 283264 (276.6 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 351 bytes 47764 (46.6 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.0.15 netmask 255.255.255.255 broadcast 10.255.255.255
inet6 fe80::216:3eff:fe5e:6c00 prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:5e:6c:00 txqueuelen 1000 (Ethernet)
RX packets 26 bytes 5629 (5.4 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 71 bytes 10663 (10.4 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.0.15 netmask 255.255.255.255 broadcast 10.255.255.255
inet6 fe80::216:3eff:fe5e:6c00 prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:5e:6c:00 txqueuelen 1000 (Ethernet)
RX packets 154 bytes 31221 (30.4 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 388 bytes 46070 (44.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.0.15 netmask 255.255.255.255 broadcast 10.255.255.255
inet6 fe80::216:3eff:fe5e:6c00 prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:5e:6c:00 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 587 bytes 25218 (24.6 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.0.15 netmask 255.255.255.255 broadcast 10.255.255.255
inet6 fe80::216:3eff:fe5e:6c00 prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:5e:6c:00 txqueuelen 1000 (Ethernet)
RX packets 210 bytes 59434 (58.0 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 182 bytes 39496 (38.5 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.0.15 netmask 255.255.255.255 broadcast 10.255.255.255
inet6 fe80::216:3eff:fe5e:6c00 prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:5e:6c:00 txqueuelen 1000 (Ethernet)
RX packets 68 bytes 36027 (35.1 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 102 bytes 18314 (17.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.0.15 netmask 255.255.255.255 broadcast 10.255.255.255
inet6 fe80::216:3eff:fe5e:6c00 prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:5e:6c:00 txqueuelen 1000 (Ethernet)
RX packets 109 bytes 36661 (35.8 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 140 bytes 24896 (24.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.0.15 netmask 255.255.255.255 broadcast 10.255.255.255
inet6 fe80::216:3eff:fe5e:6c00 prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:5e:6c:00 txqueuelen 1000 (Ethernet)
RX packets 9 bytes 1308 (1.2 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 53 bytes 4432 (4.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.137.0.15 netmask 255.255.255.255 broadcast 10.255.255.255
inet6 fe80::216:3eff:fe5e:6c00 prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:5e:6c:00 txqueuelen 1000 (Ethernet)
RX packets 485 bytes 307869 (300.6 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 503 bytes 86885 (84.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 1435 bytes 127253 (124.2 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1435 bytes 127253 (124.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
@marmarek
There is a state diagram for block devices, which is the same for all xenbus devices, including network ones:
Sadly not true. The connection setup is subtly different between block and net. Block has an extra step in the backend where it tries to open the block device. This is infuriating because it prevents writing a single reference implementation of the state transition logic.
To @talex5's point in his TLA/vchan blog post: yes, it would be lovely if any of this were written down.
create/clone a mirage firewall
attach 5+ downstream vms to it
start all these vms
qvm-kill mirage-fw
optional: start tail -f on the mfw console log
optional: start strace -f on qubesdb
qvm-start mirage-fw
try to do anything in qubes-manager
note how qubes-manager doesn't react at all, not even to local stuff like changing the highlighted row
looking at mfw-log this is the initial sync of the qubesdb
the client infos arrive with precisely 10 seconds of delay in between
looking at qdb-strace it gets caught in its mainloop pselect, which has a 10sec timeout
so total blockage time is 10 sec * number of downstream vms
replace mirage-fw by a sys-fw clone in this setup, kill+start again, no blockage
both mfw and sysfw seem to be doing a multiread to sync on startup. i was unable to find any significant difference from skimming the code.
but no matter what mfw is doing different there, it should not block dom0 that hard.
filing this on both qubes and mirage side since i see room for improvement on either. (though this might turn purely into a qubes-side thing, depending on closer analysis)
qubes side issue here: https://github.com/QubesOS/qubes-issues/issues/4498