Closed: xaki23 closed this issue 2 years ago.
Hi, thanks for this valuable reproducible protocol!
I can see that too now, and I added some logs in `ocaml-xenstore/.../xs_client_lwt.ml`. The following shows what is logged during a `qvm-start vm && qvm-kill vm`:
...
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [application] rpc hint is Read backend/vif/141/0/ip
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [client_net] add client vif {domid=141;device_id=0} with IP 10.137.0.8
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [application] rpc hint is Read backend/vif/141/0/frontend
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [application] rpc hint is Watch backend/vif/141 4:xs_client.wait
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [application] rpc hint is Read /local/domain/141/device/vif/0/mac
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [application] rpc hint is Watch backend/vif/141/0/ip 4:xs_client.wait
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [application] rpc hint is transaction_start Transaction_start
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [qubes.db] got update: "/connected-ips" = "10.137.0.6 10.137.0.2 10.137.0.8"
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [qubes.db] got update: "/connected-ips6" = ""
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [qubes.db] got update: "/mapped-ip/10.137.0.8/visible-ip" = "10.137.0.8"
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [qubes.db] got update: "/mapped-ip/10.137.0.8/visible-gateway" = "10.137.0.7"
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [qubes.db] got rm "/qubes-firewall/10.137.0.8/"
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [qubes.db] got update: "/qubes-firewall/10.137.0.8/policy" = "drop"
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [qubes.db] got update: "/qubes-firewall/10.137.0.8/0000" = "action=accept"
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [qubes.db] got update: "/qubes-firewall/10.137.0.8" = ""
...
[2022-10-28 04:08:23] 2022-10-28 08:08:23 -00:00: INF [application] rpc hint is Watch /local/domain/141/device/vif/0/state 9:xs_client.wait
...
[2022-10-28 04:08:32] 2022-10-28 08:08:32 -00:00: INF [application] rpc hint is Unwatch /local/domain/141/device/vif/0/state 9:xs_client.wait
...
[2022-10-28 04:08:32] 2022-10-28 08:08:32 -00:00: INF [client_net] Client 141 (IP: 10.137.0.8) ready
[2022-10-28 04:08:32] 2022-10-28 08:08:32 -00:00: INF [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
...
[2022-10-28 04:08:32] 2022-10-28 08:08:32 -00:00: INF [application] rpc hint is Watch /local/domain/141/device/vif/0/state 10:xs_client.wait
...
[2022-10-28 04:09:50] 2022-10-28 08:09:50 -00:00: INF [application] rpc hint is Rm backend/vif/141/0
[2022-10-28 04:09:50] 2022-10-28 08:09:50 -00:00: INF [application] rpc hint is Unwatch /local/domain/141/device/vif/0/state 10:xs_client.wait
...
[2022-10-28 04:09:50] 2022-10-28 08:09:50 -00:00: INF [client_net] client {domid=141;device_id=0} has gone
Here we have 2 `Watch`es that are not `Unwatch`ed, and this may explain why, at some point, we have too many watches running (as the client id is different for each run) and we get an over_quota (`E2BIG`) error. I guess it's possible to unwatch in the firewall when we print the `client %d has gone` message (at client_net.ml:155) or in the cleanup callbacks (client_net.ml:108 or client_net.ml:128).
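To make that idea concrete, here is a minimal, self-contained OCaml sketch, not the actual client_net.ml code: the `Xenstore` module, the path list, and the placeholder token are all stand-ins I made up for illustration. The point is simply to remember which paths were watched for a client when it appears and to release them on the "client has gone" path:

```ocaml
(* Hypothetical sketch: Xenstore.watch/unwatch stand in for whatever client
   calls the firewall really uses. *)
module Xenstore = struct
  let watch ~path ~token = Printf.printf "Watch %s %s\n" path token
  let unwatch ~path ~token = Printf.printf "Unwatch %s %s\n" path token
end

type client = {
  domid : int;
  device_id : int;
  mutable watched : (string * string) list;  (* (path, token) pairs *)
}

let add_client ~domid ~device_id =
  let c = { domid; device_id; watched = [] } in
  let paths = [
    Printf.sprintf "backend/vif/%d" domid;
    Printf.sprintf "backend/vif/%d/%d/ip" domid device_id;
  ] in
  List.iter
    (fun path ->
       let token = "xs_client.wait" in  (* placeholder token *)
       Xenstore.watch ~path ~token;
       c.watched <- (path, token) :: c.watched)
    paths;
  c

(* Called where the firewall currently only logs "client %d has gone". *)
let client_gone c =
  List.iter (fun (path, token) -> Xenstore.unwatch ~path ~token) c.watched;
  c.watched <- [];
  Printf.printf "client {domid=%d;device_id=%d} has gone\n" c.domid c.device_id

let () =
  let c = add_client ~domid:141 ~device_id:0 in
  client_gone c
```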
But I wonder if we might run into another, partially related, problem: is it possible to connect 64 clients simultaneously to the same qubes-firewall? Unfortunately I don't have enough memory to try running that many VMs on my laptop :/
Or maybe it makes more sense to add the unwatch call in `watcher.cancel`: https://github.com/mirage/ocaml-xenstore/blob/2168f073ab5f0d8a609db2bc8c52d67fcd814398/client_lwt/xs_client_lwt.ml#L122 ?
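For reference, a minimal Lwt sketch of that idea, independent of the real xs_client_lwt.ml internals: `send_watch`/`send_unwatch` here are hypothetical stand-ins for the actual RPCs, and the only point is that the thread returned by a wait-style call gets an on-cancel hook that issues the Unwatch.

```ocaml
(* Requires the lwt package. Not the real xs_client_lwt code. *)
let send_watch path token = Printf.printf "Watch %s %s\n%!" path token
let send_unwatch path token = Printf.printf "Unwatch %s %s\n%!" path token

(* Wait for a watch event on [path]; unregister the watch if we are cancelled. *)
let wait_for_event ~path ~token : string Lwt.t =
  let t, _u = Lwt.task () in
  send_watch path token;
  Lwt.on_cancel t (fun () -> send_unwatch path token);
  t

let () =
  Lwt_main.run
    (let t = wait_for_event ~path:"backend/vif/141" ~token:"4:xs_client.wait" in
     Lwt.cancel t;  (* the firewall stops waiting: the hook sends the Unwatch *)
     Lwt.return_unit)
```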
> But I wonder if we might run into another, partially related, problem: is it possible to connect 64 clients simultaneously to the same qubes-firewall? Unfortunately I don't have enough memory to try running that many VMs on my laptop :/
Interestingly, it didn't even get to 64... it crashed on VM 43 startup. Same backtrace, reproducible. With 42 downstream VMs running, restarting the mirage-fw works... but when trying to add one more it breaks.
A Linux FW starts acting up even earlier, with 37 downstream VMs working; after that, new ones don't get configured and outbound q-rpc from the FW starts to fail. No crash though.
I suggest we ignore this scenario for the moment since a) the limit is higher with mirage than with Linux, b) it's probably a rare situation, and c) it's easily mitigated by running another mirage FW.
(edit) And the reason for "42 instead of 64" is in your log: a still-running VM has three watches, a stopped one only two, because one actually gets unwatched. 64/3*2 == 42... so yes, "42 concurrently running downstream VMs" actually makes sense.
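That arithmetic checks out if we assume the watch budget is whatever would allow exactly 64 clients at two watches each; the 64*2 figure below is my assumption taken from that expectation, not a value read from xenstored.

```ocaml
(* Back-of-the-envelope check of the "42 instead of 64" reasoning. *)
let watch_budget   = 64 * 2  (* budget implied by the 64-client expectation *)
let watches_per_vm = 3       (* a still-running downstream VM holds three watches *)
let () = Printf.printf "max running VMs: %d\n" (watch_budget / watches_per_vm)
(* prints: max running VMs: 42 *)
```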
I agree that we can safely ignore that scenario :)
It's still a patch that needs to be written properly, but I think we're on the right track with unwatching the vif for disconnecting guests. With this commit https://github.com/palainp/qubes-mirage-firewall/commit/f02d71386752953fc5f4acd84efb3867f65be5a4 I managed to run `qvm-start vm && qvm-kill vm` 64+70 times in 2 runs without any fault and without restarting qubes-firewall.
I'm still not sure how to properly get the Xenstore token from `client.wait` (currently a hard-coded value...), and where it's best to call `unwatch`. @talex5 or @hannesm, do you have some ideas on those points?
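On the token question: the log above shows tokens like `4:xs_client.wait` and `10:xs_client.wait`, i.e. a per-wait counter plus a fixed suffix. A hypothetical sketch of one way to avoid hard-coding it (not the real xs_client_lwt API; `register`/`unregister` are stand-ins I invented): build the token where the watch is registered and keep it next to the path, so whatever later unwatches gets the exact pair back.

```ocaml
type watch_handle = { path : string; token : string }

let next_id = ref 0

(* Register a watch and return the handle needed to undo it later. *)
let register path =
  incr next_id;
  let token = Printf.sprintf "%d:xs_client.wait" !next_id in
  Printf.printf "Watch %s %s\n" path token;
  { path; token }

let unregister { path; token } =
  Printf.printf "Unwatch %s %s\n" path token

let () =
  let h = register "backend/vif/141/0/ip" in
  (* ... client runs, then disconnects ... *)
  unregister h
```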
Update: To my understanding, a Watch is up for `backend/vif/DOMID` between `Read backend/vif/DOMID/DEVICEID/frontend` and `Read /local/domain/DOMID/device/vif/DEVICEID/mac`. The other Watch comes just after that, and then both Watches (`backend/vif/DOMID` and `backend/vif/DOMID/DEVICEID/ip`) are currently silently dropped at https://github.com/mirage/ocaml-xenstore/blob/2168f073ab5f0d8a609db2bc8c52d67fcd814398/client_lwt/xs_client_lwt.ml#L193 when the client VM is disconnecting.
I'm not sure how it is supposed to behave and whether some sort of unwatch is expected to happen without explicitly calling unwatch. I'll continue my investigation.
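As a rough illustration of what "silently dropped" means here (a toy model, not the real dispatch code at that link): incoming watch events are matched to waiters by token, and an event whose token no longer has a waiter is simply ignored, while the watch itself stays registered on the xenstore side until something sends an Unwatch.

```ocaml
let waiters : (string, string -> unit) Hashtbl.t = Hashtbl.create 7

let on_watch_event ~path ~token =
  match Hashtbl.find_opt waiters token with
  | Some callback -> callback path
  | None -> ()  (* no waiter left: event dropped, but the watch still exists server-side *)

let () =
  Hashtbl.replace waiters "10:xs_client.wait"
    (fun path -> Printf.printf "woke waiter for %s\n" path);
  on_watch_event ~path:"/local/domain/141/device/vif/0/state" ~token:"10:xs_client.wait";
  (* The client VM goes away and its waiter is removed, but nothing unwatched: *)
  Hashtbl.remove waiters "10:xs_client.wait";
  on_watch_event ~path:"/local/domain/141/device/vif/0/state" ~token:"10:xs_client.wait"
```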
Would you mind giving #160 a try? :) (See the commit message if the reasoning is of interest.)
No, unfortunately I don't understand xenstore and ocaml-xenstore (and I don't feel like modifying the implementation of "wait" to suit our needs; I tried hard and failed).
This has been fixed in the 0.8.3 release. Thanks for your report, and please re-open if you see a regression.
(Sorry for the late response.) With a fresh build (~0.8.4) I can no longer reproduce the problem. A test VM did ~135 client restarts over the last few hours without crashing. Thanks!
This is most likely the same issue as https://github.com/mirage/qubes-mirage-firewall/issues/35, but I am creating a fresh one instead of reopening it because the backtrace is very different due to all the changes that happened in the last 4 (!) years.
repro steps:
The E2BIG right after add-client-vif suggests it's still the "xenstore subscriptions leaking" issue. See https://github.com/mirage/qubes-mirage-firewall/issues/35 and the various issues linked from it.