Closed by talex5 4 years ago
Thanks for reporting this. I'd love to learn more about it. Is there a way we can reproduce this bug? Can you log incoming network packets before the crash?
I added some logging around setting up new clients after the last crash, but it hasn't happened again since then :-/
OK, it crashed again. Here's the debugging I added:
https://github.com/mirage/qubes-mirage-firewall/compare/master...talex5:debug?expand=1
This was the last thing in the log:
2020-05-26 03:12:40 -00:00: INF [client_net] Client 9 (IP: 10.137.0.12) ready
2020-05-26 03:12:40 -00:00: INF [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
2020-05-26 03:12:40 -00:00: INF [client_net] Running qubesdb_updater thread...
2020-05-26 03:12:40 -00:00: INF [client_net] Getting rules...
2020-05-26 03:12:40 -00:00: INF [client_net] New firewall rules for 10.137.0.12
0 any accept
In particular, it didn't get as far as Router.add_client.
I added some more debugging:
let remove_connections t ports ip =
  Log.info (fun f -> f "remove_connections: enter");
  let freed_ports = Nat.remove_connections t.table ip in
  Log.info (fun f -> f "tcp");
Now after it crashed the log ended with:
2020-05-29 04:45:31 -00:00: INF [client_net] Client 7 (IP: 10.137.0.8) ready
2020-05-29 04:45:31 -00:00: INF [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
2020-05-29 04:45:31 -00:00: INF [client_net] Running qubesdb_updater thread...
2020-05-29 04:45:31 -00:00: INF [client_net] calling got_new_commit
2020-05-29 04:45:31 -00:00: INF [client_net] setting rules
2020-05-29 04:45:31 -00:00: INF [client_net] Getting rules...
2020-05-29 04:45:31 -00:00: INF [client_net] New firewall rules for 10.137.0.8
0 any accept
2020-05-29 04:45:31 -00:00: INF [client_net] remove_connections
2020-05-29 04:45:31 -00:00: INF [my-nat] remove_connections: enter
So I guess the problem is in Nat.remove_connections. Perhaps this uses a lot of stack when there are many connections?
Hi @talex5, thanks for the detailed log output, very helpful! We looked into remove_connections and were able to reduce the stack size.
Can you try pinning git+http://github.com/mirage/mirage-nat.git#no-stack-overflow and see if it fixes your problem?
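For anyone following along, here is a minimal sketch of the general kind of change that reduces stack use when walking a large set of connections: replacing a non-tail-recursive traversal with an accumulator-based one. This is not the actual mirage-nat change. The conn type and both ports_of functions are hypothetical names made up for illustration only.

```ocaml
(* Hypothetical illustration: none of these names come from mirage-nat. *)

(* A connection is modelled here as (client IP, port). *)
type conn = string * int

(* Not tail-recursive: for every matching entry the cons happens after the
   recursive call returns, so each match keeps a stack frame alive. With
   many connections this can exhaust a small stack. *)
let rec ports_of_naive (ip : string) (conns : conn list) : int list =
  match conns with
  | [] -> []
  | (ip', port) :: rest ->
    if String.equal ip ip' then port :: ports_of_naive ip rest
    else ports_of_naive ip rest

(* Tail-recursive: results are collected in an accumulator, so stack use
   stays constant regardless of how many connections there are. *)
let ports_of (ip : string) (conns : conn list) : int list =
  let rec go acc = function
    | [] -> List.rev acc
    | (ip', port) :: rest ->
      go (if String.equal ip ip' then port :: acc else acc) rest
  in
  go [] conns

(* Usage: the accumulator version handles a large table without deep recursion. *)
let () =
  let conns = List.init 1_000_000 (fun i -> ("10.137.0.12", i)) in
  assert (List.length (ports_of "10.137.0.12" conns) = 1_000_000)
```

Both functions return the same list for the same input; only the stack behaviour differs.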
OK, testing now with RUN opam pin -n mirage-nat 'https://github.com/mirage/mirage-nat.git#no-stack-overflow' added to the Dockerfile.
Seems to be working fine now - thanks!
This looks the same as the problem in https://github.com/mirage/qubes-mirage-firewall/pull/96#issuecomment-631361687, but this time reproduced using the final 0.7 release binary (hash 4f4456b5fe7c8ae1ba2f6934cf89749cf6aae9a90cce899cf744c89d311467a3) and with a 64MB memory allocation. I haven't seen this happen before #96 was merged, and it has now happened twice in three days.
xl dmesg shows:
The last thing shown in the guest log was: