Closed palainp closed 2 years ago
What changed between 0.8.1 and 0.8.2 is mainly 93b92c041bc3a9d243f9e1f674980868f5f56d07 (evenually f2d3faf1da0a12a535df5505964f70115d70a851). I'll try to wrap my head around these commits (and the connected ones in mirage-nat) again.
I have the following in my test VM:
$ qvm-firewall test list
NO ACTION HOST PROTOCOL PORT(S) SPECIAL TARGET ICMP TYPE EXPIRE COMMENT
0 accept www.google.com tcp 443 - - - -
1 accept - - - dns - - -
2 accept - icmp - - - - -
3 drop - - - - - - -
I can reproduce the crash that with the 0.8.2 binary and with the current robur.coop build. I'll try to figure out what is the issue.
It seems to be stuck resolving (with name as www.google.com
): https://github.com/mirage/qubes-mirage-firewall/blob/b09acdeec2d91b8b9767f39b5752de3addd8f17d/rules.ml#L54
Would you mind to compile and try the debug-dns
branch of my fork (hannesm/qubes-mirage-firewall.git)? and provide the output here...
You spotted the right thing, it loops with:
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 41277 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 1739
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 1739 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 58504
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 58504 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 48392
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 48392 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 64865
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 64865 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 54812
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 54812 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 51024
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 51024 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 28598
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 28598 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 50525
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 50525 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 57150
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 57150 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 17536
thanks. would you mind to try mirage-nat from my fork in the fix-is-port-free
branch? could be done with qubes-firewall at the main branch (no need for debug) ;)
Still there's some issue, even with the above fix... namely the resouce "UDP source ports of the firewall" needs to be shared amongst "NAT" and "DNS". This is done by:
Whenever a new port is picked, NAT uses pick_free_port
which picks a random one and avoids the udp_dns and last_resourt_port -- the NAT implementation furthermore ensures that the picked one is good indeed.
Whenever a fresh DNS port is used, free_udp_port is used that asks the NAT implementation (after pick_Free_port) whether the port is not in use.
The issue now is that free_udp_port may loop (nearly) infinitely. Somehow the last_resort_port should be used more frequently / explicitly: after N (10?) tries of a not-used-by-dns port whether NAT uses it, the last_resort should be taken. There's no need to ask NAT whether that one is free. When there's confirmation that the mirage-nat change leads to a working dns-resolving firewall again, I'll re-structure the free_udp_port
code. :)
I'll re-structure the
free_udp_port
code.
See #159
The #159 (and with or without the mirage-nat patch) solve the burn cpu loop, but I now can't get access to www.google.com:443
. I'll try to investigate why soon.
Hmm, maybe try with google.com
and www.google.com
.... They seem to be different IPs and may refer to each other.
Or use a simpler setup, and test with mirage.io
:D
I can see in the log (I simplified the rule to accept any proto/any port for mirage.io):
[2022-11-08 19:25:44] 2022-11-08 18:25:44 -00:00: WRN [rules] Ignoring rule 0 mirage.io accept, could not resolve
[2022-11-08 19:25:44] 2022-11-08 18:25:44 -00:00: WRN [rules] Ignoring rule 0 mirage.io accept, could not resolve
[2022-11-08 19:25:46] 2022-11-08 18:25:46 -00:00: WRN [rules] Ignoring rule 0 mirage.io accept, could not resolve
[2022-11-08 19:25:46] 2022-11-08 18:25:46 -00:00: WRN [firewall] Failed to add NAT rewrite rule: Too many retries (IPv4 packet 10.137.0.10 -> 10.139.1.1: id a3b7, off 16384 proto 6, ttl 64, options
[2022-11-08 19:25:46] TCP packet seq=3919011120 acknum=0 ack=false rst=false syn=true fin=false win=64240 options=
[2022-11-08 19:25:46] [MSS=1460; Noop; Noop; SACK_ok; Noop; Window>> 7] with payload )
[2022-11-08 19:25:48] 2022-11-08 18:25:48 -00:00: WRN [rules] Ignoring rule 0 mirage.io accept, could not resolve
[2022-11-08 19:25:48] 2022-11-08 18:25:48 -00:00: WRN [firewall] Failed to add NAT rewrite rule: Too many retries (IPv4 packet 10.137.0.10 -> 10.139.1.1: id d5ee, off 16384 proto 6, ttl 64, options
[2022-11-08 19:25:48] TCP packet seq=3694308136 acknum=0 ack=false rst=false syn=true fin=false win=64240 options=
[2022-11-08 19:25:48] [MSS=1460; Noop; Noop; SACK_ok; Noop; Window>> 7] with payload )
[2022-11-08 19:25:49] 2022-11-08 18:25:49 -00:00: WRN [rules] Ignoring rule 0 mirage.io accept, could not resolve
The could not resolve
makes me sad.
Could you run that with debug level output? Then we'd see the error (from rules.ml):
| Error (`Msg m) ->
Log.warn (fun f -> f "Ignoring rule %a, could not resolve" Q.pp_rule rule);
Log.debug (fun f -> f "%s" m);
The error message is DNS request timeout
(I can see multiple Resolving mirage.io
messages then after the timeout that many pairs of Ignoring+timeout error message).
Oops I'm faulty there, I've left some firewall rules on my mirage-fw too, that explain why it couldn't resolve some domains. So far with your path on mirage-nat and with the #159 PR it works as intended! Many Thanks @hannesm!
thanks for the report, fixed in the 0.8.3 release.
First reported by fiftyfourthparallel on https://forum.qubes-os.org/t/mirage-firewall-0-8-2-broken-new-users-should-install-0-8-1/14566 It seems that we now have an issue with DNS. I'll try to reproduce the scenario to understand what is the pb root cause.