mirage / qubes-mirage-firewall

A Mirage firewall VM for QubesOS
208 stars 28 forks source link

DNS pb with v0.8.2 #158

Closed palainp closed 1 year ago

palainp commented 1 year ago

First reported by fiftyfourthparallel on https://forum.qubes-os.org/t/mirage-firewall-0-8-2-broken-new-users-should-install-0-8-1/14566 It seems that we now have an issue with DNS. I'll try to reproduce the scenario to understand what is the pb root cause.

hannesm commented 1 year ago

What changed between 0.8.1 and 0.8.2 is mainly 93b92c041bc3a9d243f9e1f674980868f5f56d07 (evenually f2d3faf1da0a12a535df5505964f70115d70a851). I'll try to wrap my head around these commits (and the connected ones in mirage-nat) again.

palainp commented 1 year ago

I have the following in my test VM:

$ qvm-firewall test list
NO  ACTION  HOST            PROTOCOL  PORT(S)  SPECIAL TARGET  ICMP TYPE  EXPIRE  COMMENT
0   accept  www.google.com  tcp       443      -               -          -       -
1   accept  -               -         -        dns             -          -       -
2   accept  -               icmp      -        -               -          -       -
3   drop    -               -         -        -               -          -       -

I can reproduce the crash that with the 0.8.2 binary and with the current robur.coop build. I'll try to figure out what is the issue.

palainp commented 1 year ago

It seems to be stuck resolving (with name as www.google.com): https://github.com/mirage/qubes-mirage-firewall/blob/b09acdeec2d91b8b9767f39b5752de3addd8f17d/rules.ml#L54

hannesm commented 1 year ago

Would you mind to compile and try the debug-dns branch of my fork (hannesm/qubes-mirage-firewall.git)? and provide the output here...

palainp commented 1 year ago

You spotted the right thing, it loops with:

[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 41277 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 1739
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 1739 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 58504
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 58504 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 48392
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 48392 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 64865
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 64865 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 54812
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 54812 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 51024
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 51024 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 28598
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 28598 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 50525
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 50525 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 57150
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] nat says UDP port 57150 is not free
[2022-11-08 18:15:03] 2022-11-08 17:15:03 -00:00: INF [my-nat] trying UDP port 17536
hannesm commented 1 year ago

thanks. would you mind to try mirage-nat from my fork in the fix-is-port-free branch? could be done with qubes-firewall at the main branch (no need for debug) ;)

hannesm commented 1 year ago

Still there's some issue, even with the above fix... namely the resouce "UDP source ports of the firewall" needs to be shared amongst "NAT" and "DNS". This is done by:

Whenever a new port is picked, NAT uses pick_free_port which picks a random one and avoids the udp_dns and last_resourt_port -- the NAT implementation furthermore ensures that the picked one is good indeed.

Whenever a fresh DNS port is used, free_udp_port is used that asks the NAT implementation (after pick_Free_port) whether the port is not in use.

The issue now is that free_udp_port may loop (nearly) infinitely. Somehow the last_resort_port should be used more frequently / explicitly: after N (10?) tries of a not-used-by-dns port whether NAT uses it, the last_resort should be taken. There's no need to ask NAT whether that one is free. When there's confirmation that the mirage-nat change leads to a working dns-resolving firewall again, I'll re-structure the free_udp_port code. :)

hannesm commented 1 year ago

I'll re-structure the free_udp_port code.

See #159

palainp commented 1 year ago

The #159 (and with or without the mirage-nat patch) solve the burn cpu loop, but I now can't get access to www.google.com:443. I'll try to investigate why soon.

hannesm commented 1 year ago

Hmm, maybe try with google.com and www.google.com.... They seem to be different IPs and may refer to each other.

Or use a simpler setup, and test with mirage.io :D

palainp commented 1 year ago

I can see in the log (I simplified the rule to accept any proto/any port for mirage.io):

[2022-11-08 19:25:44] 2022-11-08 18:25:44 -00:00: WRN [rules] Ignoring rule 0   mirage.io   accept, could not resolve
[2022-11-08 19:25:44] 2022-11-08 18:25:44 -00:00: WRN [rules] Ignoring rule 0   mirage.io   accept, could not resolve
[2022-11-08 19:25:46] 2022-11-08 18:25:46 -00:00: WRN [rules] Ignoring rule 0   mirage.io   accept, could not resolve
[2022-11-08 19:25:46] 2022-11-08 18:25:46 -00:00: WRN [firewall] Failed to add NAT rewrite rule: Too many retries (IPv4 packet 10.137.0.10 -> 10.139.1.1: id a3b7, off 16384 proto 6, ttl 64, options 
[2022-11-08 19:25:46]  TCP packet seq=3919011120 acknum=0 ack=false rst=false syn=true fin=false win=64240 options=
[2022-11-08 19:25:46] [MSS=1460; Noop; Noop; SACK_ok; Noop; Window>> 7] with payload )
[2022-11-08 19:25:48] 2022-11-08 18:25:48 -00:00: WRN [rules] Ignoring rule 0   mirage.io   accept, could not resolve
[2022-11-08 19:25:48] 2022-11-08 18:25:48 -00:00: WRN [firewall] Failed to add NAT rewrite rule: Too many retries (IPv4 packet 10.137.0.10 -> 10.139.1.1: id d5ee, off 16384 proto 6, ttl 64, options 
[2022-11-08 19:25:48]  TCP packet seq=3694308136 acknum=0 ack=false rst=false syn=true fin=false win=64240 options=
[2022-11-08 19:25:48] [MSS=1460; Noop; Noop; SACK_ok; Noop; Window>> 7] with payload )
[2022-11-08 19:25:49] 2022-11-08 18:25:49 -00:00: WRN [rules] Ignoring rule 0   mirage.io   accept, could not resolve
hannesm commented 1 year ago

The could not resolve makes me sad.

hannesm commented 1 year ago

Could you run that with debug level output? Then we'd see the error (from rules.ml):

      | Error (`Msg m) ->
        Log.warn (fun f -> f "Ignoring rule %a, could not resolve" Q.pp_rule rule);
        Log.debug (fun f -> f "%s" m);
palainp commented 1 year ago

The error message is DNS request timeout (I can see multiple Resolving mirage.io messages then after the timeout that many pairs of Ignoring+timeout error message).

palainp commented 1 year ago

Oops I'm faulty there, I've left some firewall rules on my mirage-fw too, that explain why it couldn't resolve some domains. So far with your path on mirage-nat and with the #159 PR it works as intended! Many Thanks @hannesm!

hannesm commented 1 year ago

thanks for the report, fixed in the 0.8.3 release.