opnsense / ports

OPNsense ports on top of FreeBSD
https://opnsense.org/

FRR OSPF neighbors not found over Routed IPSEC after reboot #126

Closed: wobblywob closed this issue 3 years ago

wobblywob commented 3 years ago

[x] I have read the contributing guidelines at https://github.com/opnsense/plugins/blob/master/CONTRIBUTING.md

[x] I have searched the existing issues and I'm convinced that mine is new.

[x] The title contains the plugin to which this issue belongs

Describe the bug

After a firewall reboot, the IPSEC tunnels are up but there are no neighbors in FRR -> OSPF when using routed IPSEC. The neighbors usually appear after restarting the IPSEC service on the opposing device. I didn't have this problem in OPNsense 21.1.2.

To Reproduce

Steps to reproduce the behavior:

  1. Set up routed IPSEC between 3 locations (in my case)
  2. Set up FRR OSPF routing
  3. Tunnels are up, neighbors are found, traffic is flowing
  4. Reboot the firewall
  5. No OSPF neighbors via IPSEC, no traffic is flowing, no routes
  6. Restart the IPSEC service on the other firewalls
  7. OSPF neighbors are found and traffic is flowing

Expected behavior

OSPF neighbors should be found and adjacencies formed.

Relevant log files

I don't know what to look for.

Additional context

I recently migrated 3 pfSense firewalls (A, B, C) to OPNsense; FRR OSPF previously worked between them without issues. The first OPNsense box (A) was set up with 21.1.2 and didn't have this issue. Device B was finicky with FRR and was set up with 21.1.4, but the issue became very clear after migrating the last one, C, also on 21.1.4.

After a reboot, downstream OSPF neighbors within the same network are found.

Environment

OPNsense 21.1.4 (amd64, OpenSSL), Proxmox 5.4.73 VM, VirtIO drivers

johnlaur commented 3 years ago

I discovered the cause of this bug. Routed IPSec tunnels are set up as point-to-point interfaces with a netmask of 0xfffffffc (/30); however, due to a problem in FRR, the interface address is parsed as having a /32 subnet mask. This causes FRR to treat the interface as UNNUMBERED, and in protocols such as OSPF it advertises the interface ID (ifindex) as the peer address instead of the interface address. As a result, the far-end peer receives the LSAs into its database with a bad peer address, and zebra is unable to install the routes because the next-hop address is bogus.
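
As a minimal sketch of the arithmetic involved (Python standard library only; the helper names are hypothetical, and the "/32 means UNNUMBERED" rule is a simplified model of the symptom described above, not FRR's actual code), the hex netmask reported by ifconfig converts to a /30, and only a /32 parse pushes the interface into the UNNUMBERED case:

```python
import ipaddress


def netmask_to_prefixlen(hex_mask: str) -> int:
    """Convert an ifconfig-style hex netmask such as '0xfffffffc' to a prefix length."""
    mask = int(hex_mask, 16)
    if not 0 <= mask <= 0xFFFFFFFF:
        raise ValueError(f"not a 32-bit netmask: {hex_mask}")
    prefixlen = bin(mask).count("1")
    # A valid netmask is a contiguous run of leading 1 bits.
    if mask != (0xFFFFFFFF << (32 - prefixlen)) & 0xFFFFFFFF:
        raise ValueError(f"non-contiguous netmask: {hex_mask}")
    return prefixlen


def ospf_address_line(address: str, prefixlen: int) -> str:
    """Simplified model (assumption, not FRR source): a /32 is shown as UNNUMBERED."""
    if prefixlen == 32:
        return "UNNUMBERED"
    return f"Internet Address {ipaddress.ip_interface(f'{address}/{prefixlen}')}"


if __name__ == "__main__":
    plen = netmask_to_prefixlen("0xfffffffc")
    print(plen)
    print(ipaddress.ip_interface(f"10.0.1.1/{plen}").network)
    print(ospf_address_line("10.0.1.1", plen))  # what the /30 should produce
    print(ospf_address_line("10.0.1.1", 32))    # what a /32 parse produces
```

Run as a script, this prints 30, 10.0.1.0/30, `Internet Address 10.0.1.1/30` and `UNNUMBERED`; the second line matches the connected route 10.0.1.0/30 that appears in the routing table later in this thread.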

I have not been able to find a satisfactory workaround for this problem. I see no way to force the interface address in FRR on either the near end or the far end. If anyone has any ideas, I'd like to hear them.

The error can be seen in the vtysh command output below: FRR detects the interface as UNNUMBERED instead of using the correct address, 10.0.1.1/30.

This pull request fixes the problem; however, it is not getting much traction. If OPNsense has a mechanism to distribute its own patched packages, it might be worthwhile to include this patch until it is upstreamed.

root@opnsense-vpn:~ # vtysh -c "show ip ospf interface ipsec1"
ipsec1 is up
  ifindex 7, MTU 1400 bytes, BW 0 Mbit <UP,POINTOPOINT,RUNNING,MULTICAST>
  This interface is 👉 UNNUMBERED, Area 0.0.0.1
  MTU mismatch detection: enabled
  Router ID 10.0.0.1, Network Type POINTOPOINT, Cost: 10
  Transmit Delay is 1 sec, State Point-To-Point, Priority 1
  No backup designated router on this network
  Multicast group memberships: OSPFAllRouters
  Timer intervals configured, Hello 10s, Dead 40s, Wait 40s, Retransmit 5
    Hello due in 9.493s
  Neighbor Count is 1, Adjacent neighbor count is 1

root@opnsense-vpn:~ # ifconfig ipsec1
ipsec1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> metric 0 mtu 1400
    tunnel inet 1.2.3.4 --> 4.3.2.1
    inet6 aaaa::bbbb:cccc:dddd:eeee%ipsec1 prefixlen 64 scopeid 0x7
    inet 10.0.1.1 --> 10.0.1.2 netmask 0xfffffffc
    groups: ipsec
    reqid: 1
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
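
The ifconfig output shows that the kernel itself has the correct /30. A sketch for pulling the local address, peer and netmask out of such a line, so the kernel's view can be compared with what vtysh reports, might look like this. The regular expression is an assumption based only on the sample above and handles IPv4 point-to-point lines only.

```python
import ipaddress
import re

# Matches ifconfig point-to-point address lines such as:
#   inet 10.0.1.1 --> 10.0.1.2 netmask 0xfffffffc
INET_P2P = re.compile(
    r"^\s*inet\s+(?P<local>[\d.]+)\s+-->\s+(?P<peer>[\d.]+)"
    r"\s+netmask\s+(?P<mask>0x[0-9a-fA-F]{8})"
)


def parse_p2p_inet(line: str):
    """Return (local IPv4Interface, peer IPv4Address), or None if the line does not match."""
    m = INET_P2P.match(line)
    if m is None:
        return None
    # Render the hex mask as a dotted quad so ipaddress can apply it.
    netmask = ipaddress.IPv4Address(int(m.group("mask"), 16))
    local = ipaddress.IPv4Interface(f"{m.group('local')}/{netmask}")
    peer = ipaddress.IPv4Address(m.group("peer"))
    return local, peer


if __name__ == "__main__":
    local, peer = parse_p2p_inet("    inet 10.0.1.1 --> 10.0.1.2 netmask 0xfffffffc")
    print(local, local.network, peer in local.network)
```

For the ipsec1 line this prints `10.0.1.1/30 10.0.1.0/30 True`: the peer sits inside the /30, which is what a numbered point-to-point interface should look like.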

fichtner commented 3 years ago

@johnlaur Hi and thanks! At first glance that makes sense to me. Did you test the patch in production? If @mimugmail is ok we can pick up the patch for our port.

mimugmail commented 3 years ago

Sure, lets do this :)

fichtner commented 3 years ago

@mimugmail judging from the WireGuard changes regarding POINTTOPOINT this won't make it happy yet?

fichtner commented 3 years ago

@johnlaur are you using OpenSSL or LibreSSL? I want to provide a test package. :)

fichtner commented 3 years ago

In any case, for OpenSSL:

# pkg add -f https://pkg.opnsense.org/FreeBSD:12:amd64/snapshots/latest/All/frr7-7.4_5.txz

For LibreSSL:

# pkg add -f https://pkg.opnsense.org/FreeBSD:12:amd64/snapshots/libressl/All/frr7-7.4_5.txz
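
The two commands differ only in the snapshot repository segment of the URL (`latest` for OpenSSL, `libressl` for LibreSSL). A hypothetical helper that builds the URL for a given flavour, using only the base path and file name from the commands above, could look like this; it only prints the command and does not install anything.

```python
# Hypothetical helper; the base URL and file name are copied from the commands above.
BASE_URL = "https://pkg.opnsense.org/FreeBSD:12:amd64/snapshots"
PACKAGE = "frr7-7.4_5.txz"

# The SSL flavour only changes which snapshot repository the package comes from.
REPO_FOR_FLAVOUR = {"openssl": "latest", "libressl": "libressl"}


def frr_package_url(flavour: str) -> str:
    """Return the test-package URL for 'OpenSSL' or 'LibreSSL' (case-insensitive)."""
    try:
        repo = REPO_FOR_FLAVOUR[flavour.lower()]
    except KeyError:
        raise ValueError(f"unknown SSL flavour: {flavour!r}") from None
    return f"{BASE_URL}/{repo}/All/{PACKAGE}"


if __name__ == "__main__":
    print("pkg add -f " + frr_package_url("OpenSSL"))
```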

mimugmail commented 3 years ago

> @mimugmail judging from the WireGuard changes regarding POINTTOPOINT this won't make it happy yet?

No. From what I read, the report says this only happens after a reboot. If it happens every time, I can try to reproduce it in a lab, but I'm sure it will work.

wobblywob commented 3 years ago

It looks like @johnlaur got to the bottom of this. I don't know if it helps at this point, but I've noticed that there are no entries in IPSEC -> Security Association Database; they appear immediately after restarting the IPSEC service.

johnlaur commented 3 years ago

I am using OpenSSL, but I do not think it matters in this particular case; the problem appears to be related to the way FRR picks up the interface addresses. I was not able to test the patch myself because I lack experience with building OPNsense.

Today I tried to pivot to using a WireGuard tunnel instead. I noticed that it was being detected as a BROADCAST link, so I manually set it to POINTTOPOINT. In either mode, FRR does not appear to send any OSPF hello packets over the WireGuard interface at all, and I am not sure why. I would appreciate any recent information on making OSPF work over a WireGuard tunnel, as I am simply unable to get it to work. I am not sure whether this is a related issue.

Here is what the WireGuard point-to-point link looks like to FRR when forced to be treated as point-to-point:

root@opnsense-vpn:~ # vtysh -c "show ip ospf interface wg1"
wg1 is up
  ifindex 9, MTU 1420 bytes, BW 0 Mbit <UP,BROADCAST,RUNNING>
  Internet Address 10.0.1.1/30, Broadcast 10.0.1.3, Area 0.0.0.1
  MTU mismatch detection: enabled
  Router ID 10.0.0.1, Network Type POINTOPOINT, Cost: 10
  Transmit Delay is 1 sec, State Point-To-Point, Priority 1
  No backup designated router on this network
  Multicast group memberships: <None>
  Timer intervals configured, Hello 10s, Dead 40s, Wait 40s, Retransmit 5
    Hello due in 0.950s
  Neighbor Count is 0, Adjacent neighbor count is 0

Running tcpdump on the wg1 interface shows that OSPF hello packets are coming in from the peer router; however, OPNsense replies with ICMP protocol unreachable. I see no OSPF hellos from FRR going out on the interface:

root@opnsense-vpn:~ # tcpdump -i wg1 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wg1, link-type NULL (BSD loopback), capture size 262144 bytes
14:55:38.911873 IP 10.0.1.2 > 224.0.0.5: OSPFv2, Hello, length 44
14:55:38.911907 IP 10.0.1.1 > 10.0.1.2: ICMP 224.0.0.5 protocol 89 unreachable, length 72

There do not seem to be any firewall rules getting in the way, and I have added 223.0.0.5/32 to the allowed IPs on the WG interface. To be honest, debugging this on OPNsense is extremely time-consuming and frustrating for me.
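
For reference, the ICMP reply in the capture can be decoded mechanically: IP protocol 89 is OSPF, and 224.0.0.5 is the AllSPFRouters link-local multicast group. A minimal sketch (the regular expression is an assumption based on the single tcpdump line shown, and the protocol table covers only the two numbers relevant here):

```python
import ipaddress
import re

# IANA-assigned IP protocol numbers relevant to this capture.
IP_PROTOCOLS = {1: "ICMP", 89: "OSPF"}

# Local network control block; 224.0.0.5 is AllSPFRouters, 224.0.0.6 is AllDRouters.
LOCAL_MULTICAST = ipaddress.ip_network("224.0.0.0/24")

# Matches tcpdump's summary of an ICMP protocol-unreachable message.
UNREACH = re.compile(r"ICMP (?P<dst>[\d.]+) protocol (?P<proto>\d+) unreachable")


def explain_unreachable(line: str):
    """Return (original destination, protocol name), or None if the line does not match."""
    m = UNREACH.search(line)
    if m is None:
        return None
    proto = int(m.group("proto"))
    return ipaddress.ip_address(m.group("dst")), IP_PROTOCOLS.get(proto, str(proto))


if __name__ == "__main__":
    line = ("14:55:38.911907 IP 10.0.1.1 > 10.0.1.2: "
            "ICMP 224.0.0.5 protocol 89 unreachable, length 72")
    dst, proto = explain_unreachable(line)
    print(dst, proto, dst in LOCAL_MULTICAST)
```

For the captured line this prints `224.0.0.5 OSPF True`, i.e. the host is rejecting OSPF packets addressed to AllSPFRouters. That appears consistent with the empty "Multicast group memberships" line in the vtysh output above, though the capture alone does not show why the group was not joined.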

fichtner commented 3 years ago

Open/Libre question is just for the right package to install. Did it work or not?


mimugmail commented 3 years ago

OSPF via WG only works on 21.1.5 with the -kmod package, or on any version before 21.1.3.

johnlaur commented 3 years ago

I tested this again with ipsec and https://pkg.opnsense.org/FreeBSD:12:amd64/snapshots/latest/All/frr7-7.4_5.txz from @fichtner.

I still had to force the network type to POINTTOPOINT, but at least the far end happily receives the advertisements and applies the routes! However, the near end (OPNsense) shows the routes just fine but does not install them in the local routing table for some reason. See the vtysh output below:

opnsense-vpn# sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

K>* 0.0.0.0/0 [0/0] via 1.1.1.1, 00:14:01
C>* 1.1.1.0/24 [0/1] is directly connected, vmx0, 00:14:01
C>* 10.0.1.0/30 [0/1] is directly connected, ipsec1, 00:14:01
O>* 10.0.1.2/32 [110/10] is directly connected, ipsec1, weight 1, 00:14:01
O   10.0.2.0/24 [110/20] via 10.0.1.2, ipsec1 👉inactive, weight 1, 00:11:10
O   10.0.3.0/24 [110/20] via 10.0.1.2, ipsec1 👉inactive, weight 1, 00:11:10

The routes for 10.0.2.0/24 and 10.0.3.0/24 are computed correctly by FRR on OPNsense, but they are not installed in the routing table, even though the two preceding routes should be sufficient to allow them to be installed. They are correctly received by the downstream OSPF peers of OPNsense, and those routers install routes to the OPNsense node (which, as seen above, does not have a route to the subnets). I apologize, but I am at my limit for debugging this: I explored some of the debug options in FRR, but I could not figure out how to change the configuration or runtime to produce debug output that would let me trace the problem further than saying "the routes are seen but not applied".
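
One way to spot this state mechanically is to look for OSPF entries without the `*` (FIB route) flag from the legend. A sketch that parses plain `show ip route` text; the column layout is assumed from the output above, and the sample copies those lines without the emoji markers:

```python
import re

# Protocol code, optional '>' (selected) / '*' (FIB) / 'q' / 'r' flags, then the prefix.
ROUTE_LINE = re.compile(
    r"^(?P<proto>[A-Za-z])(?P<flags>[>*qr]*)\s+(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)"
)

SAMPLE = """\
K>* 0.0.0.0/0 [0/0] via 1.1.1.1, 00:14:01
C>* 1.1.1.0/24 [0/1] is directly connected, vmx0, 00:14:01
C>* 10.0.1.0/30 [0/1] is directly connected, ipsec1, 00:14:01
O>* 10.0.1.2/32 [110/10] is directly connected, ipsec1, weight 1, 00:14:01
O   10.0.2.0/24 [110/20] via 10.0.1.2, ipsec1 inactive, weight 1, 00:11:10
O   10.0.3.0/24 [110/20] via 10.0.1.2, ipsec1 inactive, weight 1, 00:11:10
"""


def routes_not_in_fib(show_ip_route: str, proto: str = "O") -> list:
    """Return prefixes of the given protocol that lack the '*' (FIB route) flag."""
    missing = []
    for line in show_ip_route.splitlines():
        m = ROUTE_LINE.match(line)
        if m and m.group("proto") == proto and "*" not in m.group("flags"):
            missing.append(m.group("prefix"))
    return missing


if __name__ == "__main__":
    print(routes_not_in_fib(SAMPLE))
```

For the sample this returns `['10.0.2.0/24', '10.0.3.0/24']`, the two routes marked inactive. The indented legend lines never match, because the pattern requires the protocol code in the first column.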

I will try OSPF with WireGuard again using the kernel module; thank you very much for the heads-up. My first month trying OPNsense has not been confidence-inspiring, with routing protocols basically not working on anything ☹️ I do know that these are mostly upstream FRR issues, though, and luckily a couple of well-placed static routes can keep things patched up for now...

mimugmail commented 3 years ago

Can you post screenshots of the Interface and Network tabs? (the details)

mimugmail commented 3 years ago

I set this up in a lab and it works great (192.168.10.0/24 is local and 192.168.11.0/24 is remote):

FW1.localdomain# sh ip ospf route
============ OSPF network routing table ============

============ OSPF router routing table =============
R    192.168.11.4          [10] area: 0.0.0.0, ASBR
                           via 10.251.251.2, ipsec1

============ OSPF external routing table ===========
N E2 192.168.11.0/24       [10/20] tag: 0
                           via 10.251.251.2, ipsec1

FW1.localdomain# sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

K>* 0.0.0.0/0 [0/0] via 10.10.12.1, 00:00:39
C>* 10.10.12.0/24 [0/1] is directly connected, vtnet0, 00:00:39
C>* 10.251.251.0/30 [0/1] is directly connected, ipsec1, 00:00:39
C>* 10.255.255.0/24 [0/1] is directly connected, vtnet2, 00:00:39
C>* 192.168.10.0/24 [0/1] is directly connected, vtnet1, 00:00:39
O>* 192.168.11.0/24 [110/20] via 10.251.251.2, ipsec1 onlink, weight 1, 00:00:16
FW1.localdomain# exit
root@FW1:~ # netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.10.12.1         UGS      vtnet0
10.10.12.0/24      link#1             U        vtnet0
10.10.12.101       link#1             UHS         lo0
10.251.251.1       link#8             UHS         lo0
10.251.251.2       link#8             UH       ipsec1
10.255.255.0/24    link#3             U        vtnet2
10.255.255.1       link#3             UHS         lo0
127.0.0.1          link#5             UH          lo0
192.168.10.0/24    link#2             U        vtnet1
192.168.10.1       link#2             UHS         lo0
192.168.11.0/24    10.251.251.2       UG1      ipsec1
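
To confirm that an OSPF route selected by zebra actually reached the kernel, the `netstat -nr` output can be cross-checked. The sketch below assumes the IPv4 column layout shown above (Destination, Gateway, Flags, Netif) and stops at an `Internet6:` section if one is present:

```python
NETSTAT_SAMPLE = """\
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.10.12.1         UGS      vtnet0
10.10.12.0/24      link#1             U        vtnet0
10.10.12.101       link#1             UHS         lo0
10.251.251.1       link#8             UHS         lo0
10.251.251.2       link#8             UH       ipsec1
10.255.255.0/24    link#3             U        vtnet2
10.255.255.1       link#3             UHS         lo0
127.0.0.1          link#5             UH          lo0
192.168.10.0/24    link#2             U        vtnet1
192.168.10.1       link#2             UHS         lo0
192.168.11.0/24    10.251.251.2       UG1      ipsec1
"""


def kernel_routes(netstat_output: str) -> dict:
    """Map destination -> (gateway, netif) for the IPv4 section of FreeBSD `netstat -nr`."""
    routes = {}
    in_ipv4 = False
    for line in netstat_output.splitlines():
        if line.startswith("Internet:"):
            in_ipv4 = True
            continue
        if line.startswith("Internet6:"):
            break
        fields = line.split()
        # Data rows: Destination, Gateway, Flags, Netif [, Expire]; skip the header row.
        if in_ipv4 and len(fields) >= 4 and fields[0] != "Destination":
            routes[fields[0]] = (fields[1], fields[3])
    return routes


if __name__ == "__main__":
    print(kernel_routes(NETSTAT_SAMPLE)["192.168.11.0/24"])
```

Here the lookup returns `('10.251.251.2', 'ipsec1')`, matching the `O>* 192.168.11.0/24 ... via 10.251.251.2, ipsec1` line from vtysh.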

fichtner commented 3 years ago

With patched version or without?


mimugmail commented 3 years ago

Unpatched. I also rebooted twice and everything was fine.

wobblywob commented 3 years ago

I've updated to OPNsense 21.1.5-amd64 OpenSSL, I've rebooted several times today and so far it looks like it works for me.

johnlaur commented 3 years ago

I believe that in my last report some other, unrelated problem was preventing FRR from applying the local routes; I am not sure it had anything to do with the interface being detected as unnumbered. I will need to set this up completely in isolation to explore it properly. I had originally tried OSPF over ipsec as a stopgap because OSPF wasn't working over WireGuard; however, thanks to advice from @mimugmail, I am now back to my original configuration of OSPF over wireguard-kmod (instead of wireguard-go), which behaves exactly as I would have expected. I do still have to force the interface to be treated as POINTTOPOINT, since this is not picked up automatically. (I am still using the patched FRR from @fichtner.)

In my testing of OSPF over ipsec, OSPF did not work properly on a routed ipsec interface with a /30 netmask because FRR detected the ipsec interface as UNNUMBERED, which is clearly demonstrated by running show ip ospf interface ipsec1 in vtysh on the near-end router. That output was not included in the test @mimugmail posted, nor was the output of show ip ospf route (or equivalent) from the far-end router; those were the only two places where the problem was visible to me. In my case, both routers saw the neighbors, and the near-end router showed OSPF routes, but the far end did not.

I should also mention that, even with this bug occurring, some OSPF implementations may be more tolerant of the bogus advertisements in which the adjacency is advertised with the ifindex instead of the interface address. In my case, the problem was seen with the far-end peer running FRR 7.3.1 on Linux, which did not tolerate the incorrect adjacency advertisements. This may very well work between two OPNsense boxes both running the buggy version of FRR; I have not tested this.
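
RFC 2328 (section 12.4.1.1) defines the Link Data field of a point-to-point link in a router-LSA: for a numbered link it carries the interface IP address, for an unnumbered link the MIB-II ifIndex. A small sketch of that encoding (the function names are hypothetical, and it does not model how any particular implementation computes next hops) shows what a peer would see if it reads the field as an address, using ifindex 7 from the ipsec1 output above:

```python
import ipaddress
import struct


def p2p_link_data(interface_address: str, ifindex: int, unnumbered: bool) -> bytes:
    """Encode the 4-byte Link Data of a point-to-point router-LSA link (RFC 2328 12.4.1.1)."""
    if unnumbered:
        # Unnumbered link: the interface's MIB-II ifIndex, in network byte order.
        return struct.pack("!I", ifindex)
    # Numbered link: the interface's own IPv4 address.
    return ipaddress.IPv4Address(interface_address).packed


def as_address(link_data: bytes) -> str:
    """Render Link Data the way a peer treating it as an IPv4 address would."""
    return str(ipaddress.IPv4Address(link_data))


if __name__ == "__main__":
    print(as_address(p2p_link_data("10.0.1.1", 7, unnumbered=False)))
    print(as_address(p2p_link_data("10.0.1.1", 7, unnumbered=True)))
```

The numbered case yields `10.0.1.1`; the unnumbered case yields `0.0.0.7`, which is not a usable next-hop address on the link.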

Still, my point is that FRR as currently distributed in OPNsense, without the patch, has a bug in correctly detecting interface addresses and types; whether it is the only bug, and whether the patch is a correct and complete fix, I cannot say.

Again, I'm sorry; I'm very new to OPNsense and still getting acquainted with much of the machinery under the hood. Please bear with me as I continue to learn how to debug and contribute more effectively. Thank you all very much for the engagement on this issue.

mimugmail commented 3 years ago

Can you post Screenshots of Interface and Network tab? (the details)

...

johnlaur commented 3 years ago

Yes; absolutely. Please give me a day or two to set up a better testbed environment.

Edit: Apologies, I have not yet had time to do this. It may be a while longer.

wobblywob commented 3 years ago

> In any case, for OpenSSL:
>
> # pkg add -f https://pkg.opnsense.org/FreeBSD:12:amd64/snapshots/latest/All/frr7-7.4_5.txz
>
> For LibreSSL:
>
> # pkg add -f https://pkg.opnsense.org/FreeBSD:12:amd64/snapshots/libressl/All/frr7-7.4_5.txz

Are these 2 patched packages still the relevant ones? I've upgraded OPNsense to 21.1.7, and again I have to restart the IPSEC service for FRR to apply the routes.

fichtner commented 3 years ago

The patch briefly broke 21.1.6 so I'm closing this issue now...