troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License

pimreg tunnel seems broken #186

Closed realdream closed 2 years ago

realdream commented 3 years ago

test env

host system: ubuntu 20.04
vhost system: ubuntu 20.04
pimd version: built from source at b41fb72156d (latest master as of now)

minimum test topology

[Image: topology diagram (drawio)]

modified pimd configure

disable-vifs
phyint br0 enable
phyint enp1s0 enable

abnormal phenomena

receiver 1 did not receive any message from sender 1 while

clues

when only run sender 1 & receiver 1

realdream commented 3 years ago

additional configuration

troglobit commented 3 years ago

I think I'm seeing your problem, again I'm testing this in CORE, so not exactly the same setup as you.

[Image: two-ospf-routers-iperf]

It is veeeeery slow to start, so definitely some problem still with the latest pimd code. Possibly related to #184, but the patch from that issue doesn't really help. As I suspected, not ready for release yet :-/

It's a bit quicker to get going if I let receiver 1 start first, and wait a couple of seconds before I start sender 1. In my case the first router (R1) sends to eth0, not pimreg, but the second router is stuck waiting for data on pimreg, while data is coming in on its eth0. So they seem to be out of sync.

troglobit commented 3 years ago

Interestingly this problem seems to be isolated to PIM-SM only, with PIM-SSM, (S,G) join in 232/8 range, it works almost perfectly. At least from what I can see.

Edit: because only PIM-SM uses the register tunnel ... :roll_eyes: ... sorry for the obvious comment. Good, however, that something works. There is still a lot to do (graft on the mrouted changes, read up on the RFCs) before I get around to diving into this issue more hands-on. So anyone who can help debug this particular one is more than welcome!
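
For anyone repeating the SM versus SSM comparison, it helps to confirm which range the test group falls in before drawing conclusions. Below is a minimal sketch in POSIX shell, assuming the default SSM range of 232.0.0.0/8 mentioned above. The helper name `is_ssm_group` and the two sample groups are made up for illustration and are not part of pimd.

```shell
#!/bin/sh
# Hypothetical helper, not part of pimd: succeed if the IPv4 group
# address starts with 232., i.e. lies in the default SSM range 232.0.0.0/8.
# It does not validate the rest of the address.
is_ssm_group() {
    case "$1" in
        232.*.*.*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Sample groups are illustrative only.
for grp in 232.1.2.3 225.1.2.3; do
    if is_ssm_group "$grp"; then
        echo "$grp: SSM range, (S,G) join, register tunnel not involved"
    else
        echo "$grp: outside SSM range, PIM-SM with RP and pimreg"
    fi
done
```

A group outside 232/8 is the case that exercises the register tunnel, so that is the one to use when reproducing this issue.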

realdream commented 3 years ago

In my case the data flow (or mroute) looks like this:

BSR                               |  |       Not BSR                            
sender 1 ---> br0 ---> pimreg ----|  |----->pimreg--->br0---> receiver 1         not working
sender 2 ---> br0 ---> enp1s0 ----|  |----->enp1s0--->br0---> receiver 2         working

So in my case there is no out-of-sync issue. Instead, pimreg behaves like a broken tunnel that loses everything.

brun064 commented 3 years ago

In addition to disabling multicast_snooping, did you enable multicast_querier on the bridges? I've found that both are needed to route multicast through a bridge interface.
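
A quick way to check both bridge settings is to read them from sysfs. This is a sketch only: `br0` is the bridge name from the report, the function name `bridge_mcast_status` is invented here, and the `SYSFS_NET` override exists just so the function can be tried against a copy of the tree.

```shell
#!/bin/sh
# Sketch: report the two bridge multicast knobs discussed above.
# SYSFS_NET is an illustrative override, normally /sys/class/net.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

bridge_mcast_status() {
    br="$1"
    dir="$SYSFS_NET/$br/bridge"
    if [ ! -d "$dir" ]; then
        echo "$br: not a bridge (no $dir)" >&2
        return 1
    fi
    echo "$br multicast_snooping=$(cat "$dir/multicast_snooping")"
    echo "$br multicast_querier=$(cat "$dir/multicast_querier")"
}

# Failure is ignored so the script also runs on a host without br0.
bridge_mcast_status br0 || true

# To change the settings (as root):
#   echo 0 > /sys/class/net/br0/bridge/multicast_snooping
#   echo 1 > /sys/class/net/br0/bridge/multicast_querier
```

The function only reads state. The two `echo` lines in the trailing comment are the part that changes it, and they do not persist across reboots.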

troglobit commented 3 years ago

I'm not sure if I fixed this bug, but I just pushed a set of changes to the master branch that at least seems to work better for me. I can now see multicast data coming in over the pimreg interface on the RP. Maybe you can give it a spin when you have the time?

realdream commented 3 years ago

I ran my test case again with the latest master (9f758e). Still the same result. I also tried enabling and disabling the querier on the bridge.

troglobit commented 3 years ago

Hmm, OK. I just tried swapping the sender and receiver (should've tested that before), and now I don't get any data at all. Like you. I'll investigate this further, but cannot make any commitments since this is the last day of my vacation.

The issue with "Operation not permitted" is definitely caused by the firewall/nat. It's the kernel responding EPERM on a sendto() syscall. This usually happens when a (usually implicit) block rule is hit in the firewall. I recommend users of pimd to not try and run it with NAT, it wasn't designed for that. See for example issue #126 for the troubles that can ensue. Instead, use a GRE tunnel to connect sites, or GRE over IPsec, or a plain OpenVPN tunnel.
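
As a rough illustration of the GRE alternative, the sketch below only prints the iproute2 commands rather than running them. Everything specific in it is an assumption: the tunnel name `gre1`, the documentation-range endpoint addresses, the inner /30 subnet, and the helper name `gre_tunnel_cmds`.

```shell
#!/bin/sh
# Sketch: print (do not run) iproute2 commands for a GRE tunnel between
# two sites. All names and addresses are placeholders.
gre_tunnel_cmds() {
    name="$1"; local_ip="$2"; remote_ip="$3"; inner="$4"
    echo "ip tunnel add $name mode gre local $local_ip remote $remote_ip ttl 64"
    echo "ip addr add $inner dev $name"
    echo "ip link set $name up multicast on"
}

# Review the output, then run it as root on one end; swap local/remote
# and use the other inner address on the far end.
gre_tunnel_cmds gre1 192.0.2.1 198.51.100.1 10.255.0.1/30
```

With the tunnel up on both ends, it would be enabled in pimd like any other interface, for example `phyint gre1 enable` in the same style as the configuration at the top of this issue.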

realdream commented 3 years ago

Yes, "Operation not permitted" is caused by NAT. I also tried disabling NAT and got pimd[106960]: find_route: Not a valid host (0.0.0.0) ... at commit b41fb72, but that message no longer appears at commit 9f758e. However, the broken pimreg issue is still there.

troglobit commented 3 years ago

I'll make an effort during the weekend to hunt this one down.

troglobit commented 2 years ago

This took a lot longer to get back to than I anticipated. I've now set up a few automated tests to more easily reproduce issues like this, and in the first tests to actually require an RP I ran into the same issue. Then it struck me like a ton of bricks: rp_filter!

The reason pimd worked better for me in the past, it seems, is that I ran an older version of Ubuntu back then. Since then they've changed their defaults in /etc/sysctl.d/10-network-security.conf to enable rp_filter=2, i.e. "loose" mode. Even though "loose" is better than "strict" mode, it doesn't really help decapsulated traffic that comes in on pimreg when there's no reverse path to the (encapsulated) source IP. Linux happily drops the packet in skb_tunnel_rx() ...

Only way around it, that I can see with pimd, is to disable rp_filter on all interfaces used for multicast routing. In my tests (will push later tonight CET), this is what I've done and had successful results with.
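
To see what a given host is actually doing, note that Linux applies the larger of `conf/all/rp_filter` and the per-interface value, so both have to be 0 for the filter to be off. The sketch below reports that effective value. The interface names are the ones from this issue, the function name `effective_rp_filter` and the `PROC_CONF` override are invented for illustration, and the `pimreg` entry is assumed to exist only while pimd is running.

```shell
#!/bin/sh
# Sketch: print the rp_filter value the kernel applies to an interface,
# which is max(conf/all/rp_filter, conf/<if>/rp_filter).
# PROC_CONF is an illustrative override for trying this on a copy.
PROC_CONF="${PROC_CONF:-/proc/sys/net/ipv4/conf}"

effective_rp_filter() {
    all=$(cat "$PROC_CONF/all/rp_filter" 2>/dev/null) || return 1
    ifv=$(cat "$PROC_CONF/$1/rp_filter" 2>/dev/null) || return 1
    if [ "$all" -gt "$ifv" ]; then echo "$all"; else echo "$ifv"; fi
}

# Interface names taken from the report; adjust for your setup.
for ifname in br0 enp1s0 pimreg; do
    if val=$(effective_rp_filter "$ifname"); then
        echo "$ifname: effective rp_filter=$val"
    else
        echo "$ifname: no rp_filter entry found"
    fi
done

# To disable (as root), both the global and per-interface values:
#   sysctl -w net.ipv4.conf.all.rp_filter=0
#   sysctl -w net.ipv4.conf.br0.rp_filter=0
```

Any interface that still reports 1 or 2 here is a candidate for the drop described above. The `sysctl -w` lines are not persistent, so a matching entry under /etc/sysctl.d/ would be needed to survive a reboot.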

troglobit commented 2 years ago

There, it finally works on someone else's computer as well: https://github.com/troglobit/pimd/actions/runs/1255356383. Tests are available in the new test/ subdirectory; maybe not entirely readable shell script, apologies.

troglobit commented 2 years ago

Closing issue. There is now a Troubleshooting Checklist that mentions rp_filter.