troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License

pimd slow to add MFC entries #58

Open ryangubele opened 8 years ago

ryangubele commented 8 years ago

Hi @troglobit,

I'm working with pimd with the following topology:

Four networks: n1, n2, n3, n4. n1 is a management LAN. Multicast sources are on n2. n3 is an OpenVPN TAP tunnel. Multicast receivers are on n4.

RP1 is the rendezvous point, a Cisco router. It is connected to n1 and n2. SRC is the OpenVPN server, running Ubuntu 14.04. It is connected to n1, n2, and n3. PI1 is the client, running Raspbian. It is connected to n3 and n4.

SRC and PI1 are running pimd. RP1 is the Rendezvous Point.

Unicast routing protocol is OSPF and appears to be stable.

n1 is 10.153.1.0/24, n2 is 10.153.255.0/24, n3 is 10.153.251.0/24, n4 is 10.153.4.128/27, and n5 is 192.168.88.0/24. n5 is the network PI1 is connected to for general connectivity, but it is not part of the multicast routing topology, since the router connecting n5 to n1 is not PIM-capable; hence the tunnels. It is part of the OSPF area, but the connected route takes precedence over the learned one on PI1, and I've installed a pinhole route to the VPN server. It's a bit of a hack, but these streams aren't meant to run on n5 anyway; it's just a testing area.

pimd seems to be working properly, except that it takes an extremely long time (anywhere from 5 seconds to 60 seconds) to add the MFC entry to the kernel routing table and start forwarding packets. With two pimds between n4 and n2, that makes for extremely slow and arbitrary starts to multicast streams. Replacing SRC and PI1 with Cisco gear, the time is on the order of milliseconds (I haven't measured it precisely, but it's fast enough that I can't count it in my head and it feels "instant," which implies it's probably less than 100 ms).

Everything works fine, so long as I'm willing to wait a long time for my stream to start. igmpproxy works fine and has response speeds similar to the Cisco gear (with occasional blips of up to a few seconds that the Ciscos don't experience), so this seems to be an issue with pimd specifically.

From the debug log, it looks like pimd is resolving everything to the correct interfaces and is building the correct route internally. It just takes too long to update the MFC. The receiving devices often time out before it happens and leave/join, which causes the whole process to start over, and can result in a stream taking many minutes to start.

This is currently a static RP setup with only RP1 as the RP for all the streams. The streams are all in 239.0.1.0/24.

pimd version 2.3.0.

ryangubele commented 8 years ago

I can't seem to attach the full debug logs, and they are too long to paste as a comment, so here are some of the relevant sections. Let me know how to get the full logs to you if you need them.

On PI1:

Initial group report:

15:15:00.949 accept_group_report(): igmp_src 10.153.4.131 ssm_src 239.0.1.10 group 239.0.1.10 report_type 22
15:15:00.949 accept_group_report(): al_pv=2

Takes 2 minutes 25 seconds to start forwarding:

15:17:25.251 move_kernel_cache: SG
15:17:25.251 Added kernel MFC entry src 10.153.255.2 grp 239.0.1.10 from tap0 to eth1

First mention of the group on src1:

15:14:44.970 Cache miss, src 10.153.255.2, dst 239.0.1.10, iif 1
15:14:44.970 findroute: Group not found. Return the (*,*,RP) entry
15:14:44.970 Switch shortest path (SPT): src 10.153.255.2, group 239.0.1.10
15:14:44.970 create group entry, group 239.0.1.10

PIM join:

15:15:00.483 Received PIM JOIN/PRUNE from 10.153.251.10 on tap1
15:15:00.483 Received PIM JOIN from 10.153.251.10 to group 239.0.1.10 for multicast source 10.153.255.1 on tap1

MFC, 5 seconds later:

15:15:05.002 Added kernel MFC entry src 10.153.255.2 grp 239.0.1.16 from eth1 to tap0
15:15:05.002 Cache miss, src 10.153.255.2, dst 239.0.1.51, iif 1

So the first pimd took 5 seconds to start forwarding, compared to milliseconds on Cisco gear. The second took minutes. It seems to be much faster after a fresh start (which I did to capture the debug logs), but if I were to leave these running with debug logging for some time, it would be easy to spot examples of each hop taking 20 seconds or more to start forwarding.

troglobit commented 8 years ago

Hi @ryangubele! Yeah, I've noticed this sluggishness myself. Have no clue really what's up ...

I'm acting more as a focal point for patches, making sure to integrate them and release semi-regularly. My knowledge of pimd's internals is unfortunately not that deep yet. Sorry!

So, if you figure out where (in the code) the bottleneck is, I'm more than happy to look at it, or even better, merge a patch for it.

am88b commented 8 years ago

Hello,

I was just about to post about the same issue. I also found a similar problem reported here (no solution): https://www.ietf.org/mail-archive/web/pim/current/msg02305.html

My lab environment is the following:

[MC streamer] -> [SW] -> [router] -> [another SW] -> [MC client]

If [MC client] asks for a stream, it can take up to a minute or more before [router] actually creates the route ("Added kernel MFC entry..."). The time it takes for the multicast stream to start flowing seems completely random to me.

The router runs Debian with kernel 3.16.0-0.bpo.4-amd64 and pimd 2.3.0.

I tried to play around with different timeout/period values (both in pimd.conf and in the source header files), and pretty much tried everything that made sense to me, but didn't find a better solution. I also tried to debug the code, but so far I've only learned that the route is not created because the !VIFM_ISEMPTY(mrt->oifs) check fails (route.c:875).
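
To make that concrete, here is a toy stand-in for the kind of gate I mean; the vifbitmap type and the VIFM_* macros below are simplified placeholders, not pimd's real definitions:

/* Toy stand-in for the gate at route.c:875: pimd keeps a route's outgoing
 * interfaces as a VIF bitmap and only installs a kernel MFC entry when that
 * bitmap is non-empty.  Types and macros here are simplified placeholders. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t vifbitmap_t;                      /* one bit per virtual interface */
#define VIFM_ISEMPTY(bitmap)   ((bitmap) == 0)
#define VIFM_SET(vifi, bitmap) ((bitmap) |= 1u << (vifi))

struct mrtentry {
    vifbitmap_t oifs;                              /* outgoing interfaces */
};

static void maybe_install_mfc(struct mrtentry *mrt)
{
    if (!VIFM_ISEMPTY(mrt->oifs))
        printf("would add kernel MFC entry now\n");
    else
        printf("oifs empty -> cache miss effectively ignored, no MFC entry\n");
}

int main(void)
{
    struct mrtentry mrt = { .oifs = 0 };

    maybe_install_mfc(&mrt);                       /* nothing installed yet */
    VIFM_SET(3, mrt.oifs);                         /* e.g. a PIM JOIN adds vif 3 */
    maybe_install_mfc(&mrt);                       /* now the route would be added */
    return 0;
}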

ryangubele commented 8 years ago

I'm more perl/python and less C, but I can dig around. Would definitely appreciate anyone else's assistance on this though.

troglobit commented 8 years ago

The more the merrier! ;)

If someone comes up with an idea, we can start messing about on a branch. I can help to the best of my abilities to facilitate (at least I know C fairly well after 20 years now ;), but I'm fairly limited in what I can test ... so maybe we can complement each other?

We can use this issue for syncing; I think you can even reply by email ...

am88b commented 8 years ago

I would be glad to test and debug, but right now I am out of ideas what to do.

Some thoughts:

1) I tried to check the debug logs and understand the difference between simply passing through the process_cache_miss() function and actually creating the route, but didn't see any pattern.

2) What is "mrt->oifs" (multicast output interfaces?), and why might it be empty at this point but OK some time later (when the route is created)?

3) Is there any "#define" parameter or "/proc" value worth changing to make things generally 'snappier', or to make some timeouts smaller (possibly even unsuitable for a live environment), to see where the problem might be?

troglobit commented 8 years ago

@am88b:

  1. That's a bit of a mystery to me as well, but I think process_cache_miss() is where you arrive when you receive an unknown multicast stream. I.e., the kernel alerts you: "Hey, there's some multicast coming in here ... what now?" (A bare-bones sketch of that upcall follows after this list.)
  2. mrt->oifs is, as you rightly guessed, the set of outbound interfaces for a multicast route "mrt".
  3. Nope, not that I'm aware of; pimd is very standalone and was made for generic UNIX (*BSD) systems where /proc doesn't even exist. Messing about with the -d debug flags may help though.
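
For anyone new to this corner of the stack, here is a bare-bones sketch of that kernel upcall; it is not pimd's actual code (pimd's kernel/IGMP handling is far more involved), just the minimal Linux API as I understand it:

/* Bare-bones sketch of the kernel "cache miss" upcall: a routing daemon
 * opens a raw IGMP socket, enables multicast routing with MRT_INIT, and the
 * kernel then delivers struct igmpmsg notifications (im_msgtype ==
 * IGMPMSG_NOCACHE) whenever traffic arrives with no matching MFC entry.
 * Needs root; error handling omitted for brevity. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <linux/mroute.h>

int main(void)
{
    int sd = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
    int on = 1, i;

    setsockopt(sd, IPPROTO_IP, MRT_INIT, &on, sizeof(on));   /* become the multicast router */

    for (i = 0; i < 10; i++) {
        char buf[2048], src[INET_ADDRSTRLEN], grp[INET_ADDRSTRLEN];
        ssize_t len = read(sd, buf, sizeof(buf));
        struct igmpmsg *msg = (struct igmpmsg *)buf;          /* upcalls overlay the IP header */

        if (len < (ssize_t)sizeof(*msg) || msg->im_mbz != 0)
            continue;                                         /* ordinary IGMP, not an upcall */

        if (msg->im_msgtype == IGMPMSG_NOCACHE) {
            inet_ntop(AF_INET, &msg->im_src, src, sizeof(src));
            inet_ntop(AF_INET, &msg->im_dst, grp, sizeof(grp));
            printf("cache miss: src %s grp %s vif %d\n", src, grp, msg->im_vif);
            /* a real daemon would resolve iif/oifs here and install an MFC entry */
        }
    }

    setsockopt(sd, IPPROTO_IP, MRT_DONE, NULL, 0);
    close(sd);
    return 0;
}

Note that only one process can hold the MRT_INIT socket at a time, which is why you can't run two multicast routing daemons side by side.
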
am88b commented 8 years ago
  1. "Unknown" as if "bad, I don't know about it" or "Unknown" as if "unprocessed for this time period" ?
  2. Do you know in what point (theoretically and also in code) "mrt->oifs" are populated? I guess I should continue debugging there...
  3. All tests were made with "-f --loglevel=7 --debug=all" so not much help there.
troglobit commented 8 years ago
  1. The latter: inbound, not-yet-handled multicast from a sender (since pimd acts as a querier, it will per the RFC always receive all multicast on a LAN).
  2. I believe it starts when a PIM router receives a PIM_JOIN message, so in pim_proto.c:receive_pim_join_prune().

The bare necessities for getting PIM-SM/SSM running on a set of routers are that the unicast routing table is set up properly, i.e., so that the end nodes (multicast sender and receiver) can ping each other. Also, the sender of the multicast must use a TTL > 1 ... the default in most operating systems is to set TTL=1 when the destination IP address is a multicast group.
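
If in doubt about the sender side, a minimal test sender that bumps the TTL looks roughly like this (the group and port are just examples):

/* Minimal multicast test sender that raises the TTL so the traffic survives
 * more than one routed hop.  Group 239.0.1.10 and port 1234 are examples
 * only; error handling omitted for brevity. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sd = socket(AF_INET, SOCK_DGRAM, 0);
    unsigned char ttl = 16;                       /* OS default is usually 1 */
    struct sockaddr_in grp = { 0 };

    setsockopt(sd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

    grp.sin_family = AF_INET;
    grp.sin_port   = htons(1234);
    inet_pton(AF_INET, "239.0.1.10", &grp.sin_addr);

    sendto(sd, "test", 4, 0, (struct sockaddr *)&grp, sizeof(grp));
    close(sd);
    return 0;
}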

idismmxiv commented 8 years ago

I took a quick look into this. Not sure if this is the case, but I'll throw in my two cents anyway. It looks like the MFC entry is added only after a "cache miss". Once a cache miss occurs and there are no PIM JOINs for that multicast group, that cache miss is more or less ignored by pimd. Now, ipmr.c in the Linux kernel buffers packets matching that cache miss (unresolved). It will not send another cache miss to pimd if one has already been sent. There seems to be a buffer for a few packets, and once that buffer fills up, the rest are dropped. This unresolved buffer is cleared by a timer (ipmr_expire_process). I couldn't figure out how often that buffer is cleared. Once the buffer is cleared, one cache miss is sent to pimd again.

So, one possible cause of this random delay could be this cache-miss process from the kernel to pimd.

am88b commented 8 years ago

@troglobit: I checked, the unicast routing table is OK - I verified ping works both ways. Also, the MC traffic has TTL=16. (IGMP traffic has TTL=1, but I think that is how it is supposed to be.)

@idismmxiv: Thank you for the insight. It sounds promising. Are you planning to take a deeper look? Is there some way I can help debug this?

troglobit commented 8 years ago

@idismmxiv Good catch! :smiley: Hopefully someone can have a look at it to see if it has any bearing on this issue.

@am88b Yep, IGMP traffic should have TTL=1.

idismmxiv commented 8 years ago

Continued playing with this. I only tested the config [MC streamer] -> [SW] -> [router] -> [another SW] -> [MC client], where the router was the RP and BSR as well. I got it working fast by adding the MFC entry when a new leaf joins the multicast stream. Here is a pure hack for you to test if you want, and to think about an actual fix. I didn't want to add a new parameter to the change_interfaces() function right away, as it would have to be added in so many places. I suppose there are other places as well where similar MFC creation should be added.

diff --git a/route.c b/route.c
index f76a933..8256a02 100644
--- a/route.c
+++ b/route.c
@@ -237,6 +237,8 @@ int set_incoming(srcentry_t *src, int type)
 }
 
+static uint32_t group_for_testing;
+
 /*
@@ -491,6 +496,7 @@ int change_interfaces(mrtentry_t *mrt,
     mrtentry_t *srcs;
     mrtentry_t *mwc;
     mrtentry_t *mrp;

am88b commented 8 years ago

@idismmxiv Thank you for the patch! Unfortunately I didn't notice any change whatsoever... I was probably testing the wrong thing, but anyway it didn't fix the original issue I had...

However, while testing this patch and reading your very useful comments, I started to think, and I discovered something: the creation of the MFC entry is slow only when the "Iif" is unresolved.

E.g., something like this:

1) pimd is started; "ip mroute" on the router shows: (10.222.4.45, 237.0.0.37) Iif: unresolved
2) the first client attaches (took about 60 seconds); "ip mroute" shows: (10.222.4.45, 237.0.0.23) Iif: eth1.241 Oifs: eth1.201
3) the second client attaches (less than 1 second!); "ip mroute" shows: (10.222.4.45, 237.0.0.23) Iif: eth1.241 Oifs: eth1.201 eth1.202
4) the second client leaves and rejoins later - still takes less than 1 second!
5) but if both the first and second client leave: (10.222.4.45, 237.0.0.37) Iif: unresolved
6) ...then when any client tries to join, it again takes about 60 seconds

The same behavior is observed both with the original version and with your patch.

idismmxiv commented 8 years ago

Unfortunately I am not able to reproduce the problem you have, @am88b. I do not have a proper environment to test with, and testing with CORE (http://www.nrl.navy.mil/itd/ncs/products/core) does not seem to bring the issue up. The delay I saw was just a few seconds before the next "cache miss" triggered MFC creation. As a short summary you have:

am88b commented 8 years ago

My environment:

The command I am using simply waits for the stream to appear and then exits, so with 'time' I can see how long it took for each run.

Yes, it wasn't me who was testing with the Cisco router. In my case there are a couple of switches in between (Dell and Ubiquiti), but as far as I have researched they should be OK, and they are in 'broadcast all multicast' mode, so they should not matter.

Here is the current pimd.conf I use for the main router (note that I have insanely low timers and priorities everywhere to get fast results, but I am not aware that they make any difference anyway... I have tried the defaults too):


default-route-distance 0
default-route-metric 0
hello-interval 30

phyint eth1:5 enable igmpv2 dr-priority 4294967294
phyint eth1.201:6 enable igmpv2 dr-priority 4294967294
phyint eth1.241:696 enable igmpv2 dr-priority 4294967294
phyint eth1.242:698 enable igmpv2 dr-priority 4294967294

igmp-query-interval 12
igmp-querier-timeout 42

bsr-candidate M.Y.I.P priority 0
rp-candidate M.Y.I.P time 10 priority 0 group-prefix 224.0.0.0 masklen 4
rp-address M.Y.I.P 224.0.0.0/4
spt-threshold packets 0 interval 1

idismmxiv commented 8 years ago

I tried a similar kind of configuration using CORE as the test environment, but no luck. So, unfortunately, I cannot help here. Btw, the bsr-candidate and rp-candidate attributes are unnecessary, as you are using a static RP defined by the rp-address attribute. But that should not cause the problems you are facing.

am88b commented 8 years ago

I have had some interesting progress. I made a hack to line 876 of route.c so that the check is always true ("if (1==1)"), and after that multicast starts moving immediately (~0-4 seconds)!

Also, now when I start pimd, within about 60 seconds I see "(10.222.4.45, 237.0.0.48) Iif: eth1.241" in "ip mroute", even without any client trying to connect. And when clients connect, they get their stream immediately, and even if the last client disconnects, "ip mroute" does not show "Iif: unresolved" like I mentioned in previous posts.

However, I am aware that with a hack like that I probably broke something along the way...

troglobit commented 8 years ago

PR #67 is very interesting for all concerned in this issue. Please have a look, if you can!

kopren commented 7 years ago

We have found that the net/ipv4/ipmr.c Linux kernel code limits the maximum number of unresolved multicast groups in the Multicast Routing Table (MRT) to 10 or fewer, within the ipmr_cache_unresolved function. The unresolved cache is flushed, and what appears to be a random set of 10 unresolved multicast groups is rebuilt in the MRT every second. We are working in an environment where we have a great deal more than 10 multicast source streams, and only a few are actually joined on other interfaces. This results in a great deal more than 10 possible unresolved entries and many cache misses. We have recompiled the Linux kernel using the following ipmr.c patch (it removes the 10-entry unresolved cache limit) and can now quickly create multicast routes from the unresolved entries.

--- ipmr.c_org  2016-09-13 15:36:44.000000000 +0000
+++ BUILD/kernel-2.6.32-642.4.2.el6/linux-2.6.32-642.4.2.el6.mc_unresolv.x86_64/net/ipv4/ipmr.c 2016-09-12 22:27:44.447442385 +0000
@@ -979,8 +979,7 @@
         *  Create a new entry if allowable
         */

-       if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
-           (c = ipmr_cache_alloc_unres()) == NULL) {
+       if ((c = ipmr_cache_alloc_unres()) == NULL) {
            spin_unlock_bh(&mfc_unres_lock);

            kfree_skb(skb);
ryangubele commented 7 years ago

@kopren: Thanks for this information. I made the change to my kernel (I actually increased the limit rather than abolishing it) and at first blush, it seems to have solved the issue.
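
For reference, "increasing the limit" is just kopren's one-liner with a larger constant; the 1024 below is only an example value, not necessarily what I ended up using:

       if (atomic_read(&mrt->cache_resolve_queue_len) >= 1024 ||   /* was 10 */
           (c = ipmr_cache_alloc_unres()) == NULL) {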

I will need to rebuild the rest of my lab and test further to confirm for certain. At the moment, I have only listeners off the pimd -- I don't have listeners behind a second pimd, which is the target scenario. I will report back hopefully in a few days.

@troglobit: Note that so far, I'm seeing good results even without PR #67. It may be that this was never a pimd issue to begin with, and igmpproxy works because its mechanism is a bit different, or because of its different position in my network (i.e., it isn't exposed to hundreds of active groups like pimd is).

kopren commented 7 years ago

Great to hear your initial results were good. I opted to remove the limit to eliminate the chance of a denial of service. Given what I have observed, if we leave the unresolved limit at 10 or some other finite number, a user on the LAN would only need to output a number of low-rate multicast groups, say 2 or 3 times whatever the limit is, and it would lead to some indeterminate time for the multicast routes to be created by pimd (tens of seconds to minutes). I have tested the unlimited code in our lab with over 508 unresolved entries on my RHEL machine and it seems to work fine (top reports 65% idle with around 25 1-Mbit/sec multicast streams routed, out of the 508 unresolved available). I have attached my test configuration if it helps. MC_ENV.pdf

idismmxiv commented 7 years ago

This is interesting. So the problem only occurs if there are more than 10 MC streams initiated simultaneously? And in that case it is random which streams get delayed?

ryangubele commented 7 years ago

@idismmxiv I can't say for sure. I can confirm it's random groups that get delayed, and the length of the delay is random. Sometimes it was pretty quick, other times, minutes. But I have never actually exposed pimd to anything less than about 80 active groups, except of course for the downstream pimds, so I have no basis for comparison.

What I have observed so far is that if I hook a VPN client to SRC and send IGMP Group Membership Reports for, say, 30 groups, those groups all have their routes built within 100 ms of each other. Previously, what would happen is: initially nothing would be built on SRC, then a few routes would appear, then a few more, then a few more, and somewhere between 5 and 15-ish minutes later the final routes would appear to make 30 total entries. What I haven't yet tested is:

  1. What is the time between an IGMP Group Membership Report and the corresponding routes actually being created (or, alternatively, the first packets actually arriving)? Just because all 30 routes are created within 100 ms of each other doesn't mean that it didn't still take, say, 90 seconds to create the entries. But in the (limited) testing I've done so far, it has never taken more than 30 seconds for all of my multicast routes to reappear after bouncing pimd, which is a huge improvement over 5 or 10 or 15 minutes to recover from a restart. That said, I just did a quick, dirty, very unscientific test, and it does still seem like it's taking 10 or 15 seconds to get a route added, which is much too slow for my use case but still considerably better than it was. Perhaps your PR will resolve this, and my life will be dandy.
  2. Whether this has any effect on PIM Joins, i.e., the scenario where the IGMP Group Membership Reports are on n4, being picked up by a downstream pimd that is the DR for that network, and then joined towards SRC. I need to rebuild my lab to test this. I expect that it will help: even though they are different code paths in pimd, they pretty much end up doing basically the same thing with basically the same kernel interfaces. At least, as far as I understand. Please correct me if I'm wrong.
ka9q commented 5 years ago

Was this problem ever resolved? I found this discussion last night while trying to solve the same problem in my situation. Even though this discussion is several years old, I see that the hard-coded limit of 10 is still present in the latest Linux kernel (4.17.8). So I submitted the attached bug report to the kernel maintainers:

bug.txt

troglobit commented 5 years ago

@ka9q Thanks Phil for reporting this to the Linux kernel devs, and also very well written I must say!

bewing commented 4 years ago

@ka9q It looks like the patch got kind of ignored -- were you planning to resubmit?

troglobit commented 4 years ago

A fix to the hard-coded limit was committed to Linux on Sep 6, 2019. It seems to be available as of Linux 5.4.

Hence, I think we can now finally close this one.

bewing commented 4 years ago

Another possible tweak?

Perhaps pimd (or the kernel) could maintain a fastdrop/don't-care list (cache miss -> add an eBPF drop rule, or some other method to drop/ignore the frame) and resolve that entry, then scan the drop list when adding a new group and remove entries if necessary? That would pretty effectively manage the cache size, and would be necessary when being flooded with multicast that has no current output interfaces (e.g. another pimd on the LAN forwarding traffic). A rough sketch of the idea follows.
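
Very rough sketch of what I have in mind, assuming an XDP hook and a map that the daemon would populate on cache miss (all names are made up; nothing like this exists in pimd today):

/* Hypothetical "fastdrop list": an XDP program that drops IPv4 packets whose
 * multicast destination group is in a map maintained by the daemon.
 * Everything here is illustrative only. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);                 /* group address, network byte order */
    __type(value, __u8);
} fastdrop_groups SEC(".maps");

SEC("xdp")
int drop_unwanted_groups(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr  *ip;

    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    /* only consider 224.0.0.0/4 destinations */
    if ((ip->daddr & bpf_htonl(0xf0000000)) != bpf_htonl(0xe0000000))
        return XDP_PASS;

    if (bpf_map_lookup_elem(&fastdrop_groups, &ip->daddr))
        return XDP_DROP;                /* group is on the don't-care list */

    return XDP_PASS;                    /* unknown group: let the kernel raise a cache miss */
}

char _license[] SEC("license") = "GPL";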