ryangubele opened 9 years ago
I can't seem to attach the full debug logs, and they are too long to paste as a comment, so here are some of the relevant sections. Let me know how to get the full logs to you if you need them.
On PI1:
Initial group report:
15:15:00.949 accept_group_report(): igmp_src 10.153.4.131 ssm_src 239.0.1.10 group 239.0.1.10 report_type 22
15:15:00.949 accept_group_report(): al_pv=2
Takes 2 minutes 25 seconds to start forwarding:
15:17:25.251 move_kernel_cache: SG
15:17:25.251 Added kernel MFC entry src 10.153.255.2 grp 239.0.1.10 from tap0 to eth1
First mention of the group on src1:
15:14:44.970 Cache miss, src 10.153.255.2, dst 239.0.1.10, iif 1
15:14:44.970 findroute: Group not found. Return the (*,*,RP) entry
15:14:44.970 Switch shortest path (SPT): src 10.153.255.2, group 239.0.1.10
15:14:44.970 create group entry, group 239.0.1.10
PIM join:
15:15:00.483 Received PIM JOIN/PRUNE from 10.153.251.10 on tap1
15:15:00.483 Received PIM JOIN from 10.153.251.10 to group 239.0.1.10 for multicast source 10.153.255.1 on tap1
MFC, 5 seconds later:
15:15:05.002 Added kernel MFC entry src 10.153.255.2 grp 239.0.1.16 from eth1 to tap0
15:15:05.002 Cache miss, src 10.153.255.2, dst 239.0.1.51, iif 1
So the first pimd took 5 seconds to start forwarding, compared to milliseconds on Cisco gear. The second took minutes. It seems to be much faster after a fresh start (which I did to capture the debug logs), but if I were to leave these running with debug logging for some time, it would be easy to spot examples of each hop taking 20 seconds or more to start forwarding.
Hi @ryangubele! Yeah, I've noticed this sluggishness myself. Have no clue really what's up ...
I'm acting more as a focal point for patches, making sure to integrate them and to release semi-regularly. My knowledge of pimd's internals is unfortunately not that deep yet. Sorry!
So, if you figure out where (in the code) the bottleneck is, I'm more than happy to look at it, or even better, merge a patch for it.
Hello,
I was just going to post about the same issue. I also found a similar problem reported here (no solution): https://www.ietf.org/mail-archive/web/pim/current/msg02305.html
My lab environment is the following:
[MC streamer] -> [SW] -> [router] -> [another SW] -> [MC client]
If [MC client] asks for a stream, it can take up to a minute or more before [router] actually creates the route with "Added kernel MFC entry...". The time it takes for the multicast stream to actually start flowing seems completely random to me.
The router runs Debian with kernel 3.16.0-0.bpo.4-amd64 and pimd 2.3.0.
I tried playing around with different timeout/period values (both in pimd.conf and in the source header files) and pretty much tried everything that made sense to me, but didn't find anything better. I also tried to debug the code, but so far I have only learned that the route is not created because the !VIFM_ISEMPTY(mrt->oifs) check at route.c:875 fails.
I'm more perl/python and less C, but I can dig around. Would definitely appreciate anyone else's assistance on this though.
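For reference, here is roughly the shape of the guard in question as I read it (a simplified sketch only, not the actual route.c code):

```c
/* Simplified sketch (not the actual route.c source) of the guard discussed
 * above: on a kernel cache miss, the MFC entry is only installed when the
 * matching (S,G) entry already has at least one outgoing interface set. */
#include <stdint.h>

typedef uint32_t vifbitmap_t;                 /* one bit per virtual interface */
#define VIFM_ISEMPTY(m)    ((m) == 0)         /* no outgoing interfaces yet    */

struct mrtentry {
    vifbitmap_t oifs;                         /* outgoing interface bitmap     */
};

static void on_cache_miss(struct mrtentry *mrt)
{
    if (!VIFM_ISEMPTY(mrt->oifs)) {
        /* receivers are known: install the kernel MFC entry,
         * forwarding starts now */
    } else {
        /* oifs still empty: this cache miss is effectively ignored,
         * so forwarding only starts on some later cache miss */
    }
}
```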
The more the merrier! ;)
If someone comes up with an idea, we can start messing about on a branch. I can help facilitate to the best of my abilities (at least I know C fairly well after 20 years now ;), but I'm fairly limited in what I can test ... so maybe we can complement each other?
We can use this issue for syncing; I think you can even reply by email ...
I would be glad to test and debug, but right now I am out of ideas about what to do.
Some thoughts:
1) I checked the debug logs, trying to understand the difference between simply passing through the "process_cache_miss()" function and actually creating the route, but I didn't see any pattern.
2) What is "mrt->oifs"? (multicast output interfaces?) And why might it be empty at this point but OK some time later (when the route is created)?
3) Is there any "#define" parameter or "/proc" value to try changing, maybe to make things generally 'snappier' or to make some timeouts smaller (possibly even unsuitable for a live environment), to see where the problem might be?
@am88b:
1. That's a bit of a mystery to me as well, but I think process_cache_miss() is where you arrive when you receive an unknown multicast stream. I.e., the kernel alerts you: "Hey, there's some multicast coming in here ... what now?"
2. mrt->oifs is, as you rightly guessed, the set of outbound interfaces for a multicast route "mrt".
3. pimd is very stand-alone and was made for generic UNIX (*BSD) systems where /proc doesn't even exist. Messing about with the -d debug flags may help, though. The PIM join/prune handling, for what it's worth, lives in pim_proto.c:receive_pim_join_prune().
The bare necessities of getting PIM-SM/SSM running on a set of routers are that the unicast routing table is set up properly, i.e., so that the end nodes (multicast sender and receiver) can ping each other. Also, the sender of the multicast must use a TTL > 1 ... the default in most operating systems is to set TTL=1 when the destination IP address is a multicast group.
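Just to illustrate the TTL part (a minimal sketch with made-up group and port values, nothing pimd-specific), a sender can raise the multicast TTL explicitly:

```c
/* Minimal sender sketch: raise the multicast TTL above 1 so the stream can
 * actually cross a router.  Group and port are made-up example values. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    unsigned char ttl = 16;                            /* anything > 1 */

    setsockopt(sock, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

    struct sockaddr_in grp = {
        .sin_family = AF_INET,
        .sin_port   = htons(5000),                     /* example port */
    };
    inet_pton(AF_INET, "239.0.1.10", &grp.sin_addr);   /* example group */

    sendto(sock, "test", 4, 0, (struct sockaddr *)&grp, sizeof(grp));
    close(sock);
    return 0;
}
```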
I took a quick look into this. Not sure if this is the case, but I'll throw in my two cents anyway. It looks like the MFC entry is added only after a "cache miss". Once a cache miss occurs and there are no PIM JOINs for that multicast group, that cache miss is more or less ignored by pimd. Now, ipmr.c in the Linux kernel buffers packets matching that cache miss (unresolved). It will not send another cache miss to pimd if one has already been sent. There seems to be a buffer for only a few packets, and once that buffer gets filled, the rest are dropped. This unresolved buffer is cleared by a timer (ipmr_expire_process). I couldn't figure out how often that buffer is cleared. Once the buffer is cleared, one cache miss is sent to pimd again.
So, one possible cause of this random delay could be this cache-miss handshake between the kernel and pimd.
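In pseudo-C, the flow I think I'm seeing looks roughly like this (a hand-written simplification, not the real net/ipv4/ipmr.c code, and the exact queue limits here are assumptions):

```c
/* Hand-written simplification of the kernel-side flow described above
 * (NOT the real net/ipv4/ipmr.c code; the limits are assumptions).
 * The first packet for an unknown (S,G) triggers one upcall to pimd; later
 * packets are queued on the unresolved entry or dropped, until the expiry
 * timer flushes the entry and the cycle can start again. */
struct unresolved_entry {
    int upcall_sent;     /* cache miss already reported to pimd?        */
    int queued_pkts;     /* packets buffered while waiting for a route  */
};

static void handle_cache_miss(struct unresolved_entry *e)
{
    if (!e->upcall_sent) {
        e->upcall_sent = 1;
        /* send the IGMPMSG_NOCACHE upcall to pimd over the mroute socket */
    } else if (e->queued_pkts < 3) {        /* small per-entry queue (assumed) */
        e->queued_pkts++;                   /* buffer the packet for later */
    } else {
        /* queue full: drop silently; nothing more happens until the
         * expiry timer (ipmr_expire_process) clears this entry */
    }
}
```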
@troglobit: I checked, and the unicast routing table is OK - I verified ping both ways. Also, the MC traffic has TTL=16. (IGMP traffic has TTL=1, but I think that is how it is supposed to be.)
@idismmxiv: Thank you for the insight, it sounds promising. Are you planning to take a deeper look? Is there some way I can help debug this?
@idismmxiv Good catch! :smiley: Hopefully someone can have a look at it to see if it has any bearing on this issue.
@am88b Yep, IGMP traffic should have TTL=1.
Continued playing with this. I only tested the config [MC streamer] -> [SW] -> [router] -> [another SW] -> [MC client], where the router was the RP and BSR as well. I got it working fast by adding the MFC when a new leaf joins the multicast stream. Here is a pure hack for you to test if you want, and as food for thought for an actual fix. I didn't want to add a new parameter to the change_interfaces() function right away, as it would have to be added in so many places. I suppose there are other places as well where similar MFC creation should be added.
diff --git a/route.c b/route.c
index f76a933..8256a02 100644
--- a/route.c
+++ b/route.c
@@ -237,6 +237,8 @@ int set_incoming(srcentry_t *src, int type)
 }
 
+static uint32_t group_for_testing;
+
 /*
  * `source' is not used. Will be used with IGMPv3 where
@@ -491,6 +496,7 @@ int change_interfaces(mrtentry_t *mrt,
     mrtentry_t *srcs;
     mrtentry_t *mwc;
     mrtentry_t *mrp;
     }
+    }
 
     if (fire_timer_flag == TRUE)
@@ -872,6 +888,10 @@ static void process_cache_miss(struct igmpmsg *igmpctl)
@idismmxiv Thank you for the patch! Unfortunately I didn't notice any change whatsoever... I was probably testing the wrong thing, but anyway it didn't fix the original issue I had...
However, while testing this patch and reading your very useful comments, I started to think, and I discovered something: the creation of the MFC entry is slow only when the "Iif" is unresolved.
Eg, something like that:
1) pimd is started, "ip mroute" shows on the router: (10.222.4.45, 237.0.0.37) Iif: unresolved
2) first client attaches (took about 60 seconds), "ip mroute" shows: (10.222.4.45, 237.0.0.23) Iif: eth1.241 Oifs: eth1.201
3) second client attaches (less than 1 second!), "ip mroute" shows: (10.222.4.45, 237.0.0.23) Iif: eth1.241 Oifs: eth1.201 eth1.202
4) second client leaves and rejoins later - still takes less than 1 second!
5) but if both the first and second client leave: (10.222.4.45, 237.0.0.37) Iif: unresolved
6) ...then when any client tries to join, it again takes about 60 seconds
The same is observed with both original version and with your patch.
Unfortunately I am not able to reproduce the problem you have, @am88b. I do not have a proper environment to test with, and testing with CORE (http://www.nrl.navy.mil/itd/ncs/products/core) does not seem to bring the issue up. The delay I had was just these few seconds before the next "cache miss" triggered MFC creation. As a short summary you have:
My environment:
The command I am using simply waits for the stream to appear and then exits, so with 'time' I can see how long it took for each run.
Yes, it wasn't me who was testing with a Cisco router. In my case there are a couple of switches in between (Dell and Ubiquiti), but as far as I have researched they should be OK, and they are in 'broadcast all multicast' mode, so they should not matter.
Here is the current pimd.conf I use for the main router (note that I have insanely low timers and priorities everywhere to get fast results, but I'm not aware that they make any difference anyway... I have tried the defaults too):
default-route-distance 0
default-route-metric 0
hello-interval 30

phyint eth1:5 enable igmpv2 dr-priority 4294967294
phyint eth1.201:6 enable igmpv2 dr-priority 4294967294
phyint eth1.241:696 enable igmpv2 dr-priority 4294967294
phyint eth1.242:698 enable igmpv2 dr-priority 4294967294

igmp-query-interval 12
igmp-querier-timeout 42

bsr-candidate M.Y.I.P priority 0
rp-candidate M.Y.I.P time 10 priority 0
group-prefix 224.0.0.0 masklen 4
rp-address M.Y.I.P 224.0.0.0/4
spt-threshold packets 0 interval 1
I tried a similar kind of configuration using CORE as the test environment, but no luck, so unfortunately I cannot help here. Btw, the bsr-candidate and rp-candidate attributes are unnecessary, since you are using a static RP defined by the rp-address attribute. But that should not cause the problems you are facing.
I had some interesting progress. I hacked line 876 of route.c so the condition is always true ("if (1==1)"), and after that multicast starts flowing immediately (~0-4 seconds)!
Also, now when I start pimd, in about 60 seconds, I see "(10.222.4.45, 237.0.0.48) Iif: eth1.241" with "ip mroute" even without any client trying to connect. And when clients connect, they get their stream immediately and even if the last client disconnects, the "ip mroute" does not show "Iif: unresolved" like I mentioned in previous posts.
However, I am aware that with a hack like that I probably broke something along the way...
PR #67 is very interesting for all concerned in this issue. Please have a look, if you can!
We have found that the net/ipv4/ipmr.c Linux kernel code limits the maximum number of unresolved multicast groups in the Multicast Routing Table (MRT) to 10 or fewer, inside the ipmr_cache_unresolved function. The unresolved cache is flushed, and what seems to be an essentially random set of 10 unresolved multicast groups is rebuilt in the MRT every second. We are working in an environment where we have a great deal more than 10 multicast source streams, and only a few are actually joined on other interfaces. This results in far more than 10 possible unresolved entries and many cache misses. We have recompiled the Linux kernel with the following ipmr.c patch (which removes the 10-entry unresolved cache limit) and can now quickly create multicast routes from the unresolved entries.
--- ipmr.c_org 2016-09-13 15:36:44.000000000 +0000
+++ BUILD/kernel-2.6.32-642.4.2.el6/linux-2.6.32-642.4.2.el6.mc_unresolv.x86_64/net/ipv4/ipmr.c 2016-09-12 22:27:44.447442385 +0000
@@ -979,8 +979,7 @@
* Create a new entry if allowable
*/
- if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
- (c = ipmr_cache_alloc_unres()) == NULL) {
+ if ((c = ipmr_cache_alloc_unres()) == NULL) {
spin_unlock_bh(&mfc_unres_lock);
kfree_skb(skb);
@kopren: Thanks for this information. I made the change to my kernel (I actually increased the limit rather than abolishing it) and at first blush, it seems to have solved the issue.
I will need to rebuild the rest of my lab and test further to confirm for certain. At the moment, I have only listeners off the pimd -- I don't have listeners behind a second pimd, which is the target scenario. I will report back hopefully in a few days.
@troglobit: Note that so far, I'm seeing good results even without PR #67. It may be that this was never a pimd issue to begin with, and igmpproxy works because its mechanism is a bit different, or because of its different position in my network (i.e., it isn't exposed to hundreds of active groups like pimd is).
Great to hear your initial results were good. I opted to remove the limit to eliminate any chance of a denial of service. Given what I have observed, if we leave the unresolved limit at 10 or some other finite number, a user on the LAN would only need to output a number of low-rate multicast groups, say 2 or 3 times whatever the limit is, and it would lead to an indeterminate time (tens of seconds to minutes) for the multicast routes to be created by pimd. I have tested the unlimited code in our lab with over 508 unresolved entries on my RHEL machine and it seems to work fine (top reports 65% idle with around 25 1-Mbit/sec multicast streams routed, out of the 508 unresolved entries available). I have attached my test configuration if it helps. MC_ENV.pdf
This is interesting. So the problem only occurs if there are more than 10 mc streams initiated simultaneously? And in that case it is random streams that get delayed?
@idismmxiv I can't say for sure. I can confirm it's random groups that get delayed, and the length of the delay is random. Sometimes it was pretty quick, other times, minutes. But I have never actually exposed pimd to anything less than about 80 active groups, except of course for the downstream pimds, so I have no basis for comparison.
What I have observed so far is that, if I hook a VPN client to SRC and send IGMP Group Membership reports for, say, 30 groups, those groups all have the routes built within 100ms of each other. Previously, what would happen is, initially, nothing would be built on src, then a few routes would appear, then a few more, then a few more, and somewhere between 5 and 15ish minutes later, the final routes would appear to make 30 total entries. What I haven't yet tested is:
Was this problem ever resolved? I found this discussion last night while trying to solve the same problem in my situation. Even though this discussion is several years old, I see that the hard-coded limit of 10 is still present in the latest Linux kernel (4.17.8). So I submitted the attached bug report to the kernel maintainers:
@ka9q Thanks Phil for reporting this to the Linux kernel devs, and also very well written I must say!
@ka9q It looks like the patch got kind of ignored -- were you planning to resubmit?
A fix to the hard-coded limit was committed to Linux on Sep 6, 2019. It seems to be available as of Linux 5.4.
Hence, I think we can now finally close this one.
Another possible tweak?
Perhaps pimd (or the kernel) could maintain a fastdrop/don't-care list (on a cache miss, add an eBPF drop rule, or use some other method to drop/ignore the frames) so that the entry counts as resolved, and then scan the drop list when adding a new group and remove the rule if necessary? That would manage the cache size pretty effectively, and it would be necessary when being flooded with multicast that currently has no output interfaces (e.g. another pimd on the LAN forwarding traffic).
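Something along these lines, say, keeping it in the daemon rather than eBPF to start with (a hypothetical sketch; names and sizes are made up):

```c
/* Hypothetical sketch of the fastdrop / don't-care list idea above (not
 * existing pimd code; a plain in-daemon table instead of eBPF).  On a
 * cache miss for a group with no receivers, remember the (S,G) and
 * install a drop-only route so the kernel stops sending upcalls; when a
 * receiver later joins, remove it from the list and build a real route. */
#include <stdint.h>

#define FASTDROP_MAX 1024

struct sg_key { uint32_t src, grp; };        /* (S,G), network byte order */

static struct sg_key fastdrop[FASTDROP_MAX];
static int fastdrop_len;

static int fastdrop_find(uint32_t src, uint32_t grp)
{
    for (int i = 0; i < fastdrop_len; i++)
        if (fastdrop[i].src == src && fastdrop[i].grp == grp)
            return i;
    return -1;
}

/* Called on a kernel cache miss when we have no outgoing interfaces. */
static void fastdrop_add(uint32_t src, uint32_t grp)
{
    if (fastdrop_len < FASTDROP_MAX && fastdrop_find(src, grp) < 0) {
        fastdrop[fastdrop_len++] = (struct sg_key){ src, grp };
        /* ... install an MFC entry with an empty oif set here, so the
         *     traffic is dropped in-kernel instead of re-triggering
         *     cache misses ... */
    }
}

/* Called when a join/membership report for the group arrives. */
static void fastdrop_del(uint32_t src, uint32_t grp)
{
    int i = fastdrop_find(src, grp);
    if (i >= 0) {
        fastdrop[i] = fastdrop[--fastdrop_len];
        /* ... replace the drop-only entry with a real forwarding route ... */
    }
}
```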
Hi @troglobit,
I'm working with pimd with the following topology:
Four networks: n1, n2, n3, n4. n1 is a management LAN. Multicast sources are on n2. n3 is an OpenVPN TAP tunnel. Multicast receivers are on n4.
RP1 is the rendezvous point, a Cisco router. It is connected to n1 and n2. SRC is the OpenVPN server, running Ubuntu 14.04. It is connected to n1, n2, and n3. PI1 is the client, running Raspbian. It is connected to n3 and n4.
SRC and PI1 are running pimd. RP1 is the Rendezvous Point.
Unicast routing protocol is OSPF and appears to be stable.
n1 is 10.153.1.0/24. n2 is 10.153.255.0/24. n3 is 10.153.251.0/24. n4 is 10.153.4.128/27. n5 is 192.168.88.0/24. n5 is the network that pi1 is connected to for connectivity, but it's not part of the multicast routing topology as the router that connects n5 to n1 is not pim capable, hence the tunnels. It's part of the OSPF area, but the connected route takes precedence over the learned one on PI1 and I've installed a pinhole route to the VPN server. It's a bit of a hack, but these aren't meant to run on n5 anyway, it's just a testing area.
pimd seems to be working properly, except that it takes an extremely long time (anywhere from 5 seconds to 60 seconds) to add the MFC entry to the kernel routing table and start forwarding packets. With two pimds between n4 and n2, that makes for extremely slow and arbitrary starts to multicast streams. Replacing SRC and PI1 with Cisco gear, the time is on the order of milliseconds (I haven't measured it precisely, but it's fast enough that I can't count it in my head and it feels "instant," which implies it's probably less than 100ms).
Everything works fine, so long as I'm willing to wait a long time for my stream to start. igmpproxy works fine and has response speeds similar to the Cisco gear (with occasional blips of up to a few seconds that the Ciscos don't experience), so this seems to be an issue with pimd specifically.
From the debug log, it looks like pimd is resolving everything to the correct interfaces and is building the correct route internally. It just takes too long to update the MFC. The receiving devices often time out before it happens and leave/join, which causes the whole process to start over, and can result in a stream taking many minutes to start.
This is currently a static RP setup with only RP1 as the RP for all the streams. The streams are all in 239.0.1.0/24.
pimd version 2.3.0.