troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License

pimd at large scale: 20 to 100+ multicast groups #152

Closed stduong closed 3 years ago

stduong commented 4 years ago

pimd version: 2.3.2
OS: FreeBSD 11.2

I am running pimd on my router and start listening from 0 until reaching 100 continuous streams, adding 20 listeners at a time. It seems that pimd has trouble updating the multicast forwarding table and keeping track of listener membership. I see multicast temporarily stop being forwarded to some listeners; when I inspect the multicast table, it shows that some (S,G) entries got removed during the multicast forwarding table refresh, which sometimes results in a PIM prune message being sent that seems to cut off the upstream stream.

IPv4 Multicast Forwarding Table with 20 listeners:

Origin      Group      Packets  In-Vif  Out-Vifs:Ttls
X.X.X.37    M.M.M.39   47745    1       2:1
X.X.X.49    M.M.M.47   115283   1       2:1
X.X.X.25    M.M.M.25   33436    1       2:1
X.X.X.13    M.M.M.25   42026    1       2:1
X.X.X.141   M.M.M.13   29772    1       2:1
X.X.X.189   M.M.M.21   57141    1       2:1
X.X.X.225   M.M.M.24   41690    1       2:1
X.X.X.61    M.M.M.48   45299    1       2:1
X.X.X.37    M.M.M.46   50422    1       2:1
X.X.X.61    M.M.M.41   32584    1       2:1
X.X.X.213   M.M.M.23   28296    1       2:1
X.X.X.25    M.M.M.45   33506    1       2:1
X.X.X.189   M.M.M.24   29403    1       2:1
X.X.X.49    M.M.M.40   27592    1       2:1
X.X.X.13    M.M.M.19   31261    1       2:1
X.X.X.85    M.M.M.43   29526    1       2:1
X.X.X.25    M.M.M.38   52282    1       2:1
X.X.X.141   M.M.M.37   55599    1       2:1
X.X.X.13    M.M.M.44   35924    1       2:1
X.X.X.177   M.M.M.20   43175    1       2:1

When scaling to 40 listeners, it adds up to 40 (S,G) entries:

Origin      Group      Packets  In-Vif  Out-Vifs:Ttls
X.X.X.201   M.M.M.11   7        1       2:1
X.X.X.49    M.M.M.47   0        1       2:1
X.X.X.37    M.M.M.39   9        1       2:1
X.X.X.153   M.M.M.6    15       1       2:1
X.X.X.85    M.M.M.22   7        1       2:1
X.X.X.25    M.M.M.65   7        1       2:1
X.X.X.25    M.M.M.25   9        1       2:1
X.X.X.25    M.M.M.21   13       1       2:1
X.X.X.141   M.M.M.1    13       1       2:1
X.X.X.141   M.M.M.13   0        1       2:1
X.X.X.189   M.M.M.21   0        1       2:1
X.X.X.13    M.M.M.25   10       1       2:1
X.X.X.153   M.M.M.8    13       1       2:1
X.X.X.25    M.M.M.4    7        1       2:1
X.X.X.61    M.M.M.48   10       1       2:1
X.X.X.177   M.M.M.12   5        1       2:1
X.X.X.225   M.M.M.24   37       1       2:1
X.X.X.97    M.M.M.23   13       1       2:1
X.X.X.189   M.M.M.10   9        1       2:1
X.X.X.141   M.M.M.14   14       1       2:1
X.X.X.37    M.M.M.46   10       1       2:1
X.X.X.13    M.M.M.13   7        1       2:1
X.X.X.141   M.M.M.5    7        1       2:1
X.X.X.61    M.M.M.41   30       1       2:1
X.X.X.73    M.M.M.49   9        1       2:1
X.X.X.141   M.M.M.7    14       1       2:1
X.X.X.213   M.M.M.23   35       1       2:1
X.X.X.165   M.M.M.2    10       1       2:1
X.X.X.85    M.M.M.50   9        1       2:1
X.X.X.25    M.M.M.45   0        1       2:1
X.X.X.189   M.M.M.24   0        1       2:1
X.X.X.49    M.M.M.40   0        1       2:1
X.X.X.13    M.M.M.19   0        1       2:1
X.X.X.85    M.M.M.43   0        1       2:1
X.X.X.25    M.M.M.38   0        1       2:1
X.X.X.25    M.M.M.10   8        1       2:1
X.X.X.165   M.M.M.9    7        1       2:1
X.X.X.141   M.M.M.37   43       1       2:1
X.X.X.177   M.M.M.20   0        1       2:1
X.X.X.13    M.M.M.44   44       1       2:1

But after a few seconds it drops to 30 entries before coming back to 40 again:

Origin      Group      Packets  In-Vif  Out-Vifs:Ttls
X.X.X.37    M.M.M.39   27427    1       2:1
X.X.X.49    M.M.M.47   64686    1       2:1
X.X.X.153   M.M.M.6    19361    1       2:1
X.X.X.85    M.M.M.22   17748    1       2:1
X.X.X.25    M.M.M.25   20436    1       2:1
X.X.X.141   M.M.M.1    18824    1       2:1
X.X.X.25    M.M.M.21   17749    1       2:1
X.X.X.13    M.M.M.25   26351    1       2:1
X.X.X.189   M.M.M.21   33649    1       2:1
X.X.X.153   M.M.M.8    17748    1       2:1
X.X.X.177   M.M.M.12   15843    1       2:1
X.X.X.61    M.M.M.48   26351    1       2:1
X.X.X.225   M.M.M.24   27196    1       2:1
X.X.X.97    M.M.M.23   16377    1       2:1
X.X.X.189   M.M.M.10   23663    1       2:1
X.X.X.37    M.M.M.46   27964    1       2:1
X.X.X.141   M.M.M.5    17748    1       2:1
X.X.X.13    M.M.M.13   20437    1       2:1
X.X.X.73    M.M.M.49   26274    1       2:1
X.X.X.61    M.M.M.41   20975    1       2:1
X.X.X.141   M.M.M.7    20436    1       2:1
X.X.X.213   M.M.M.23   25276    1       2:1
X.X.X.165   M.M.M.2    15842    1       2:1
X.X.X.85    M.M.M.50   27426    1       2:1
X.X.X.189   M.M.M.24   17749    1       2:1
X.X.X.49    M.M.M.40   24201    1       2:1
X.X.X.85    M.M.M.43   26889    1       2:1
X.X.X.25    M.M.M.10   17748    1       2:1
X.X.X.141   M.M.M.37   34415    1       2:1
X.X.X.13    M.M.M.44   33341    1       2:1

When running Wireshark to inspect PIM packets upstream, I can see PIM prunes for the 10 missing multicast groups.

[Wireshark capture screenshot]

From the upstream multicast traffic I see a dip in the graph.

What makes pimd prune some multicast groups when adding 20+ new listeners? Is there a limitation in pimd?

troglobit commented 4 years ago

Huh, no idea why that would happen really. Sounds like a bug :-(

We have quite a few (yet unreleased) fixes on master that could perhaps be interesting for you to test. The README has instructions on how to build from GIT, but let me know if you run into any problems.

stduong commented 4 years ago

Thanks for the suggestion. I recompiled the master branch and tested it out, but found a regression where the streams cannot stay stable because pimd misses some PIM HELLO messages from the upstream PIM router. The setup uses the default PIM config (PIM HELLO interval = 30 s, PIM router expiry = 105 s). I am currently running just 20 multicast groups. pimd seems stuck updating or reading the routing socket and misses PIM HELLO packets sent by the upstream PIM router, which eventually ends up deleting the PIM neighbor and all the MRT entries from that hop.

Here is the captured log showing the last PIM HELLO processed by pimd and the timeout of the PIM neighbor:

[log attachment]

Here is the packet capture on the PIM interface, where we can see that PIM HELLO packets are sent consistently every 30 s:

[packet capture attachment]

It seems from the log that pimd consistently fails the first routing socket read and retries:

Kernel busy, retrying (1/3) routing socket read in one sec ...

Do you know if I can tune the kernel to prevent that first failed read? Or is it normal behavior that you are seeing as well? I am running only 20 streams on a fairly powerful machine, so I doubt it is a resource limitation.
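For reference, here is a rough sketch of the kind of blocking retry-and-sleep pattern that the log line above suggests; the function and its structure are my guess for illustration, not the actual routesock.c code:

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Blocking retry: read the routing socket, and on failure sleep a full
 * second before trying again, up to three times.  During each sleep() the
 * daemon processes nothing else, which is how PIM HELLOs could be missed. */
static ssize_t read_route_reply(int fd, void *buf, size_t len)
{
	int attempt;

	for (attempt = 1; attempt <= 3; attempt++) {
		ssize_t n = read(fd, buf, len);

		if (n > 0)
			return n;
		if (n < 0 && errno != EAGAIN && errno != EINTR)
			return -1;

		fprintf(stderr, "Kernel busy, retrying (%d/3) routing socket read in one sec ...\n", attempt);
		sleep(1);
	}

	return -1;
}
```

If each failed lookup blocks the daemon like this for up to a few seconds, the delays could add up while many (S,G) entries are being refreshed.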

Thanks for your help.

troglobit commented 4 years ago

Hi, sorry for the late reply, been extremely busy with $DAYJOB for a couple of weeks :-/

Really looks like you've stumbled upon a bug. I've not tested the routing socket (*BSD backend) code in quite some time, so it's very likely to have some problems. For instance, I know FreeBSD changed the kernel behavior of raw sockets (it implicitly did some ntohl() operations on packet contents before FreeBSD 10 or 11), which confused a lot of applications. The pimd code is quite old and took such stuff into consideration, but with a newer FreeBSD this old pimd code may not be compatible.
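To give a rough idea of the kind of guard I mean, here is a sketch; the version cutoff and the exact fields are assumptions for illustration, not what pimd currently does:

```c
#include <sys/param.h>     /* __FreeBSD_version on FreeBSD */
#include <netinet/in.h>
#include <netinet/ip.h>

/* Sketch: fix up raw-socket IP header fields before using them. */
static void fixup_ip_header(struct ip *ip)
{
#if defined(__FreeBSD__) && __FreeBSD_version < 1100000
	/* Older FreeBSD handed these fields over already in host byte order. */
	(void)ip;
#else
	/* Newer kernels leave them in network byte order, so convert here. */
	ip->ip_len = ntohs(ip->ip_len);
	ip->ip_off = ntohs(ip->ip_off);
#endif
}
```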

Dunno how to do *BSD kernel tweaking, sorry :(

However, I hope to be able to circle back to pimd again real soon and then start up my (virtual) test lab with a few *BSD machines. Can't make any promises though.

troglobit commented 4 years ago

Could really use some help debugging this problem. Someone with a newer setup than FreeBSD 11.3 and some C coding skills ...

... My FreeBSD 11.3 setup causes a kernel crash, which seems to be a regression in the kernel, unrelated to pimd.

stduong commented 4 years ago

I did not try with version 11.3. What kind of error do you see?

troglobit commented 4 years ago

I had two FreeBSD 11.3 PIM routers; one of them had continuous kernel crashes somewhere in the PIM receive/processing path, followed by a reboot. It was only seen on the console so I have no logs unfortunately, but I noticed there were some regression fixes in the latest Subversion sources, which I assume were released in 12.x.

I started upgrading to 12.1 but ran out of both disk and time, so I've put this on hold for now. Anyone who can help out is very welcome to do so.

stormshield-damiend commented 3 years ago

Hi pimd dev,

I have an explanation for this:

On FreeBSD HEAD you can see this in the following functions: send_rtm_reply(), which builds the reply, calls rt_dispatch() at its end: https://github.com/freebsd/freebsd-src/blob/main/sys/net/rtsock.c#L1188

rt_dispatch() queues the packet to the routing socket using the netisr API: https://github.com/freebsd/freebsd-src/blob/main/sys/net/rtsock.c#L2090

This means that if you read too fast you will not get the reply.

The FreeBSD routed daemon in the base system uses a select() + timeout approach to correctly read the reply: https://github.com/freebsd/freebsd-src/blob/main/sbin/routed/main.c#L496
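For illustration, a minimal sketch of that select() + timeout approach (not my actual patch; the function name, parameters and return convention are just assumptions for the sketch):

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait up to timeout_ms for the netisr-queued reply to become readable,
 * then read it.  Returns >0 on success, 0 on timeout, -1 on error. */
static ssize_t wait_for_rtm_reply(int routing_socket, void *buf, size_t len, int timeout_ms)
{
	struct timeval tv = { .tv_sec  = timeout_ms / 1000,
			      .tv_usec = (timeout_ms % 1000) * 1000 };
	fd_set fds;
	int rc;

	FD_ZERO(&fds);
	FD_SET(routing_socket, &fds);

	rc = select(routing_socket + 1, &fds, NULL, NULL, &tv);
	if (rc <= 0)
		return rc;	/* 0: timed out, -1: select() error */

	return read(routing_socket, buf, len);
}
```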

If you are interested, I made a patch for pimd 2.3.2 which fixes the read + retry issue. The prune issue still occurs, but less frequently according to our still-running tests.

Damien

troglobit commented 3 years ago

Very interesting!

Thank you for taking the time to explain this, Damien. This could actually be the root cause of a lot of weird issues we've seen on FreeBSD. We're of course extremely interested in seeing your patch; whatever help we can get on testing/integration on the *BSDs is welcome, since our own time and experience is very limited.

BR
Joachim

stormshield-damiend commented 3 years ago

Hi,

Do you want a simple patch for the 2.3.2 version (we will need to discuss some aspects of this patch, like the retry policy)? Or do you prefer that I submit a pull request for 3.0.0, which I am currently building on FreeBSD 11.3 and will test later?

Damien

troglobit commented 3 years ago

I'd prefer the latest master branch, of course :)

I haven't seen your proposal yet, and we probably need to discuss the retry mechanism, a background timer update callback, or similar to handle the async nature of this without risking blocking the state machine on select()/poll(). So make a pull request and we can discuss from there.

stormshield-damiend commented 3 years ago

Hi,

OK, I will do this next week; I am busy with other tasks at the moment.

Damien

troglobit commented 3 years ago

No problem, sounds great! <3

stormshield-damiend commented 3 years ago

Hi,

We did some tests using the master branch, without and with the select() patch, on FreeBSD 11.3.

Without the patch:

With the patch:

Damien

stormshield-damiend commented 3 years ago

Hi,

I am busy at work at the moment; I will do a pull request on Friday morning. The patch will be done during my work time, so is it possible to reference my company name in the commit once the review is done?

I also identified a kernel crash on FreeBSD 11.x which, according to the code, also seems to be present in HEAD. It happens when the daemon is shut down and unregisters in the kernel using the MRT_DONE message: if packets are sent at the same time, there is a small race window where a packet may enter the multicast forwarding path, leading to an access of a hashtable that the MRT_DONE code is freeing. I will work on a kernel patch for this case and submit it to FreeBSD.

Damien

troglobit commented 3 years ago

@stormshield-damiend no problem, take your time! Of course you can reference your company in the commit message, as long as the commit message also explains what, and more importantly why, a change is made. If you want to credit $DAYJOB at the end of that, that's 100% fine. (Copyright attribution is another question, but you didn't ask about that specifically, so I'm steering clear of that.)

Aha, yeah, there have been some frustrating multicast routing instabilities with the FreeBSD kernel for a few releases now. If you've managed to find the root cause, that's amazing! Thank you <3

stormshield-damiend commented 3 years ago

Hi,

Pull request published, please review the retry and timeout logic.

Damien

troglobit commented 3 years ago

Just had a quick glance and one question: why the three-second select() timeout, and why not one second or five? This is part of my worry that we probably need a slightly bigger change, where this is a callback in the event loop, because otherwise this change will block the event loop and potentially miss deadlines for, e.g., the IGMP query interval.

stormshield-damiend commented 3 years ago

Hi,

I reused something similar to the previous code, which was 3 × 1 sec at max. But yes, 1 sec or less could be fine. If you want, I could look at the average time spent in select() in our performance test bed?

Damien

troglobit commented 3 years ago

If we don't integrate the fix into the main event loop (the other select(), in main.c), then we need to consider this a band-aid fix and try to reduce its impact on the overall timing. So if you could look into that, it would be very much appreciated!
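To illustrate what I mean by integrating it into the main event loop, here is a hypothetical, self-contained sketch; the socket and handler names are made up, not pimd's actual internals:

```c
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical names, declared extern just to keep the sketch compilable;
 * they stand in for pimd's real sockets and handlers. */
extern int pim_sock, routing_sock;
extern void accept_pim_packet(int fd);
extern void accept_rtm_reply(int fd);
extern void age_timers_and_send_queries(void);

static void event_loop(void)
{
	for (;;) {
		struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };	/* timer tick */
		fd_set rfds;
		int maxfd = (pim_sock > routing_sock ? pim_sock : routing_sock) + 1;

		FD_ZERO(&rfds);
		FD_SET(pim_sock, &rfds);
		FD_SET(routing_sock, &rfds);

		/* One select() watches everything, so a late routing socket
		 * reply cannot stall protocol input or timer-driven work. */
		if (select(maxfd, &rfds, NULL, NULL, &tv) > 0) {
			if (FD_ISSET(pim_sock, &rfds))
				accept_pim_packet(pim_sock);
			if (FD_ISSET(routing_sock, &rfds))
				accept_rtm_reply(routing_sock);
		}

		age_timers_and_send_queries();
	}
}
```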

stormshield-damiend commented 3 years ago

Hi,

Sure, I will look at this next week.

Damien

troglobit commented 3 years ago

Awesome, thank you! <3

stormshield-damiend commented 3 years ago

Hi,

We did some tests on our test bed; here is some information and some results:

The product used is on the high-end side.

We use a traffic generator (Spirent) with the following scenario:

With this scenario, one CPU is saturated; the others are around 50% average load.

I instrumented the routesock.c code to measure the average wait time under this load: average values are very small (under 1 ms), and the maximum measured was around 3 ms.

So I think we could lower the timeout to a very small value; what do you think of ~50-100 ms?

By the way, I made a FreeBSD kernel patch that fixes a nasty race (leading to a kernel crash) between the multicast packet forwarding path and the shutdown of the multicast daemon. It works fine in our test bed; I will submit it as a PR to FreeBSD for review.

Damien

troglobit commented 3 years ago

Great work! Let's give it a try with 100 ms, and maybe change to LOG_WARNING for the case when select() times out, just to catch any such cases. Devices with less oomph than yours (embedded ones) may run into this, and then it's better to have at least a hint of where to look. Maybe "Timeout waiting for reply from routing socket for ..." is a more human-friendly log message.
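Roughly something like this, using plain syslog(3) for the sketch rather than pimd's own logging wrapper, and leaving the "..." part (which request timed out) up to the implementation:

```c
#include <syslog.h>

/* Sketch only: warn when the routing socket reply times out. */
static void warn_rtm_timeout(void)
{
	syslog(LOG_WARNING, "Timeout waiting for reply from routing socket for ...");
}
```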

You can force push an update to your branch and the pull request will update automatically.

By the way, thank you for the hard work you've put in on this, much appreciated!

stormshield-damiend commented 3 years ago

Hi,

We did some tests with 100 ms as the timeout value and had no issues. The pull request is updated.

Damien

troglobit commented 3 years ago

Great stuff, merged! Thank you again for the fix and hard work! <3