troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License

pimd sometimes doesn't recover after link drops and comes back #183

Open masaraksh79 opened 3 years ago

masaraksh79 commented 3 years ago

Running v2.3.2 patched (with the fixes for #79 and #137), we are experiencing an issue with pimd not recovering after a radio link goes down (the radio link is between Radio1 and Radio2). It does not happen every time, but it definitely happens after a few link downs and ups.

We would really appreciate any clue towards a solution, since this issue has a noticeable effect on our multicast. Our network looks like this (pimd is only running on Radio1 and Radio2):

[Image: Screen Shot 2020-12-17 at 14 53 35]

Note: in logs our comments are inside # lines for easier reading.

We have captured logs at the highest log level with pimd in both scenarios: first when it does not recover, pimd-no-recover.txt (it recovers only if pimd is restarted), and second when it does recover, pimd-recover.txt. At the beginning of each log we added timestamps of when the link was put down and up (+/- a few seconds).

We have noticed one message which appears in the non-recover scenario and does not appear when pimd manages to recover:

...
00:10:54.043 delete_mrtentry_all_kernel_cache: SG
00:10:54.044 Removed MFC entry src 10.200.55.101, grp 239.0.4.1
...
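For anyone who wants to repeat this comparison on their own captures, here is a minimal sketch of one way to list messages that occur in one log but never in the other. It is not how the logs above were compared. It assumes each log line has the shape "HH:MM:SS.mmm message" as in the excerpt, and that comment lines start with "#". The function names are hypothetical, and masking IP addresses is just one possible normalization.

```python
import re

# Assumed line shape, based on the excerpt: "HH:MM:SS.mmm message".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3}\s+")
# Mask IPv4 addresses so the same message for different hosts compares equal.
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def normalize(line):
    """Strip the leading timestamp and mask addresses, keeping the message shape."""
    message = TIMESTAMP.sub("", line.strip())
    return IPV4.sub("<ip>", message)


def message_shapes(lines):
    """Return the set of distinct normalized messages, skipping blanks and '#' comments."""
    return {
        normalize(line)
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def only_in_first(first_lines, second_lines):
    """Messages seen in the first log but never in the second, sorted for stable output."""
    return sorted(message_shapes(first_lines) - message_shapes(second_lines))


def compare_files(first_path, second_path):
    """Print every message shape that appears only in the first log file."""
    with open(first_path) as first, open(second_path) as second:
        for message in only_in_first(first, second):
            print(message)
```

Calling compare_files("pimd-no-recover.txt", "pimd-recover.txt") would print the message shapes unique to the failing run. Because only distinct shapes are compared, differences in how often a message repeats are ignored.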

In a good state, pimd status on each side of the link is as follows:

pimd-logs.txt

Please let me know if you require any additional info. We have a setup where the problem is quite reproducible, so we can try out anything you suggest.

Thanks in advance!

Best regards, Yakir

troglobit commented 3 years ago

I hope you understand that this is incredibly hard for me to help you with. Working on this project has been a hobby for the last 10+ years, nobody pays me, and even if they did I don't have the time anyway since $DAYJOB takes up 110% of my waking time.

The only thing I can do is ask you to please test the latest GIT master branch and see if that works better.

A few years back I got patches to support point-to-point links from a company. Some of them have been merged, look for "Ventus" in the GIT log. Here's the rest, from an old git stash, which probably doesn't apply cleanly on anything, but should give an idea of what's left: ventus-point-to-point_patch.txt

It would be great if you could help test both the latest master and this remaining patch set. That would greatly increase the chances of having a new release out during my time off over the holiday season.

masaraksh79 commented 3 years ago

Fully understand, we're in similar boats in this sense heh, we shall try out the latest. I'll update the ticket once done. Cheerios!

troglobit commented 3 years ago

Thank you! <3

masaraksh79 commented 3 years ago

While we're looking at the patches, we are trying to fire all cannons. Dear @troglobit or other devs involved in pimd, we are looking for support in our attempt to resolve this issue. My current employer has hired a Linux dev, who has no past experience with PIM or this project, to debug this, and is willing to pay for consultancy from people who have relevant expertise. Please contact me on my work email yakir.matusovsky@mimomax.com if you're interested in discussing this short-term opportunity. Cheers