Multicast routes don't always get their iif updated based on unicast routing changes

whariki commented 8 years ago

The network setup we have is:

S+Rn_.y = multicast source and receiver n, 
                    where .y is the least significant octet of the IPv4 address

Cisco_n = Cisco PIM Router
pimd_n = pimd PIM Router (pimd v2.3.2)

The RP is the router at the top of the picture.

              Cisco_1 --------------------  <BGP>
                 |                        |
S+R1_.165 -------| <ospf>                 |
                 |                     Cisco_2 ------ S+R2_.46
               Cisco_3                    |
                 |                        | <BGP>
S+R3_.117 -------| <ospf>                 |
                 |                        |
                 | <eth0>                 |
               pimd_4                     |
                 | <ndl0>                 |
                 |                        |
                 | <ospf>                 |
                 |                        |
                 | <ndl0>                 |
               pimd_5                     |
                 | <eth0>                 |
                 |                        |
                 | <ospf>                 |
S+R4_.101 -------|------- Cisco_6         |
                 |           |            |
                 |           | <ospf>     |
                 |           |            |
               pimd_7      Cisco_8        |
                 |           |            |
           <ospf>|        S+R5_.85        |
                 |                        |
               pimd_9                     |
                 |                        |
                 |<ospf>                  |
                 |                        |
              Cisco_10 --------------------

Reachability via OSPF is preferred.
The BGP link between Cisco_10 and Cisco_2 is preferred for all pimd_n routers.
The Cisco_1 router is the only router that prefers the direct BGP route to the S+R2_.46 network.

We have intermittent failure when the link between pimd routers is restored. i.e. If the link pimd_4 and pimd5 has been down long enough for the unicast routing to converge when it is restored the multicast routing doesn't always converge (fails approx. 50% of the time). Specifically the multicast receiver S+R3.117 might not see traffic from sources S+R4.101 and / or S+R5.85 and / or source S+R2_.46. The combination of which sources are lost from which receiver varies in the current setup.

Tracking through debug logs it appears that the change in the unicast routing is not always picked up.

The attached logs are from pimd_4 and pimd_5.

CoolSq_10.200.55.119_20160615-160927_pimd_4.txt SqCool_10.200.55.106_20160615-161131_pimd_5.txt

Events of interest are on the pimd_4 router log (I've bolded the timestamps of the symptoms):

23:13:58 (approx) link is restored
23:13:59.034 : first report of multicast packet being received over the wrong iif after the link restored.
23:13:59.232 : PIM neigbour restored with pimd_5
23:13:59.471 : first join/prune received from pimd_5
23:14:20.423 : it looks like the unicast_routing_timer has triggered and the various group routes are being tested. Both the sources .85 and .101 return a rpf of vif 1, but you can see by comparing the previous and next summaries that .85 has changed Incoming : I.. to Incoming : I.. while .101 remains Incoming : I..
23:14:40 : it looks like the source .101 has the incoming interface updated correctly but there are still subsequent reports of the multcast coming in on the wrong interface 23:14:57.732 Wrong iif: src 10.200.55.101, dst 239.0.4.1, iif 1

It is as if the "change_interfaces" call on line 641 of timer.c isn't actually causing the change to be applied to the group, maybe it needs the force change flag set?

troglobit commented 8 years ago

Definitely something fishy going on there, well spotted! Not sure what's going on, but I had a few pull requests regarding PIM Join handling pending that now have been merged to the GIT master. In case you want to try it out ...

I'm setting up a CORE network now to see if I can reproduce your problem, can't make any promises though. So anything you can dig up or patch/try yourself is most welcome!

masaraksh79 commented 8 years ago

Hey. Have now tried the code changes from git-master on the system that @whariki has mentioned and found out that unfortunately these didn't fix the issue.

troglobit commented 8 years ago

OK, thanks for the followup @masaraksh79, sorry it didn't help!

This is an awesome bug report, but I'll have to be frank with you guys. First of all I'm on vacation, second I'm maintaining pimd as a bit of a side project, or hobby. The whole project relies on features and bug fixes provided by the community. This repository at GitHub is more of a focal point for that development.

Anyway, I'll try to reproduce your problem here and start reading up on the RFC's to figure out where the problem lies, but I wouldn't rely on me for a solution to the problem you've found. It seems to me though, that you're at a very good point yourself to figure it out and propose a solution.

whariki commented 8 years ago

That's understood. Keen to contribute. Will keep working on it my end and post any progress. On 24/06/2016 10:49 am, "Joachim Nilsson" notifications@github.com wrote:

OK, thanks for the followup @masaraksh79 https://github.com/masaraksh79, sorry it didn't help!

This is an awesome bug report, but I'll have to be frank with you guys. First of all I'm on vacation, second I'm maintaining pimd as a bit of a side project, or hobby. The whole project relies on features and bug fixes provided by the community. This repository at GitHub is more of a focal point for that development.

Anyway, I'll try to reproduce your problem here and start reading up on the RFC's to figure out where the problem lies, but I wouldn't rely on me for a solution to the problem you've found. It seems to me though, that you're at a very good point yourself to figure it out and propose a solution.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/troglobit/pimd/issues/79#issuecomment-228208971, or mute the thread https://github.com/notifications/unsubscribe/AEoj9fBTL2mzRn1UkUhIoWo0AHGgVgwjks5qOw1sgaJpZM4I8aT8 .

masaraksh79 commented 7 years ago

@troglobit seems that we've not progressed much on this issue ourselves since June. It looks like our firm is willing to pay fixing the problem in this project, mostly because it own resource is constrained to other activities...

idismmxiv commented 7 years ago

Tested this a bit and at least in one test environment following helped.

diff --git a/timer.c b/timer.c
index de9a11e..e69129c 100644
--- a/timer.c
+++ b/timer.c
@@ -617,8 +617,8 @@ void age_routes(void)
                            }

                            /* iif info found */
-                           if ((srcentry_save.incoming != mrt_srcs->incoming) ||
-                               (srcentry_save.upstream != mrt_srcs->upstream)) {
+                           if ((srcentry_save.incoming != mrt_srcs->source->incoming) ||
+                               (srcentry_save.upstream != mrt_srcs->source->upstream)) {
                                /* Route change has occur */
                                update_src_iif = TRUE;
                                mrt_srcs->incoming = mrt_srcs->source->incoming;
@@ -643,7 +643,7 @@ void age_routes(void)
                                          mrt_srcs->joined_oifs,
                                          mrt_srcs->pruned_oifs,
                                          mrt_srcs->leaves,
-                                         mrt_srcs->asserted_oifs, 0);
+                                         mrt_srcs->asserted_oifs, MFC_UPDATE_FORCE);

                    check_spt_threshold(mrt_srcs);

troglobit commented 7 years ago

So, what do you guys think, should we merge @idismmxiv patch to the next release?

troglobit commented 7 years ago

Mika's (@idismmxiv) patch is now on a separate branch, ready for merge. Pending approval from the community, i.e. you guys.

troglobit commented 7 years ago

Hey, @whariki and @masaraksh79 maybe you've moved on to other things, if not it would be really appreciated if you could weigh in on the patch by @idismmxiv before I merge it to the upcoming pimd v3.0.

Thank you in advance! :) /Joachim

masaraksh79 commented 7 years ago

@troglobit, sorry for not replying with feedback to the forum on @idismmxiv patch. Tried this one last year but the test system didn't show an improvement, moreover, it had created more "noise". I'll clarify the result detail with my QA and would attempt to provide a better comment. Bottom line - it didn't solve my issue.

troglobit commented 7 years ago

@masaraksh79 No problem, thank you for checking back in! :smiley:

OK, that means I'll hold off on merging the patch (speculatively) until we know more.

idismmxiv commented 7 years ago

@masaraksh79. Just want to check again, regarding my patch proposal. You said that you tested similar change year ago, but did you notice the another change I made (not MFC_UPDATE_FORCE flag usage):

-                           if ((srcentry_save.incoming != mrt_srcs->incoming) ||
-                               (srcentry_save.upstream != mrt_srcs->upstream)) {
+                           if ((srcentry_save.incoming != mrt_srcs->source->incoming) ||
+                               (srcentry_save.upstream != mrt_srcs->source->upstream)

It looks to me like there is clear defect as in line 618 and 619. In line 608 mrt_srcs->source is passed as param, but in comparison in lines 618 and 619 in made against mrt_srcs->incoming.

masaraksh79 commented 7 years ago

The last test I did with this patch was in November last year. Unfortunately, did not have time to get back to it yet. Cannot say whether it has improved or made things worse but it clearly didn't solve our setup issues. Nevertheless, have no negative feedback about this code change. Cheers.

masaraksh79 commented 7 years ago

Ah, BTW, a comment I've received from the customer was: issue was "significantly mitigated" by configuring the RP to be in vicinity of the network traffic multicast source (I recall they have been using devices with PIM version older than 2.3.2; definitely -- without the last patch)

troglobit commented 7 years ago

@idismmxiv @zenvirta @masaraksh79 ... sounds to me like we can safely merge 0ba2966 to master for the 3.0 release then. Yet keeping this issue open?

masaraksh79 commented 7 years ago

from my point of view, there is no issue with the patch. Would appreciate if you do keep the issue open, yes.

troglobit commented 6 years ago

Merged. Thank you for taking the time to answer :)

idismmxiv commented 6 years ago

Thanks masaraksh79. Looks like there are still some issues in change_interfaces function as parameters are not used as they should. I'll try to look into that some day.

searlernz commented 6 years ago

I have been trying to investigate this further using 2.3.2 build and some additional debug logging in the age_routes function to see if there is anything suspicious in the region noted by @idismmxiv. Still haven't found anything but I have the following information about a similar breakage:

1) R2_46 cannot see multicast from either R4_101 or R3_117. 2) pimd_5 is sending encapsulated null register messages to Cisco_10 (the RP) via pimd_4 and receiving register_stop messages via pimd_7 (asymmetric routing) 3) Multicast packets from R4_101 are arriving at pimd_9 from pimd_7 but are not being sent to Cisco_10 4) Cisco_10 is sending JOIN messages to pimd_9 for source R4_101 5) pimd_9 is indicating that there is a join on the interface to Cisco_10 and that interface is outbound 6) pimd_9 is indicating SPT CACHE SG for the Source 101 S, G flags 7) pimd_9 is logging receipt of the JOIN messages and routing status indicates that the interface to pimd_7 is Input and that it has received prune on the interface to pimd_7 8) Cisco_10 indicates that the RPF for source 101 is pimd_7 and the forwarding is to Cisco_2 9) All of the pimd elements have traceroute for Cisco_10(RP) via Cisco_2 10) pimd_5 and pimd_7 are exchanging ASSERTs about source R4_101 and pimd_7 is indicating asserted state (the loser).

Would unencapsulated multicast packets from pimd_7 be passing through the pim daemon or would they be expected to be handled entirely within the kernel? It looks like router should be passing packets from pimd_7 to Cisco_10 but it is dropping them for some reason. NB. All of the above points referring to R4_101 also apply to R3_117 but I have a little less visibility into the packet flow for that source. Attached syslog and pimd -r output for pimd_9 in the broken state (changed IP addresses to match the naming in the diagram for the issue) pimd_9_syslog.txt pimd_status.txt

Any ideas?

idismmxiv commented 6 years ago

There was interesting log entry in every 3.30minutes: "S,G entry changed (timeout) causing interface switch at iif 1". I suppose you have added that entry to age_routes, but as it keeps coming over and over again, change does not make affect for some reason. Could you add diff or something to show what are your changes and in what places they are?

searlernz commented 6 years ago

Changes here

timer.c.txt

searlernz commented 6 years ago

I had a slight concern that the "S,G entry changed (timeout) ...." messages were caused by the link break/re-make actions taken to cause the issue but I can confirm that they are still occurring several days later with no network linking events on the system and the multicast routing still hasn't recovered. I suspect that they could be a fertile area for further investigation.

idismmxiv commented 6 years ago

I tried to reproduce the problem but as I do not have Cisco box for RP I had to use pimd for RP as well. Unfortunately there seems to be some issues in this kind of scenario in pimd RP functionality (but that is an another story).

Anyway by doing following <two changes, link break/remake looked like working.

First it still looks to me that patch #110 is mandatory.
Second, in timer.c function check_spt_threshold removes idle kernel cache entry but leaves? pimd state to previous state. So, I suggest that you try to remove line 260 delete_single_kernel_cache(mrt, kc);

you can also use command "ip mroute show all" to see how does multicast routing table look like from kernel point of view. If that is very different compared to pimd view of active routes, then situation is not good.

searlernz commented 6 years ago

Thanks for the update, that looks like a promising line of inquiry (and small enough that my fears of causing regressions elsewhere are reduced)

searlernz commented 6 years ago

Unfortunately, the devices running pimd have the truncated busybox implementation of ip so they don't support the ip mroute show all command. This is the output for cat /proc/net/ip_mr_cache (little-endian)

Group        Origin        Iif      Pkts    Bytes         Wrong    Oifs
010400EF 2EFBC80A 0          1      100            0           1:1  
010400EF 7537C80A  0      59447  5944700    59442
010400EF 6537C80A  0     118895 11889500  118884

Group 239.0.4.1 sources 10.200.55.101 and 10.200.55.117 (R3_117 and R4_101) have no Oif so I think that you are correct in suspecting a mismatch between the kernel and pimd views that is not healing.

Will try your proposed fixes next. Thanks

searlernz commented 6 years ago

Also worth noting that the Wrong count is going up with received packets and the pimd_7 device has Oif pointing at pimd_9 for the same S,G combinations so everything points at failing to update kernel cache correctly.

searlernz commented 6 years ago

Initial testing showing promise. Not exactly fast re-convergence (a few minutes) but so far not managed to get broken forever.

searlernz commented 6 years ago

I think that testing is now complete and successful for this specific use-case (I added a debug log at line 260 in place of the deletion and that path is definitely executed during testing so the change in behaviour appears to be necessary for the fix. The patch #110 log is not appearing during testing but I did some due diligence review of the code and agree that the code makes no sense at all without it. (It looks like a cut-paste and modify error based on line 412/413 so the pattern should be save existing route (lines 646,647) call set_incoming to determine new incoming/upstream (line 648) test if set_incoming has changed things and then update (lines 662,663 - pre patch - this is checking the wrong field for changes)

NB Lines 666 and 667 almost certainly hide the logic error and cause the correct behaviour to occur on the next pass through age_routes so it is arguably not necessary to have the patch in order to fix this issue.

For our use-case of very, very stable sparse group membership - basically every element joins once and then stays a member until it reboots, this fix appears to work and is only executed on network topology change and appears to have no side-effects. Not sure what regression tests would be needed to verify other use-cases where members/sources are more dynamic (the intent of the original code appears to be to clean up kernel caching for sources that are no longer in the group)

troglobit commented 6 years ago

This is great news, thank you for taking the time to debug this @searlernz ! Do you think you could whip up a patch or a pull-request to summarize the findings? Most of us coder types find it much easier to discuss around a piece of code rather than line numbers ... :)

searlernz commented 6 years ago

multicast_route_iif_update_fix.txt@@ -617,8 +616,8 @@ void age_routes(void)

The patch file contains three items (diff against 2.3.2) : 1) #110 fix which has already been incorporated

--- a/timer.c
+++ b/timer.c
@@ -617,8 +616,8 @@ void age_routes(void)
                }

                /* iif info found */
-               if ((srcentry_save.incoming != mrt_srcs->incoming) ||
-               (srcentry_save.upstream != mrt_srcs->upstream)) {
+               if ((srcentry_save.incoming != mrt_srcs->source->incoming) ||
+               (srcentry_save.upstream != mrt_srcs->source->upstream)) {
                /* Route change has occur */
                update_src_iif = TRUE;
                mrt_srcs->incoming = mrt_srcs->source->incoming;

@@ -643,7 +642,7 @@ void age_routes(void)
                      mrt_srcs->joined_oifs,
                      mrt_srcs->pruned_oifs,
                      mrt_srcs->leaves,
-                     mrt_srcs->asserted_oifs, 0);
+                     mrt_srcs->asserted_oifs, MFC_UPDATE_FORCE);

            check_spt_threshold(mrt_srcs);

NB. I am certain that the change to the conditional (first change block) corrects a logic error but would not change the behaviour of pimd (possibly apart from deferring acting on a route change for 20 seconds). The second change block forces route update even if the route appears not to have changed so that seems like it is important.

2) Fix to stop kernel cache routes getting out of synch with pimd routes with a debug log to confirm that the path is executed when attempting to reproduce issue

--- a/timer.c
+++ b/timer.c
@@ -257,10 +257,9 @@ static void check_spt_threshold(mrtentry_t *mrt)

    status = k_get_sg_cnt(udp_socket, kc->source, kc->group, &kc->sg_count);
    if (status || prev_bytecnt == kc->sg_count.bytecnt) {
-       /* Either (for whatever reason) there is no such routing
-        * entry, or that particular (S,G) was idle.  Delete the
-        * routing entry from the kernel. */
-       delete_single_kernel_cache(mrt, kc);
+      logit(LOG_DEBUG, 0, "Bypassing kernel cache deletion for S,G route %s,%s - idle",
+            inet_fmt(kc->source, s1, sizeof(s1)),
+            inet_fmt(kc->group, s2, sizeof(s2)));
        continue;
    }

3) A small modification to the pimd -r output to display the RP candidate table (otherwise the RP information is only shown for active groups which makes failover troubleshooting a bit harder than it needs to be. NB. Not required for the fix but I found it helpful

--- a/debug.c
+++ b/debug.c
@@ -272,6 +272,7 @@ void fdump(int i __attribute__((unused)))
     if (fp) {
    dump_vifs(fp);
    dump_pim_mrt(fp);
+   dump_rp_set(fp);
    fclose(fp);
     }
 }

searlernz commented 6 years ago

Any idea when this change might be incorporated in a formal release?

troglobit commented 6 years ago

@searlernz If possible, I'd like to first get some peer review of 2. from @idismmxiv, but a v2.4.0 release is long overdue.

idismmxiv commented 6 years ago

I comment only step 2 here as one is already part of master and 3 is just extra debugging, As said earlier, given change seems to fix the problem described here. Also this looks fairly safe change.

Only thing I am wondering here is this statuscomparison in if statement. In what circumstances status can be non zero, and what would happen in such scenario? At least I don not know.

troglobit commented 6 years ago

@idismmxiv status is non-zero only if we fail to read the SG stats from the kernel. Like the comment says, "Either (for whatever reason) there is no such routing entry, or that particular (S,G) was idle ..."

But if there is no such routing entry there's no need to delete it, right?

The latter case, when prev_bytecnt == kc->sg_count.bytecnt, there has been no activity (idle), and that was reason enough to remove it. Which seems bogus without some other cleanup mechanism to also signal neighbor routers that we no longer want the S,G ... but now I'm only speculating.

idismmxiv commented 6 years ago

@troglobit, you are right about no need to remove unexisting route.

But there might be several reasons why given mrt has been idle. Sender has not sent anything for 20 secs or route is broken temporarily. Receiver still might want to receive that stream if any data happens to become available. So I think its up to receiver to tell us when to quit forwarding.

I think following is what actually happened in this issue. Route was lost and it took over 20 sec to recover. That caused given MRT entry to be removed from kernel. When next packet entered to kernel, it caused cache__miss. Put incoming packet come from different interface than before, and it was handled in somehow unwanted manner and that caused multicast stream to never recover.

Now after proposed change, mrt entry is not removed from kernel and rerouted multicast packet does not cause cache_miss, but wrong_iif event, and that code path enables routing recovery.

So I expect there to be several different ways to fix this, but I do not know what would be the right one.

troglobit commented 6 years ago

@idismmxiv Agreed, that explanation sounds very plausible. I guess we should start by applying this patch, to at least let routing recover.

troglobit commented 6 years ago

Patch 2. merged in 7ef29de, with extra comment referencing this issue.

Hopefully some brave soul steps up to investigate this in more detail and can propose a better fix, or address the long reconfiguration time.

masaraksh79 commented 6 years ago

@troglobit ,FYI, our customer has rolled out radio firmware with kernel route fix in 2.3.2 and they're quite satisfied of its performance on their larger system, from the feedback I've received today. Good stuff!

PS. @searlernz

troglobit commented 6 years ago

That's great news, @masaraksh79, thank you so much for the update! :smiley::+1:

troglobit commented 5 years ago

@masaraksh79 + @searlernz is there anything blocking this issue, or can we close it?

masaraksh79 commented 5 years ago

Reckon bug #79 can be closed.

troglobit commented 5 years ago

@masaraksh79 Thanks! :-)

troglobit / pimd

Multicast routes don't always get their iif updated based on unicast routing changes #79