troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License

Linux kernel warns: 'mroute: pending queue full' -- pimd not responding? #44

Closed: greenpau closed this issue 9 years ago

greenpau commented 9 years ago

Receiving the following log messages:

May 19 20:02:24 ip-192-168-16-66 kernel: mroute: pending queue full, dropping entries
May 19 20:02:24 ip-192-168-16-66 kernel: mroute: pending queue full, dropping entries
May 19 20:02:30 ip-192-168-16-66 kernel: net_ratelimit: 343 callbacks suppressed
May 19 20:02:30 ip-192-168-16-66 kernel: mroute: pending queue full, dropping entries
...
May 19 20:02:30 ip-192-168-16-66 kernel: mroute: pending queue full, dropping entries
May 19 20:02:36 ip-192-168-16-66 kernel: net_ratelimit: 118 callbacks suppressed
May 19 20:02:36 ip-192-168-16-66 kernel: mroute: pending queue full, dropping entries

Opened this issue to track the progress.

troglobit commented 9 years ago

Hmm, those are kernel messages. Looks like pimd for some reason does not respond to unknown (new) routes? I've never seen this before, sorry!

Maybe this can happen if there's a lot (!) of traffic and pimd is busy with something else, or is SIGSTOP'ed?

greenpau commented 9 years ago

@troglobit: or is SIGSTOP'ed?

Nope :smile:

@troglobit: there's a lot (!) of traffic and

it is relative.

@troglobit: pimd is busy with something else

more likely

will keep poking around ...

greenpau commented 9 years ago

In 3 seconds I received 2,565,451 bytes, or about 2.5 Mbytes (179918652 - 177353201), on the weave interface:

[root@ip-192-168-16-66 ~]# date;cat /proc/net/ip_mr_vif
Thu May 21 13:46:32 UTC 2015
Interface      BytesIn  PktsIn  BytesOut PktsOut Flags Local    Remote
 2 weave      177353201  235203         0       0 00000 4201010A 00000000
 3 tun0              0       0         0       0 00000 0B00FE0A 4D00FE0A
 4 tun1              0       0         0       0 00000 0C00FE0A 6E00FE0A
 5 tun2              0       0         0       0 00000 0D00FE0A 3B00FE0A
 6 pimreg            0       0  177353201  235203 00004 4201010A 00000000
[root@ip-192-168-16-66 ~]# 
[root@ip-192-168-16-66 ~]# 
[root@ip-192-168-16-66 ~]# date;cat /proc/net/ip_mr_vif
Thu May 21 13:46:35 UTC 2015
Interface      BytesIn  PktsIn  BytesOut PktsOut Flags Local    Remote
 2 weave      179918652  239526         0       0 00000 4201010A 00000000
 3 tun0              0       0         0       0 00000 0B00FE0A 4D00FE0A
 4 tun1              0       0         0       0 00000 0C00FE0A 6E00FE0A
 5 tun2              0       0         0       0 00000 0D00FE0A 3B00FE0A
 6 pimreg            0       0  179918652  239526 00004 4201010A 00000000
[root@ip-192-168-16-66 ~]# 

and still getting:

May 21 13:51:25 ip-192-168-16-66 kernel: mroute: pending queue full, dropping entries
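
For reference, the same per-vif rate can be computed by sampling /proc/net/ip_mr_vif twice, as in the 2.5 Mbytes over 3 seconds figure above. The helper below is hypothetical and not part of pimd; it assumes the column layout shown in the transcript and that the set of vifs stays the same between samples.

/* Hypothetical helper: print per-vif input byte rate from two snapshots
 * of /proc/net/ip_mr_vif, taken a few seconds apart. */
#include <stdio.h>
#include <unistd.h>

#define MAX_VIFS 32

struct vif_sample {
    char name[32];
    unsigned long long bytes_in;
};

/* Read one snapshot of /proc/net/ip_mr_vif; returns the number of vifs seen. */
static int sample_vifs(struct vif_sample *v, int max)
{
    FILE *fp = fopen("/proc/net/ip_mr_vif", "r");
    char line[256];
    int n = 0;

    if (!fp)
        return -1;

    if (!fgets(line, sizeof(line), fp)) {        /* skip the header line */
        fclose(fp);
        return 0;
    }

    while (n < max && fgets(line, sizeof(line), fp)) {
        int idx;
        unsigned long long pkts;

        /* Columns: index, name, BytesIn, PktsIn, ... */
        if (sscanf(line, "%d %31s %llu %llu", &idx, v[n].name,
                   &v[n].bytes_in, &pkts) == 4)
            n++;
    }
    fclose(fp);

    return n;
}

int main(void)
{
    struct vif_sample a[MAX_VIFS], b[MAX_VIFS];
    const int interval = 3;                      /* seconds between samples */
    int i, n = sample_vifs(a, MAX_VIFS);

    sleep(interval);
    if (n <= 0 || sample_vifs(b, MAX_VIFS) != n)
        return 1;

    for (i = 0; i < n; i++)
        printf("%-8s %llu bytes in %d s (%.2f Mbyte/s)\n", a[i].name,
               b[i].bytes_in - a[i].bytes_in, interval,
               (b[i].bytes_in - a[i].bytes_in) / (1e6 * interval));

    return 0;
}
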
troglobit commented 9 years ago

The only conclusion I can draw from this is that you must be receiving a lot of unknown multicast on the weave interface.

The warning message appears when the kernel tries to notify pimd (or any other multicast routing daemon) of new potential multicast to route. If pimd has a "remote request" for that group, it adds it to the kernel multicast routing table. However, I'm not sure what should happen if there are no known receivers for that group ... could be a bug, I actually don't know.
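
The "pending queue" in the kernel warning is the list of unresolved (S,G) entries the kernel holds while waiting for the routing daemon to answer such a notification. Below is a minimal sketch, not pimd's actual code, of how a multicast routing daemon receives these cache-miss upcalls, assuming the Linux <linux/mroute.h> API; it needs root, and MRT_INIT will fail if pimd (or another daemon) already owns the multicast routing socket.

#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <sys/socket.h>
#include <linux/mroute.h>   /* MRT_INIT, IGMPMSG_NOCACHE, struct igmpmsg */

int main(void)
{
    unsigned char buf[2048];
    int v = 1;
    int mrt = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);

    /* Register as the kernel's multicast routing daemon. */
    if (mrt < 0 || setsockopt(mrt, IPPROTO_IP, MRT_INIT, &v, sizeof(v)) < 0)
        return 1;

    for (;;) {
        ssize_t len = recv(mrt, buf, sizeof(buf), 0);
        struct ip *ip = (struct ip *)buf;
        struct igmpmsg *im = (struct igmpmsg *)buf;

        if (len < (ssize_t)sizeof(struct igmpmsg))
            continue;

        /* Kernel upcalls are flagged with IP protocol 0; real IGMP is 2. */
        if (ip->ip_p == 0 && im->im_msgtype == IGMPMSG_NOCACHE) {
            char src[INET_ADDRSTRLEN], dst[INET_ADDRSTRLEN];

            inet_ntop(AF_INET, &im->im_src, src, sizeof(src));
            inet_ntop(AF_INET, &im->im_dst, dst, sizeof(dst));
            printf("Cache miss, src %s, dst %s, iif %u\n", src, dst, im->im_vif);

            /* A real daemon now installs an MFC entry (MRT_ADD_MFC) or
             * decides to drop the flow; until it answers, unresolved
             * entries accumulate in the kernel's pending queue, which is
             * the queue the log message says is full. */
        }
    }
}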

troglobit commented 9 years ago

Looking at the code, route.c:process_cache_miss(), I noticed that we have a debug message for this particular mechanism!

Try starting the pimd daemon process with the --debug=cache command line option. You should see lines like the following in the system logfile(s):

18:48:42.618 Cache miss, src 192.168.1.131, dst 239.255.255.250, iif 0

With that we could at least try to ascertain which group/groups are causing this problem for you.

greenpau commented 9 years ago

@troglobit: However, I'm not sure what should happen if there are no known receivers for that group

:+1:

That is what is happening. The drawback of the weave overlay network is that hosts receive unsolicited multicast.

I will enable --debug=cache and report back my findings.

troglobit commented 9 years ago

Hmm, didn't know what weave was before ... isn't there something else, with better filtering, you could use altogether? Like GRE or VXLAN?

greenpau commented 9 years ago

@troglobit: isn't there something else, with better filtering, you could use altogether? Like GRE or VXLAN?

I tend to agree with you. However, we are entering the era of Docker containers, so a lot of people are trying to create overlay networks for Docker environments.

After enabling --debug=cache I can now see the misses:

May 21 20:17:55 ip-192-168-16-66 pimd[25981]: 20:17:55.940 Cache miss, src 10.1.1.2, dst 239.1.0.1, iif 2
May 21 20:17:55 ip-192-168-16-66 pimd[25981]: 20:17:55.940 create group entry, group 239.1.0.1

weave acts like a peer in PIM Dense mode.

greenpau commented 9 years ago

@troglobit , I am reopening this to ask the following question:

Jun 04 18:16:52 ip-192-168-16-66 pimd[7399]: 18:16:52.642 Cache miss, src 10.1.1.1, dst 239.24.0.70, iif 7
Jun 04 18:16:21 ip-192-168-16-66 pimd[7399]: 18:16:21.120 Removed MFC entry src 10.1.1.1, grp 239.24.0.70
Jun 04 18:12:52 ip-192-168-16-66 pimd[7399]: 18:12:52.643 create group entry, group 239.24.0.70

I have the above three entries in my logs; they are listed in reverse chronological order.

First, the MFC entry gets created. Then, about four minutes later, it is removed. Finally, I see a "cache miss".

Does the "cache miss" mean that there is no entry in the MFC?

If so, how can I extend the lifetime of the entry in the MFC cache?

greenpau commented 9 years ago

The MFC entries disappear because no multicast listeners are detected. In my case this happened because a Windows node failed to respond to the General Membership Query with Membership Reports.

pimd created the entry after receiving an initial, unsolicited Membership Report.
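
One way to confirm this kind of problem is to keep a process joined to the group on the downstream host, so that the host's IGMP stack answers General Membership Queries with Membership Reports on its behalf and the group entry stays alive. The sketch below is not part of pimd; the group address 239.24.0.70 comes from the log above, while the port and the INADDR_ANY interface choice are illustrative assumptions.

/* Minimal receiver that stays joined to 239.24.0.70. While it runs, the
 * kernel answers IGMP General Membership Queries for this group, which is
 * what the Windows node failed to do. */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in sin;
    struct ip_mreq mreq;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(5001);                      /* illustrative port */
    if (sock < 0 || bind(sock, (struct sockaddr *)&sin, sizeof(sin)) < 0)
        return 1;

    memset(&mreq, 0, sizeof(mreq));
    inet_pton(AF_INET, "239.24.0.70", &mreq.imr_multiaddr);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);   /* let the kernel pick */
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0)
        return 1;

    /* Stay joined; the membership itself is the point, not the traffic. */
    for (;;)
        pause();
}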