rtr7 / router7

router7 is a small home internet router completely written in Go. It is implemented as a gokrazy appliance.
https://router7.org
Apache License 2.0

neighbour: arp_cache: neighbor table overflow! #31

Closed stapelberg closed 5 years ago

stapelberg commented 5 years ago

This is the first time I have encountered this problem, and it is puzzling.

From the serial log:

2019/06/05 06:08:35 dhcp4d.go:148: DHCPACK &{Num:25 Addr:10.0.0.27 HardwareAddr:(removed) Hostname:scan2drive Expiry:0001-01-01 00:00:00 +0000 UTC}
[843041.359054] neighbour: arp_cache: neighbor table overflow!
[843041.364702] neighbour: arp_cache: neighbor table overflow!

These messages keep repeating multiple times per second.

tcpdump shows no suspicious traffic on either uplink0 or lan0.

The neighbor table garbage collection settings are unchanged from the default:

# sysctl -a | grep neigh
[…]
net.ipv4.neigh.default.gc_interval = 30
net.ipv4.neigh.default.gc_stale_time = 60
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024
[…]

The error message (arp_cache rather than ndisc_cache) leads me to believe that the problem is IPv4-related. That said, the IPv6 neighbor table only contains FAILED, INCOMPLETE and NOARP entries for lan0 (maybe a symptom caused by the IPv4 issue?).

Anyway, the IPv4 neighbor table only seems to contain one entry:

# ./ip -4 neigh show nud all
212.51.156.1 dev uplink0 lladdr 00:24:14:ef:72:ff REACHABLE

(In normal operation, it contains only one entry on uplink0, but a whole bunch of entries on lan0.)

I also checked /proc/net/stat/arp_cache:

entries  allocs destroys hash_grows  lookups hits  res_failed  rcv_probes_mcast rcv_probes_ucast  periodic_gc_runs forced_gc_runs unresolved_discards table_fulls
00000001  00001e12 000029f2 00000000  00338bff 001a39c2  000003e5  00000000 00000000  00000000 000015e8 00000000 00000b60
00000001  0000128d 00001860 00000000  00000000 00000000  000002a2  00000000 00000000  00000000 00000d85 00000000 00000b09
00000001  00002952 00002910 00000000  00000000 00000000  000003fe  00000000 00000000  0000d635 000017b0 00000000 00000b41
00000001  00005498 00004326 00000002  0012915b 000eef8d  000001b0  00000000 00000000  00000000 00004f53 00000071 00002e2a
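The counters in this file are hexadecimal, which makes them hard to compare at a glance. Here is a minimal Go sketch for decoding them, assuming the layout shown above (one header line, then one row of hex values, which as far as I understand is one row per CPU). The function name parseNeighStat is made up for this example and is not part of router7.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNeighStat decodes the contents of /proc/net/stat/arp_cache (or
// ndisc_cache): a header line with column names, followed by rows of
// hexadecimal counters. It returns one map per row, keyed by column name.
func parseNeighStat(content string) ([]map[string]uint64, error) {
	lines := strings.Split(strings.TrimSpace(content), "\n")
	if len(lines) < 2 {
		return nil, fmt.Errorf("expected a header and at least one row, got %d line(s)", len(lines))
	}
	header := strings.Fields(lines[0])
	var rows []map[string]uint64
	for _, line := range lines[1:] {
		fields := strings.Fields(line)
		if len(fields) != len(header) {
			return nil, fmt.Errorf("row has %d fields, header has %d", len(fields), len(header))
		}
		row := make(map[string]uint64, len(header))
		for i, f := range fields {
			v, err := strconv.ParseUint(f, 16, 64) // values are hex without a 0x prefix
			if err != nil {
				return nil, err
			}
			row[header[i]] = v
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	b, err := os.ReadFile("/proc/net/stat/arp_cache")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	rows, err := parseNeighStat(string(b))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for i, row := range rows {
		fmt.Printf("row %d: entries=%d allocs=%d destroys=%d forced_gc_runs=%d table_fulls=%d\n",
			i, row["entries"], row["allocs"], row["destroys"], row["forced_gc_runs"], row["table_fulls"])
	}
}
```

Decoded this way, the first row above reads entries=1, allocs=7698, destroys=10738, forced_gc_runs=5608 and table_fulls=2912.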

I tried inserting a new entry into the neighbor table with arp -s, which fails with ENOBUFS (strace output):

execve("./arp", ["./arp", "-s", "10.0.0.76", "(removed)"], 0x7ffd220b0438 /* 7 vars */) = 0
brk(NULL)                               = 0x21ce000
brk(0x21cf200)                          = 0x21cf200
arch_prctl(ARCH_SET_FS, 0x21ce8c0)      = 0
uname({sysname="Linux", nodename="router7", ...}) = 0
readlink("/proc/self/exe", "/perm/sh", 4096) = 8
brk(0x21f0200)                          = 0x21f0200
brk(0x21f1000)                          = 0x21f1000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
getuid()                                = 0
socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
ioctl(3, SIOCSARP, 0x7ffe5607713c)      = -1 ENOBUFS (No buffer space available)
write(2, "arp: SIOCSARP: No buffer space a"..., 41arp: SIOCSARP: No buffer space available
) = 41
exit_group(1)                           = ?
+++ exited with 1 +++

I also checked free memory:

             total         used         free       shared      buffers
Mem:       4020136       561812      3458324        38352       101844
-/+ buffers:             459968      3560168
Swap:            0            0            0

It’s a mystery to me how the neighbor table can be considered full with only one entry in it.

This is with Linux 5.1.1.

stapelberg commented 5 years ago

From reading linux-5.1.1/net/core/neighbour.c, the most likely issue seems to be gc_list and/or gc_entries going out of sync with the actual neighbor table entries. I won’t claim that I understand the code, though :)

ardje commented 5 years ago

Although this is closed, I just wanted to add some background: https://lists.netfilter.org/pipermail/netfilter/2002-November/040337.html

These days, however, there should be no more flow cache for IPv4. I don't know whether IPv6 still has a flow cache. In any case, a router should usually have a gc_thresh1 that is a lot bigger than 128.