Closed f00b4r0 closed 5 months ago
Provided that the "netcat" package is installed, that the LAN IP is 192.168.1.1 and almost no client devices are present, the following script will trigger the bug:
#!/bin/sh
sysctl -w net.ipv4.neigh.default.gc_thresh1=512
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
for i in $(seq 2 254); do
    echo "" | netcat -c -u 192.168.1.$i 65534 # create a large number of NUD FAILED neighbours
done
sleep 5
sysctl -w net.ipv4.neigh.default.gc_thresh1=128
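Before the final sysctl lowers the threshold, it can be useful to confirm that the loop really did fill the neighbour table with FAILED entries. The following is only a minimal sketch: it assumes the iproute2 ip tool is installed and that it prints the NUD state as the last field of each line, and count_failed is a hypothetical helper name, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: count neighbour entries whose NUD state is FAILED.
# Reads `ip neigh show`-style lines on stdin; the state is assumed to be
# the last whitespace-separated field of each line.
count_failed() {
    awk '$NF == "FAILED" { n++ } END { print n + 0 }'
}

# Usage on the router (assumes iproute2's ip is available):
#   ip -4 neigh show | count_failed
```

If the count stays far below the number of probed addresses, the hosts may not have been marked FAILED yet, and a longer sleep before lowering gc_thresh1 might be needed.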
I noticed that the rtnl listener callback I set up in a ucode script would appear to randomly "die", without any error message, while the rest of the script kept operating normally.
After a bit of digging I think I have tracked it down to what seems to be a resource exhaustion of some sort: the bug can be reproduced using the attached ucode script, which sets up a simple listener on RTNLGRP_NEIGH that prints the received messages.

Everything goes well until the neigh garbage collector kicks in and deletes a large number of neigh entries, resulting in a "large" number (hundreds) of messages being delivered. The script will typically appear to hang after printing anywhere between none and the first few of the delete messages ("cmd": 29), with no error whatsoever.

On a system where the neigh GC thresholds are set as in the script above (gc_thresh1=512, gc_thresh2=2048, gc_thresh3=4096; values fairly typical for a busy router), the garbage collector may delete hundreds of entries in one go when it kicks in (when more than 512 entries have been created), triggering the hang. I have not been able to reliably reproduce this bug when thresh1 is set to e.g. 128, which typically results in the GC kicking in more frequently and pruning only a few dozen entries per run, so the problem only seems to occur when a certain threshold number of messages arrive "at once".

I provide a memdump of the script taken after the hang.
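To put a number on how many delete messages arrive "at once", the burst can be measured independently of ucode. "cmd": 29 corresponds to RTM_DELNEIGH in linux/rtnetlink.h, and iproute2's ip monitor prefixes such events with the word Deleted. The sketch below assumes that output format; count_deleted and the log path are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count neighbour delete events (RTM_DELNEIGH) in
# `ip monitor neigh`-style output read from stdin. Lines for deleted
# entries are assumed to start with the word "Deleted".
count_deleted() {
    awk '$1 == "Deleted" { n++ } END { print n + 0 }'
}

# Usage on the router (assumes iproute2's ip; /tmp/neigh.log is arbitrary):
#   ip monitor neigh > /tmp/neigh.log &
#   monitor_pid=$!
#   sysctl -w net.ipv4.neigh.default.gc_thresh1=128
#   sleep 10
#   kill "$monitor_pid"
#   count_deleted < /tmp/neigh.log
```

Comparing this count with the number of delete messages the ucode listener printed before it stopped would show how many messages went missing during the burst.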
rtnlbug.uc.txt ucode.1703872887.23407.memdump.txt