jow- / ucode

JavaScript-like language with optional templating
ISC License

rtnl.listener dies on message burst #184

Closed f00b4r0 closed 5 months ago

f00b4r0 commented 6 months ago

I noticed that the rtnl listener callback I set up in a ucode script would appear to randomly "die", without any error message, while the rest of the script kept operating normally.

After a bit of digging, I think I have narrowed it down to what looks like a resource exhaustion of some sort: the bug can be reproduced using the attached ucode script, which sets up a simple listener on RTNLGRP_NEIGH that prints the received messages.

Everything goes well until the neigh garbage collector kicks in and deletes a large number of neigh entries, resulting in a "large" number (hundreds) of messages being delivered. The script will typically appear to hang after printing anywhere between none and the first few of the delete messages ("cmd": 29), with no error whatsoever.
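For reference, "cmd": 29 is the numeric value of RTM_DELNEIGH. One way to gauge the size of such a burst independently of ucode is to count the delete events reported by `ip monitor neigh`. The following is a minimal sketch, not part of the original report: it assumes the full iproute2 `ip` utility, which prefixes each delete event with "Deleted", and the helper name `count_deleted` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count delete events in `ip monitor neigh` output read from stdin.
# Assumes iproute2 formatting, where each RTM_DELNEIGH event line starts with "Deleted ".
count_deleted() {
    grep -c '^Deleted '
}

# Example (assumes the full iproute2 "ip" and a "timeout" utility are installed):
#   timeout 60 ip monitor neigh | count_deleted
```

Comparing that count with the number of delete messages the ucode listener printed before going quiet gives a rough idea of how many messages were missed.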

On a system where the neigh GC is set like so:

net.ipv4.neigh.default.gc_thresh1=512
net.ipv4.neigh.default.gc_thresh2=2048
net.ipv4.neigh.default.gc_thresh3=4096

(values fairly typical for a busy router), the garbage collector may delete hundreds of entries in one go when it kicks in (once more than 512 entries have been created), triggering the hang. I have not been able to reliably reproduce this bug when thresh1 is set to e.g. 128, which typically results in the GC kicking in more frequently and pruning only a few dozen entries per run, so the problem only seems to occur when a certain threshold number of messages arrive "at once".
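To see how close a system is to that condition, the current number of neighbour entries can be compared with gc_thresh1. This is a minimal sketch under stated assumptions, not part of the original report: the helper names `count_neigh` and `over_thresh` are hypothetical, and `count_neigh` expects `ip -4 neigh show` output (one entry per line) on stdin.

```shell
#!/bin/sh
# Hypothetical helper: count non-empty lines of `ip neigh show` output read from stdin.
count_neigh() {
    grep -c .
}

# Hypothetical helper: succeed when the entry count ($1) exceeds the threshold ($2).
over_thresh() {
    [ "$1" -gt "$2" ]
}

# Example (assumes "ip" and "sysctl" are available on the router):
#   n=$(ip -4 neigh show | count_neigh)
#   t=$(sysctl -n net.ipv4.neigh.default.gc_thresh1)
#   over_thresh "$n" "$t" && echo "neigh table ($n) is above gc_thresh1 ($t)"
```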

I provide a memdump of the script taken after the hang.

rtnlbug.uc.txt ucode.1703872887.23407.memdump.txt

f00b4r0 commented 6 months ago

Provided that the "netcat" package is installed, that the LAN IP is 192.168.1.1 and almost no client devices are present, the following script will trigger the bug:

#!/bin/sh

sysctl -w net.ipv4.neigh.default.gc_thresh1=512
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

for i in $(seq 2 254); do
    echo "" | netcat -c -u 192.168.1.$i 65534   # create a large number of NUD FAILED neighbours
done

sleep 5

sysctl -w net.ipv4.neigh.default.gc_thresh1=128