Open locasity opened 6 years ago
As to your 1) ... When PIDs wrap, the no-longer-in-use ones will be reused, but the ones where a process is still running will NOT be reused.
The loop 4) has caused a few "too much CPU time" issues already... Can you, in your situation when you see one going bonkers issue an
strace -tttp <the pid of the high-load process>
?
For (1) I think you can run into a scenario where a previous ICMP echo request was issued by ping or traceroute tool with a pid P, and P was reused by mttr. Note that this only happens on certain long running mttrs on boxes that have a high volume of ping and traceroute processes running, so the race condition is more nuanced and probably has something to do with interference with ping/traceroute.
Will try to get you a better strace. Here's what we could grab from one of the from yesteday (IP address anonymized)-
"", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 2 (in [0 6], left {0, 0}) gettimeofday({1524174602, 672636}, NULL) = 0 recvfrom(6, "E\0\0T\250@@\0\340\1\221\3324S!\367\n\0\0D\10\0gl8\371\0\5\n\17\331Z"..., 4470, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.0.2.0")}, [16]) = 84 read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0
This looks like one of the bugs that was already fixed. So... Are you running the latest version?
We run a high number of pings, traceroutes, and occassional mtrs on a single box. We found that somtimes mtrs run for a long time and take up upto 90% of CPU on the box.
Here's what I think is going on-
Can someone confirm this line of reasoning? Thanks!