traviscross / mtr

Official repository for mtr, a network diagnostic tool
http://www.bitwizard.nl/mtr/
GNU General Public License v2.0
2.65k stars 338 forks source link

high CPU usage on long running mtrs #253

Open locasity opened 6 years ago

locasity commented 6 years ago

We run a high number of pings, traceroutes, and occassional mtrs on a single box. We found that somtimes mtrs run for a long time and take up upto 90% of CPU on the box.

Here's what I think is going on-

  1. Packet is identified using uint16 unix pid When we do a high number of ping and traceroutes on the box the pid will warp around after 2^16 and you could get into a situation where a ping's icmp echo response acknowledges a mtr echo request.
  2. rtt is calculated based on pid match and rtt is assumed to never be too big or negative (comment literally says it is impossible) This could happen in the above situation discussed in (1), or if the system clock goes back in time. https://github.com/traviscross/mtr/blob/6377a5ae7cbb4d2a9cd8280a903f85d44d4ae21d/net.c#L615
  3. How long to wait for response packet is calculated based off' rtt. If rtt is too high we will be waiting for a long time, years maybe https://github.com/traviscross/mtr/blob/76e66e1b1dfbc29309155448230b50f576765ce9/ui/select.c#L70
  4. while waiting it's doing IO read in a while(1) loop, and thats probably why the high CPU util https://github.com/traviscross/mtr/blob/76e66e1b1dfbc29309155448230b50f576765ce9/ui/select.c#L181

Can someone confirm this line of reasoning? Thanks!

rewolff commented 6 years ago

As to your 1) ... When PIDs wrap, the no-longer-in-use ones will be reused, but the ones where a process is still running will NOT be reused.

The loop 4) has caused a few "too much CPU time" issues already... Can you, in your situation when you see one going bonkers issue an strace -tttp <the pid of the high-load process> ?

locasity commented 6 years ago

For (1) I think you can run into a scenario where a previous ICMP echo request was issued by ping or traceroute tool with a pid P, and P was reused by mttr. Note that this only happens on certain long running mttrs on boxes that have a high volume of ping and traceroute processes running, so the race condition is more nuanced and probably has something to do with interference with ping/traceroute.

Will try to get you a better strace. Here's what we could grab from one of the from yesteday (IP address anonymized)-

"", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 2 (in [0 6], left {0, 0}) gettimeofday({1524174602, 672636}, NULL) = 0 recvfrom(6, "E\0\0T\250@@\0\340\1\221\3324S!\367\n\0\0D\10\0gl8\371\0\5\n\17\331Z"..., 4470, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.0.2.0")}, [16]) = 84 read(0, "", 1) = 0 select(10, [0 6 8 9], [], NULL, {0, 0}) = 1 (in [0], left {0, 0}) read(0, "", 1) = 0

rewolff commented 6 years ago

This looks like one of the bugs that was already fixed. So... Are you running the latest version?