traviscross / mtr

Official repository for mtr, a network diagnostic tool
http://www.bitwizard.nl/mtr/
GNU General Public License v2.0

Explaining a questionable measurement result #450

Open elfring opened 1 year ago

elfring commented 1 year ago

I built the evolving network analysis tool according to the development revision “v0.95-17-g826ffa9”.

I then tried the program out.

Sonne:/home/altes_Heim2/elfring/Projekte/Bau/mtr/console # ./mtr --report-cycles 5 --report download.opensuse.org
Start: 2022-09-10T13:05:51+0200
HOST: Sonne                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- fritz.box                  0.0%     5    0.9   7.7   0.8  35.0  15.3
  2.|-- ???                       100.0     5    0.0   0.0   0.0   0.0   0.0
  3.|-- 2a00:6020:0:b::1          40.0%     5    6.7   7.1   6.7   7.4   0.3
  4.|-- as33891.dusseldorf.megapo  0.0%     5    6.7   6.6   6.2   7.1   0.3
…

How does this measurement fit with another test result like the following? :thinking:

Markus_Elfring@Sonne:~> ping -c5 2a00:6020:0:b::1
…
--- 2a00:6020:0:b::1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 6.343/6.645/7.373/0.374 ms

rewolff commented 1 year ago

Perfectly.

Both MTR and ping report around 6.6-7ms round trip times to that 2a00:6020:0:b::1 host.

Across the 10 packets sent to that host, the fastest (6.3ms) was recorded by ping, and the slowest (7.4ms) was recorded by both, as ping's 7.373 also rounds to 7.4.

(The difference in average is easily explained by "random fluctuations" or by the, say, 5-10 seconds that passed between the two measurements).

elfring commented 1 year ago

:thought_balloon: I find that the displays “??? … 100.0” and “40.0%” would need further clarifications (according to the information “Loss%”).

rewolff commented 1 year ago

Some routers send back the recommended error message: "your packet encountered an error under my supervision". This is what MTR and "traceroute" rely on to do their thing.

However some don't. So although it can be deduced from what DID come back, that there must be a router at position 2 in the path to the "downloads" machine, it did not send back ANY response when it was supposed to. So we can't figure out its IP address or anything else (like round trip time).

The host 2a00 failed to respond to 40% (i.e. 2 out of 5) of the probes sent by mtr.... Some hosts take the middle route in sending back those error packets: Say an X percentage or "max 1 per second for the whole machine".
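As a minimal sketch of the arithmetic behind the "Loss%" column, assuming it is computed as the share of sent probes that got no reply (the function names here are hypothetical and not taken from the mtr source):

```python
def loss_percent(sent: int, received: int) -> float:
    """Share of probes without a reply, in percent (assumed Loss% definition)."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def replies_from_loss(sent: int, loss_pct: float) -> int:
    """Number of replies implied by a given loss percentage."""
    return round(sent * (1.0 - loss_pct / 100.0))


# Hop 3 in the report above: 5 probes sent, 40.0% loss.
print(replies_from_loss(5, 40.0))   # replies that came back
print(loss_percent(5, 0))           # a hop that never answers, like hop 2
```

Under that assumption, a hop showing 40.0% after 5 probes answered 3 of them and stayed silent for 2.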

All these choices to "not send back packets when an error occurs" are reasonably valid when those errors are happening because "the network is overloaded and in a <can't cope anymore!> situation"... but when carefully crafting packets to trigger errors for network diagnosis reasons it is "annoying" when manufacturers turn them completely or partially off.

Alas, we'll have to deal with it.

elfring commented 1 year ago

…, that there must be a router at position 2 in the path to the "downloads" machine, it did not send back ANY response when it was supposed to. So we can't figure out its IP address or anything else …

:thought_balloon:

The host 2a00 failed to respond to 40% (i.e. 2 out of 5) of the probes sent by mtr....

I found this display suspicious after the determination that no packets were lost by other hosts in the mentioned measurement configuration.

Say an X percentage or "max 1 per second for the whole machine"

Would it be needed then to adjust the probing frequency any further? :thinking:

rewolff commented 1 year ago

Would it be needed then to adjust the probing frequency any further?

I noticed the "max 1 per second" behaviour when a "no arguments" mtr would give 100% returns on a host. Then starting a second mtr reported 50% packetloss to that same host. Looking back to the first trace, it too had started seeing packetloss... And it went away when I stopped the second mtr.

I say it is unavoidable, and the default is reasonable. You might have had someone else run a traceroute / mtr at the same time, or that host is configured to only return an error every 2 or 3 seconds.
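The "max 1 per second for the whole machine" effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the router is modelled as answering a probe only if at least `min_gap` seconds have passed since its last reply, and the sender names and probe times are made up for illustration. Real routers may use token buckets or other schemes.

```python
def simulate_rate_limit(probe_times_by_sender, min_gap=1.0):
    """Return {sender: (sent, answered)} for a router that sends at most
    one error reply per min_gap seconds, shared across all senders."""
    # Merge all probes into one timeline, as the router sees them.
    events = sorted(
        (t, sender)
        for sender, times in probe_times_by_sender.items()
        for t in times
    )
    stats = {sender: [0, 0] for sender in probe_times_by_sender}
    last_reply = None
    for t, sender in events:
        stats[sender][0] += 1
        if last_reply is None or t - last_reply >= min_gap:
            stats[sender][1] += 1      # router answers this probe
            last_reply = t
    return {sender: tuple(counts) for sender, counts in stats.items()}


# One mtr probing once per second: every probe gets a reply.
alone = simulate_rate_limit({"first": [float(i) for i in range(10)]})

# A second mtr starts half a second later: only half of all probes get a reply.
both = simulate_rate_limit({
    "first": [float(i) for i in range(10)],
    "second": [i + 0.5 for i in range(10)],
})
print(alone, both)
```

With these perfectly regular timings the second sender loses every probe; with real-world jitter the loss would be spread over both runs, which matches the observation that the first trace also started seeing packet loss.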

I don't understand what you mean by "safe". Mtr sent a packet that should've been returned-with-error by the host in the second position. It didn't send any errors back, it just dropped them on the floor.

(Sometimes they violate protocol and simply pass them on. Then you see a single host at both position 2 and 3. That's not happening here).

Imagine a country called netland. Lots of trains going everywhere, but each train runs between just two stations. The railroads provide a service that lets you send parcels to remote stations by attaching a wad of cash: each station will take 1 bill and forward the parcel towards the destination. Normally you attach plenty of cash, the parcels simply arrive, and you can communicate with the stationmaster at the other end. But mtr sends the parcels on their way with only 1, 2, 3 banknotes attached. When there are no more banknotes, the stationmaster is recommended to send back a letter (attaching his own cash!) saying that he couldn't forward your parcel due to "out of money". But some of them are lazy bastards!

elfring commented 1 year ago

I don't understand what you mean by "safe".

Is the positioning of the shown triple question marks meaningful according to the involved data transmissions?

Mtr sent a packet

Which addresses were used for the target hosts?

that should've been returned-with-error by the host in the second position.

Did you expect a corresponding feedback by the affected network component?

It didn't send any errors back, it just dropped them on the floor.

Did another response message not arrive within the usual time constraints?

rewolff commented 1 year ago

In my netland analogy, mtr sends a parcel with just 1 bill attached and gets back: "Stationmaster fritz.box couldn't forward your parcel". mtr sends a parcel with just 2 bills attached, and it is never heard of again. mtr sends a parcel with just 3 notes attached and stationmaster 2a00:... responds with a note: "Stationmaster 2a00:... couldn't forward your parcel due to insufficient funds".

So yeah, we know the lazy stationmaster is at the second station, but we don't know his name, because he didn't send any notes back.
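The netland analogy can be sketched in a few lines. This is a toy model and not mtr's implementation: the `Station` class, the shortened path and the name of the silent router are hypothetical, and the destination is simply assumed to answer.

```python
from dataclasses import dataclass


@dataclass
class Station:
    name: str
    sends_error_notes: bool = True  # False models the "lazy stationmaster"


def trace(path):
    """Return one (position, label) row per hop, as a TTL-limited trace sees it.

    path lists every station from the first router to the destination.
    """
    rows = []
    for banknotes in range(1, len(path) + 1):  # TTL / hop limit of this parcel
        station = path[banknotes - 1]          # station where the money runs out
        at_destination = banknotes == len(path)
        if at_destination or station.sends_error_notes:
            rows.append((banknotes, station.name))
        else:
            rows.append((banknotes, "???"))    # no note came back
    return rows


# Shortened, partly made-up path; the real identity of hop 2 is unknown.
path = [
    Station("fritz.box"),
    Station("hypothetical-silent-router", sends_error_notes=False),
    Station("2a00:6020:0:b::1"),
    Station("download.opensuse.org"),
]
for position, label in trace(path):
    print(f"{position}. {label}")
```

The position of the "???" row is still meaningful: the parcel with 3 banknotes got further than the one with 2, so some station must sit at position 2 even though it never identified itself.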

All parcels are addressed to the final host. I think. Some people have a situation like yours, where host 3 only occasionally responds with the (error) notes, but gladly responds to "Dear stationmaster, please send this back if you receive this, thanks!" (ping). The issue is that some hosts are precisely the reverse. They do NOT answer to the ping requests. So you'd get a 100% packetloss next to a host where we DO know its name.

elfring commented 1 year ago

So yeah, we know the lazy stationmaster is at the second station,

I would like to be sure that other positions are not affected during the discussed system test.

but we don't know his name,

Can any identifier be determined by special approaches?

because he didn't send any notes back.

Would any other hints make the impact of not responding participants clearer?

All parcels are addressed to the final host. I think.

Can any additional addresses be used for intermediate hosts?

They do NOT answer to the ping requests.

Are corresponding messages ignored because of a special system configuration at this place?

So you'd get a 100% packetloss next to a host where we DO know its name.

Thanks for such a clarification.

elfring commented 1 year ago

Mtr sent a packet …

How will the chances evolve then to display a usable address instead of question marks? :thinking:

elfring commented 1 year ago

Would you get further development ideas from information like the following? :thinking:

Sonne:~ # traceroute --icmp --max-hops=99 download.opensuse.org
…
 1  fritz.box (…)  0.893 ms  1.141 ms  1.423 ms
 2  100.68.0.1 (…)  7.298 ms  7.295 ms  7.434 ms
 3  100.127.1.133 (…)  9.447 ms  9.445 ms  9.442 ms
 4  100.127.1.132 (…)  11.012 ms  11.008 ms  11.005 ms
 5  185.22.46.129 (…)  12.001 ms  12.140 ms  12.281 ms
 6  as33891.dusseldorf.megaport.com (…)  11.130 ms  7.163 ms  7.243 ms
…
Sonne:~ # traceroute --tcp --max-hops=99 download.opensuse.org
…
 1  fritz.box (…)  0.926 ms  1.089 ms  1.262 ms
 2  100.68.0.1 (…)  6.109 ms  6.150 ms  6.226 ms
 3  100.127.1.132 (…)  7.732 ms  8.808 ms  7.864 ms
 4  185.22.46.129 (…)  10.144 ms  9.030 ms  11.827 ms
 5  as33891.dusseldorf.megaport.com (…)  8.807 ms  9.115 ms  10.665 ms
…
 8  100.127.1.132 (…)  6.630 ms !X 195.135.221.26 (…)  10.317 ms 100.127.1.132 (…)  6.091 ms !X

elfring commented 1 year ago

What do you think about taking another look at an information source like “United States Patent 11,356,352 (from 2022-07-06): Identifying reachability of network-connected devices”? :thinking:

rewolff commented 1 year ago

Are corresponding messages ignored because of a special system configuration at this place?

Yes.

If "traceroute" does get a response and mtr doesn't, we should analyse the packets sent and find their differences. When we know that, we can consider whether we want to emulate traceroute or not. The thing is mtr has evolved to work quite well, and under certain circumstances it is just slightly different from traceroute. If we decided to make it different because in certain cases we WOULD get a response where traceroute didn't, then it is a difficult decision to change to the exact packets that traceroute uses.

elfring commented 1 year ago

If "traceroute" does get a response and mtr doesn't, we should analyse the packets sent and find their differences.

Would you like to check data processing effects any more for the affected host(s) also according to known addresses in IPv4 and IPv6 format?

The thing is mtr has evolved to work quite well, …

How much does this software distinguish between (intermediate) source and target addresses for the discussed data display?

osbjmg commented 1 year ago

I think it's possible your situation is specific to the behavior in your network right now. It seems like a good time to tcpdump and see if hop number two is sending TTL exceeded messages. If so, then it would be time to troubleshoot the path back and the stacks along the way.

elfring commented 1 year ago

I think it's possible your situation is specific to the behavior in your network right now.

Yes, of course.

💭 I am trying to improve affected network components with further help somehow.

elfring commented 1 year ago

I took another look at data from the following program test run.

Sonne:…/Bau/mtr/console # strace -o strace-20220915.txt ./mtr --report-cycles 1 --report download.opensuse.org

It seems that 290 function calls were recorded. Would you like to add special comments for calls of the function “connect” (and corresponding checks for error codes)?

elfring commented 1 year ago

Do you get further development ideas from information sources like the following? :thinking: