m-lab / murakami

Run automated internet measurement tests in a Docker container.
Apache License 2.0

Issue with Murakami ndt7 agent not running at regular intervals #111

Open 11gingerbread opened 1 year ago

11gingerbread commented 1 year ago

For our setup, we built a Murakami server on-prem (base operating system: CentOS 8), and we were querying this server for the download/upload speed of several servers using ndt7-client running in a container.

We never got upload values in the range we expected, so we doubted this data. I ended up setting "regions": [ "US-NY" ] in "ndt7-custom-config.json", which started giving us values closer to what we expected.

I am still running into a couple of issues with my configuration and am seeking suggestions to fix or improve it.

1) I have configured "tests-per-day = 24", but I don't get roughly 1 test per hour. Sometimes Murakami runs 4 tests in 1 hour and then none for the next 4 hours, which doesn't produce a very good dataset for me. Is there any way to configure or force one Murakami test per hour?

2) I am using "regions": [ "US-NY" ] in my "ndt7-custom-config.json" config file. Is there a way to configure the NDT server closest to each server's location, as my servers are geographically dispersed? Also, hardcoding a region doesn't give me any redundancy, so if the test could run against the nearest 'online' location, it would solve this issue too.

3) For running ad-hoc speed tests, I was using the command below with my on-prem NDT server. Can you please suggest the command format for a speed test against the nearest M-Lab NDT server?

docker exec ndt7-client -format json -quiet -server=:8080 -scheme=ws

4) The JSON file generated by Murakami is as follows:

{"TestName": "ndt7", "TestStartTime": "XXXXXXXXXXx", "TestEndTime": "XXXXXXXXXXX", "MurakamiLocation": "Server", "MurakamiConnectionType": "wired", "MurakamiNetworkType": "Clear", "MurakamiDeviceID": "", "ServerName": "ndt-mlab2-yul01.mlab-oti.measurement-lab.org", "ServerIP": "162.213.100.216", "ClientIP": "X.X.X.X", "DownloadUUID": "XXXXXXXXXXXX", "DownloadValue": 119.73344173353648, "DownloadUnit": "Mbit/s", "DownloadError": null, "UploadValue": 7.600105816737266, "UploadUnit": "Mbit/s", "UploadError": null, "DownloadRetransValue": 1.5427647469121153, "DownloadRetransUnit": "%", "MinRTTValue": 82.402, "MinRTTUnit": "ms"}

Is there a way for me to get latency and jitter from this test, in addition to upload/download speed, in the above JSON output?

5) Last but not least, what could be the reason that the values from my on-prem NDT server are drastically different from the ones I am getting from the M-Lab NDT server?

11gingerbread commented 1 year ago

@robertodauria Roberto, can you please look into this?

robertodauria commented 1 year ago

@11gingerbread The randomization of intervals is intentional, not a bug. The time between tests is drawn from an exponential distribution whose mean is the configured interval (i.e. about 1 hour in your case): https://github.com/m-lab/murakami/blob/a8d7ff541de66e5a5859330e166c263125bd9e31/murakami/server.py#L37

Over the long run (e.g. a year), you'll get about 24 tests per day on average, but any two consecutive tests will not be exactly 1h apart. There is currently no way in Murakami to use a fixed interval, nor is it something M-Lab recommends. Please see https://www.measurementlab.net/develop/#best-practices-on-test-scheduling-and-frequency, specifically the paragraph about software and hardware integrations.

It basically boils down to:

  1. fixed intervals produce predictable patterns of high load instead of distributing the load over time (which is bad for the platform)
  2. fixed intervals produce worse measurements, since measuring at the same times every day hides cyclical patterns from the user (which is bad for the user)
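
To illustrate why tests cluster and then leave gaps, here is a minimal simulation of exponentially distributed intervals. It is a sketch of the general technique, not Murakami's actual scheduler code; the function name `simulate_test_times` and its parameters are made up for this example.

```python
import random


def simulate_test_times(tests_per_day=24, days=30, seed=0):
    """Return test start times (in hours) with exponentially distributed gaps.

    Hypothetical helper for illustration; not part of Murakami.
    """
    rng = random.Random(seed)
    mean_gap_hours = 24 / tests_per_day  # 1 hour for 24 tests per day
    horizon = 24 * days
    t = 0.0
    times = []
    while True:
        # expovariate takes the rate (1 / mean), not the mean itself.
        t += rng.expovariate(1 / mean_gap_hours)
        if t >= horizon:
            break
        times.append(t)
    return times


days = 30
times = simulate_test_times(tests_per_day=24, days=days)

# Count how many tests fall into each one-hour bucket.
counts = {}
for t in times:
    hour = int(t)
    counts[hour] = counts.get(hour, 0) + 1

per_day = len(times) / days
busiest_hour = max(counts.values())
empty_hours = 24 * days - len(counts)

print(f"average tests per day: {per_day:.1f}")
print(f"most tests in a single hour: {busiest_hour}")
print(f"hours with no test at all: {empty_hours}")
```

The daily average stays close to the configured value, while individual hours can contain several tests or none, which matches the behaviour described in the question.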

Using a fixed custom region is generally worse than letting the client query the M-Lab load balancer for the closest M-Lab server, and it will give you terrible results if clients are distributed around the world (e.g. a client in India will have very high RTT and low throughput to a server in US-NY).
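
As a rough sketch of what querying the load balancer looks like from a script: the endpoint URL and the `results`/`machine` response fields below are assumptions based on M-Lab's Locate API v2 and should be checked against the current M-Lab documentation. The helper names and the sample payload are hypothetical.

```python
import json
import urllib.request

# Assumption: Locate API v2 endpoint for ndt7; verify against M-Lab's docs.
LOCATE_URL = "https://locate.measurementlab.net/v2/nearest/ndt/ndt7"


def fetch_nearest(url=LOCATE_URL, timeout=10):
    """Ask the locate service for nearby servers and return the parsed JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def server_names(payload):
    """Return machine names in the order the service listed them.

    Assumes a top-level "results" list whose entries carry a "machine" key.
    """
    return [r["machine"] for r in payload.get("results", []) if "machine" in r]


# Hypothetical payload, for illustration only (not real server names).
example_payload = {
    "results": [
        {"machine": "server-a.example.org"},
        {"machine": "server-b.example.org"},
    ]
}

print(server_names(example_payload))
```

In real use, `server_names(fetch_nearest())` would replace the hard-coded payload. Because each client asks for itself, geographically dispersed clients each get servers near them, and a server that is offline is simply not offered.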

As for your question about latency, I believe you want TCPInfo's MinRTT, which is MinRTTValue in that JSON. For jitter, I answered the same question at https://github.com/m-lab/murakami/issues/108#issuecomment-1131801927: PRs adding RTTVar to ndt7-client-go's summary and to murakami would be welcome. I don't have time to work on it at the moment, but I'm happy to review them.
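
For example, a short script can pull the latency field out of a Murakami result. This sketch uses only field names that appear in the JSON posted above; the `summarize` helper is hypothetical, and the sample record is a shortened copy of that output.

```python
import json


def summarize(record_json):
    """Extract throughput and latency fields from one Murakami ndt7 record."""
    record = json.loads(record_json)
    return {
        "server": record.get("ServerName"),
        "download_mbps": record.get("DownloadValue"),
        "upload_mbps": record.get("UploadValue"),
        "min_rtt_ms": record.get("MinRTTValue"),  # latency (TCPInfo MinRTT)
        "download_retrans_pct": record.get("DownloadRetransValue"),
    }


# Shortened copy of the record from the issue description.
sample = (
    '{"TestName": "ndt7", '
    '"ServerName": "ndt-mlab2-yul01.mlab-oti.measurement-lab.org", '
    '"DownloadValue": 119.73344173353648, "DownloadUnit": "Mbit/s", '
    '"UploadValue": 7.600105816737266, "UploadUnit": "Mbit/s", '
    '"DownloadRetransValue": 1.5427647469121153, "DownloadRetransUnit": "%", '
    '"MinRTTValue": 82.402, "MinRTTUnit": "ms"}'
)

summary = summarize(sample)
print(summary)
```

There is no jitter field to read here; as noted above, that would require RTTVar to be added upstream first.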

Regarding your on-premise ndt-server instance, assuming by "different" you mean "worse", the first thing I would check is that the server itself is not the bottleneck -- e.g. what's the CPU load while running N measurements? Do you see different performance when running a single measurement vs a hundred in parallel? Do you have any way to measure a link with known capacity (e.g. a second machine on the same switch as the server) to exclude a CPU bottleneck?

Assuming the server capacity is not the bottleneck, then it likely depends on the different network paths. M-Lab servers are usually located at Internet interconnection points and connected to multiple major transit providers in each geographic area.

Different paths are expected to produce different results, and the result you see is indicative of what the TCP throughput between that client and that server is at that time. Changing the client location or the server location (or anything in the path, really) usually changes the result.

I recommend opening a separate GitHub issue (https://github.com/m-lab/ndt-server or https://github.com/m-lab/ndt7-client-go) to further explore these performance differences when using the ndt7 protocol. Murakami is a test scheduler for external measurement clients, and your question is more targeted towards a specific testing protocol and implementation.

laiyi-ohlsen commented 1 year ago

Hi @11gingerbread, thanks for writing and for your interest in Murakami and NDT. It would be great to learn more about your use case. Would you be open to reaching out to support@measurementlab.net so we can discuss further?