Open aorith opened 2 months ago
Hi!
Oh, and one more thing: is it only the graphs, or is the command line output also wrong? (either run --headless or add --print-stats)
Hey!
I've been testing with HTTPUser and I get the same result. It's not only the chart: the statistics are also wrong, which is visible when running the master with --print-stats, as you pointed out!
I ran a smaller test (same settings, but limited to 50 users, only 17 workers (1 per node) and the HTTPUser class). I also added --csv $file --csv-full-history so you can see some numbers.
$ awk -F',' '{printf $1","$5","$18"\n"}' out.csv_stats_history.csv
Timestamp,Requests/s,Total Request Count
1715176110,0.000000,0
1715176111,0.000000,0
1715176112,0.000000,0
1715176113,0.000000,0
1715176114,0.000000,0
1715176115,0.000000,0
1715176116,0.000000,0
1715176117,0.000000,0
1715176118,0.000000,0
1715176119,0.000000,0
1715176120,0.000000,0
1715176121,0.000000,0
1715176122,0.000000,0
1715176123,0.000000,0
1715176124,0.000000,0
1715176125,0.000000,0
1715176126,0.000000,0
1715176127,0.000000,0
1715176128,0.000000,0
1715176129,0.000000,0
1715176130,0.000000,0
1715176131,0.000000,0
1715176132,0.000000,408
1715176132,0.000000,408
1715176133,0.000000,781
1715176133,0.000000,781
1715176134,0.000000,830
1715176134,0.000000,830
1715176135,0.000000,3918
1715176135,0.000000,3918
1715176136,36.200000,8341
1715176136,36.200000,8341
1715176137,36.200000,8566
1715176137,36.200000,8566
1715176138,36.200000,13239
1715176138,36.200000,13239
1715176139,97.100000,22190
1715176139,97.100000,22190
1715176140,97.100000,22489
1715176140,97.100000,22489
1715176141,97.100000,27279
1715176141,97.100000,27279
1715176142,447.800000,37087
1715176142,447.800000,37087
1715176143,447.800000,37385
1715176143,447.800000,37385
1715176144,447.800000,42201
1715176144,447.800000,42201
1715176145,815.500000,52127
1715176145,815.500000,52127
1715176146,815.500000,52427
1715176146,815.500000,52427
1715176147,815.500000,57226
1715176147,815.500000,57226
1715176148,1147.200000,67299
1715176148,1147.200000,67299
1715176149,1147.200000,67600
1715176149,1147.200000,67600
1715176150,1147.200000,72432
1715176150,1147.200000,72432
1715176151,1308.400000,82401
1715176151,1308.400000,82401
1715176152,1308.400000,82699
1715176152,1308.400000,82699
1715176153,1308.400000,87482
1715176153,1308.400000,87482
1715176154,1312.300000,97448
1715176154,1312.300000,97448
1715176155,1312.300000,97748
1715176155,1312.300000,97748
1715176156,1312.300000,102587
1715176156,1312.300000,102587
1715176157,1311.800000,112425
1715176157,1311.800000,112425
1715176158,1311.800000,112725
1715176158,1311.800000,112725
1715176159,1311.800000,117587
1715176159,1311.800000,117587
1715176160,1295.600000,127318
1715176160,1295.600000,127318
1715176161,1295.600000,127619
1715176161,1295.600000,127619
1715176162,1295.600000,131377
1715176162,1295.600000,131377
1715176163,1281.100000,137463
1715176163,1281.100000,137463
1715176164,1281.100000,137537
1715176164,1281.100000,137537
1715176165,1281.100000,137537
1715176165,1281.100000,137537
1715176166,1281.100000,137537
1715176166,1281.100000,137537
1715176167,1281.100000,137537
1715176167,1281.100000,137537
1715176168,1281.100000,137537
1715176168,1281.100000,137537
1715176169,1281.100000,137537
1715176169,1281.100000,137537
1715176170,1281.100000,137537
1715176170,1281.100000,137537
1715176171,1281.100000,137537
1715176171,1281.100000,137537
1715176172,1281.100000,137537
1715176172,1281.100000,137537
1715176173,1281.100000,137537
1715176173,1281.100000,137537
1715176174,1281.100000,137537
1715176174,1281.100000,137537
1715176175,1281.100000,137537
1715176175,1281.100000,137537
1715176176,1281.100000,137537
1715176176,1281.100000,137537
1715176177,1281.100000,137537
1715176177,1281.100000,137537
1715176178,1281.100000,137537
1715176178,1281.100000,137537
1715176179,1281.100000,137537
1715176179,1281.100000,137537
1715176180,1281.100000,137537
1715176180,1281.100000,137537
Even if we ignore the warmup and the period before the test even started, and take 1715176110 as the start time and 1715176180 as the end time (70 seconds in total), that gives us 137537 / 70 = ~1964.81 RPS.
If I instead count from the timestamp where I pressed "start", which should be 1715176131, that's 49 seconds, so ~2806.87 RPS.
Or we can look at a subset of the test:
1715176151,1308.400000,82401
1715176152,1308.400000,82699
1715176152,1308.400000,82699
1715176153,1308.400000,87482
1715176153,1308.400000,87482
1715176154,1312.300000,97448
1715176154,1312.300000,97448
1715176155,1312.300000,97748
1715176155 - 1715176151 = 4s
and 97748 - 82401 = 15347 req
so 15347 / 4 = ~3836 RPS
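The per-interval arithmetic above is easy to automate. Here is a small sketch that computes the effective RPS between the first and last rows of a stats-history file; the column indexes are assumptions based on the awk extraction above (timestamp in column 1, Total Request Count in column 18):

```python
import csv

def effective_rps(path, ts_col=0, total_col=17):
    """Compute requests/s between the first and last data rows of a
    locust --csv-full-history stats file. Column indexes (0-based)
    are assumptions matching the awk command above."""
    with open(path) as f:
        rows = list(csv.reader(f))[1:]  # skip the header row
    t0, n0 = int(rows[0][ts_col]), int(rows[0][total_col])
    t1, n1 = int(rows[-1][ts_col]), int(rows[-1][total_col])
    return (n1 - n0) / (t1 - t0)
```

Running it over the file above would show the same mismatch between the reported Requests/s column and the rate implied by the total count.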
I'm not sure if the issue is that some requests never finish; would those requests be counted in the total statistics?
Thank you for looking into this.
If requests never finish (and don't time out either), then they are not logged in Locust at all.
You may want to reduce the timeout (using FastHttpUser.network_timeout and .connection_timeout, default is 60 seconds) and see if that makes a difference.
You can also add the --stop-timeout argument to allow Locust to exit gracefully when shutting down (this could make it even more obvious if something has gone wrong and it is stuck somehow).
Also, can you share some logs (from the master and a worker)?
Sure, I'll share some logs tomorrow, and I'll also review the logs and documentation. In the meantime, I have a couple of questions:
class LoadTest(FastHttpUser):
    connection_timeout = 5
    network_timeout = 10
1) Yes!
2) Yes, use the --expect-workers option
Hi!
I've been trying to reproduce it with 1 or 2 servers, without success. It only happens when I significantly increase the number of worker servers and the overall RPS is higher.
I'm attaching the logs from the following test, I tried not to trigger cpu warnings on any worker:
-u 10000 -r 10 --run-time 45s --expect-workers 360 --stop-timeout 3
FastHttpUser, with connection_timeout = 5 and network_timeout = 10
peak RPS seen on the target server = ~80000
peak RPS on locust = ~7920.7
total requests on locust = 1820750
total requests on the server log = ~2007200
test_start = 1715255664
test_end = 1715255706
runtime = 42s
"average" rps = 1820750 / 42 = 43351
This is with a logging level of INFO, I could do an identical test in DEBUG level and share it if it doesn't contain sensitive data (or just remove it first).
Hi! I don't see anything strange in your logs. Try setting DEBUG log level. If the log becomes overwhelming, try reducing the number of workers. Could it possibly be that workers don't properly report the last requests? (They should, but there could be a bug or something.)
Another thing you might try is installing locust-plugins and using its iteration limit (-i) to see if the numbers match then (https://github.com/SvenskaSpel/locust-plugins/blob/507ff46808d2945f2eea6ed46aa2d69117159ede/examples/cmd_line_examples.sh#L10).
It shouldn't be needed, of course, but it might narrow down where the problem originates.
I enabled DEBUG log level and "reproduced" it again, but I don't see anything wrong in the logs.
I also tried to reproduce it without success locally with a dummy go http server and 18 workers as docker replicas. This test had a peak of 32k RPS and all the numbers matched.
I cannot do many more tests on the real environment; I'm inclined to think that it's related to the network.
Is there a way to "cool down" the test? For example, increase the users for the first X seconds and then decrease them to 0 before ending the run.
You can use a load shape https://docs.locust.io/en/latest/custom-load-shape.html#custom-load-shapes
But --stop-timeout should have been enough (you ran with that when the logs were generated, right?). With that setting, either users should finish their tasks or there would be an error message, saying they had to be interrupted.
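The tick() contract for a custom load shape boils down to stage selection by elapsed time: return (user_count, spawn_rate) while the test should keep running, or None to end it. Here's a minimal sketch of that selection logic as a plain function that a LoadTestShape.tick() implementation could delegate to; the stage boundaries and user counts are made-up illustrative values:

```python
# Illustrative ramp-up / hold / ramp-down stages: (end_time_s, users, spawn_rate)
STAGES = [
    (30, 50, 10),  # until t=30s: ramp to 50 users
    (60, 50, 10),  # until t=60s: hold at 50
    (90, 0, 10),   # until t=90s: ramp back down to 0
]

def tick(run_time, stages=STAGES):
    """Return (user_count, spawn_rate) for the current elapsed time,
    or None to end the run -- the contract LoadTestShape.tick() uses."""
    for end_time, users, spawn_rate in stages:
        if run_time < end_time:
            return users, spawn_rate
    return None
```

Inside a real LoadTestShape subclass you would call self.get_run_time() and return this function's result from tick(), as described in the custom-load-shape docs linked above.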
Ok, I think I found out what was happening, and it's not Locust's fault.
I was checking how the stats are computed: https://github.com/locustio/locust/blob/58515c41e2a07d3aa55772c5cb2b88bbbdf75057/locust/stats.py#L313-L314 And I was messing around with this locustfile in the docker environment to see how the stats reporting worked:
import json
import logging
import time
from collections import Counter
from pathlib import Path

from locust import FastHttpUser, events, task

logger = logging.getLogger(__name__)
reports = Counter()

@events.report_to_master.add_listener
def on_report_to_master(client_id, data, **_):
    reports[client_id] += 1
    data["report_timestamp"] = time.time()
    data["report_num"] = reports[client_id]

@events.worker_report.add_listener
def on_worker_report(client_id, data, **_):
    out = Path(f"/mnt/locust/logs/{data['report_num']}_{client_id}.json")
    out.write_text(json.dumps(data))

@events.quit.add_listener
def on_quit(**_):
    for k, v in reports.items():
        out = Path(f"/mnt/locust/logs/{k}.reports")
        out.write_text(str(v))

class MyUser(FastHttpUser):
    connection_timeout = 5
    network_timeout = 10

    @task
    def hit_webserver(self):
        self.client.get("/")
Since each worker sends its RPS stored in a dict keyed by timestamp, and that timestamp comes from each worker server's own clock, I checked the date on all 18 servers. I found that 4 of them had an offset of more than 1-2 seconds, with one server being 160 seconds behind!
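To illustrate why a skewed worker clock corrupts the aggregate: if the master merges each worker's per-second request counts by timestamp key, a worker whose clock is 160 seconds behind puts its counts into buckets nobody else is writing to, so no single bucket ever shows the combined rate. A toy sketch of such a merge (this data layout is a simplification for illustration, not Locust's actual wire format):

```python
from collections import Counter

def merge_per_second(*worker_counts):
    """Merge per-worker {epoch_second: request_count} dicts by
    summing counts per timestamp key, the way an aggregator would."""
    total = Counter()
    for counts in worker_counts:
        total.update(counts)
    return dict(total)

# Two workers each doing 100 req/s over the same 3 real seconds,
# but worker B's clock is 160 s behind:
worker_a = {1000: 100, 1001: 100, 1002: 100}
worker_b = {840: 100, 841: 100, 842: 100}

merged = merge_per_second(worker_a, worker_b)
# The real combined load was 200 req/s, but every merged bucket
# shows only 100, so the per-second rate appears halved.
```

The total request count still adds up correctly (it isn't keyed by time), which matches the symptom here: totals were right, per-second rates were wrong.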
If this is the case, in theory we should be able to reproduce it in a docker environment using https://github.com/wolfcw/libfaketime, setting an offset for each worker and overriding the date before reporting. I'm not sure if something like this would work.
I've now synced all the servers and ran a couple of tests that previously reported incorrect RPS and now the numbers seem to match. Does it make sense to you? If it does feel free to close this.
Thank you.
There used to be a feature that detected time difference between master and worker, but it had to be disabled (4+ years ago) due to false positives. I’ll have a look at reintroducing it!
Prerequisites
Description
The test is executed across 17 machines with a total of 170 workers. I've done this in the past and the RPS chart was accurate (I think Locust was on 1.x.x back then). All the servers, including the master, have version 2.27.0.
For context, the chart shows a maximum of 14,000 RPS; however, the real-time reporting tools on the webserver receiving all the traffic (and it is only receiving Locust traffic) report 100,000 to 200,000 RPS during the test (a varnish instance, checked with varnishstat). I also confirmed this with the logs: I truncate them before each test, and Locust is sending all the requests but not reporting the RPS correctly.
EDIT: RPS is shown wrong both in the chart and in the statistics (web or terminal)
Command line
locust -f $FILE --master --master-bind-port 5557 --host "https://" -u 80000 -r 30 --tags hit300
Locustfile contents
Python version
3.12.1
Locust version
2.27.0
Operating system
Debian 12