Open aorith opened 2 months ago
Hi!
Oh, and one more thing: is it only the graphs, or is the command line output also wrong? (either run --headless or add --print-stats)
Hey!
I've been testing with HTTPUser and I get the same result. It's not only the chart: the statistics are also wrong, which is visible when running the master with --print-stats, as you pointed out!
I ran a smaller test (same settings, but limited to 50 users, only 17 workers (1 per node) and the HTTPUser class). I also added --csv $file --csv-full-history so you can see some numbers.
$ awk -F',' '{printf $1","$5","$18"\n"}' out.csv_stats_history.csv
Timestamp,Requests/s,Total Request Count
1715176110,0.000000,0
1715176111,0.000000,0
1715176112,0.000000,0
1715176113,0.000000,0
1715176114,0.000000,0
1715176115,0.000000,0
1715176116,0.000000,0
1715176117,0.000000,0
1715176118,0.000000,0
1715176119,0.000000,0
1715176120,0.000000,0
1715176121,0.000000,0
1715176122,0.000000,0
1715176123,0.000000,0
1715176124,0.000000,0
1715176125,0.000000,0
1715176126,0.000000,0
1715176127,0.000000,0
1715176128,0.000000,0
1715176129,0.000000,0
1715176130,0.000000,0
1715176131,0.000000,0
1715176132,0.000000,408
1715176132,0.000000,408
1715176133,0.000000,781
1715176133,0.000000,781
1715176134,0.000000,830
1715176134,0.000000,830
1715176135,0.000000,3918
1715176135,0.000000,3918
1715176136,36.200000,8341
1715176136,36.200000,8341
1715176137,36.200000,8566
1715176137,36.200000,8566
1715176138,36.200000,13239
1715176138,36.200000,13239
1715176139,97.100000,22190
1715176139,97.100000,22190
1715176140,97.100000,22489
1715176140,97.100000,22489
1715176141,97.100000,27279
1715176141,97.100000,27279
1715176142,447.800000,37087
1715176142,447.800000,37087
1715176143,447.800000,37385
1715176143,447.800000,37385
1715176144,447.800000,42201
1715176144,447.800000,42201
1715176145,815.500000,52127
1715176145,815.500000,52127
1715176146,815.500000,52427
1715176146,815.500000,52427
1715176147,815.500000,57226
1715176147,815.500000,57226
1715176148,1147.200000,67299
1715176148,1147.200000,67299
1715176149,1147.200000,67600
1715176149,1147.200000,67600
1715176150,1147.200000,72432
1715176150,1147.200000,72432
1715176151,1308.400000,82401
1715176151,1308.400000,82401
1715176152,1308.400000,82699
1715176152,1308.400000,82699
1715176153,1308.400000,87482
1715176153,1308.400000,87482
1715176154,1312.300000,97448
1715176154,1312.300000,97448
1715176155,1312.300000,97748
1715176155,1312.300000,97748
1715176156,1312.300000,102587
1715176156,1312.300000,102587
1715176157,1311.800000,112425
1715176157,1311.800000,112425
1715176158,1311.800000,112725
1715176158,1311.800000,112725
1715176159,1311.800000,117587
1715176159,1311.800000,117587
1715176160,1295.600000,127318
1715176160,1295.600000,127318
1715176161,1295.600000,127619
1715176161,1295.600000,127619
1715176162,1295.600000,131377
1715176162,1295.600000,131377
1715176163,1281.100000,137463
1715176163,1281.100000,137463
1715176164,1281.100000,137537
1715176164,1281.100000,137537
1715176165,1281.100000,137537
1715176165,1281.100000,137537
1715176166,1281.100000,137537
1715176166,1281.100000,137537
1715176167,1281.100000,137537
1715176167,1281.100000,137537
1715176168,1281.100000,137537
1715176168,1281.100000,137537
1715176169,1281.100000,137537
1715176169,1281.100000,137537
1715176170,1281.100000,137537
1715176170,1281.100000,137537
1715176171,1281.100000,137537
1715176171,1281.100000,137537
1715176172,1281.100000,137537
1715176172,1281.100000,137537
1715176173,1281.100000,137537
1715176173,1281.100000,137537
1715176174,1281.100000,137537
1715176174,1281.100000,137537
1715176175,1281.100000,137537
1715176175,1281.100000,137537
1715176176,1281.100000,137537
1715176176,1281.100000,137537
1715176177,1281.100000,137537
1715176177,1281.100000,137537
1715176178,1281.100000,137537
1715176178,1281.100000,137537
1715176179,1281.100000,137537
1715176179,1281.100000,137537
1715176180,1281.100000,137537
1715176180,1281.100000,137537
Even if we ignore the warmup and the period before the test even started, and take 1715176110 as the start time and 1715176180 as the end time (70 seconds in total), that gives us 137537 / 70 = ~1964.81 RPS.
If I instead count from the timestamp where I pressed "start", which should be 1715176131, that's 49 seconds, so ~2806.87 RPS.
Or we can look at a subset of the test:
1715176151,1308.400000,82401
1715176152,1308.400000,82699
1715176152,1308.400000,82699
1715176153,1308.400000,87482
1715176153,1308.400000,87482
1715176154,1312.300000,97448
1715176154,1312.300000,97448
1715176155,1312.300000,97748
1715176155 - 1715176151 = 4s
and 97748 - 82401 = 15347 req
so 15347 / 4 = ~3836 RPS
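The per-interval arithmetic above is easy to automate. Here is a small sketch that computes the effective RPS between the first and last rows of a stats-history file; the column indexes are assumptions based on the awk extraction above (timestamp in column 1, Total Request Count in column 18):

```python
import csv

def effective_rps(path, ts_col=0, total_col=17):
    """Compute requests/s between the first and last data rows of a
    locust --csv-full-history stats file. Column indexes (0-based)
    are assumptions matching the awk command above."""
    with open(path) as f:
        rows = list(csv.reader(f))[1:]  # skip the header row
    t0, n0 = int(rows[0][ts_col]), int(rows[0][total_col])
    t1, n1 = int(rows[-1][ts_col]), int(rows[-1][total_col])
    return (n1 - n0) / (t1 - t0)
```

Running it over the file above would show the same mismatch between the reported Requests/s column and the rate implied by the total count.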
I'm not sure if the issue is that some requests never finish; would those requests be counted in the total statistics?
Thank you for looking into this.
If requests never finish (and don't time out either), then they are not logged in Locust at all.
You may want to reduce the timeout (using FastHttpUser.network_timeout and .connection_timeout, default is 60 seconds) and see if that makes a difference.
You can also add the --stop-timeout argument to allow Locust to exit gracefully when shutting down (this could make it even more obvious if something has gone wrong and it is stuck somehow).
Also, can you share some logs (from the master and a worker)?
Sure, I'll share some logs tomorrow, and I'll also review the logs and documentation. In the meantime, I have a couple of questions:
class LoadTest(FastHttpUser):
    connection_timeout = 5
    network_timeout = 10
1) Yes!
2) Yes, use the --expect-workers option
Hi!
I've been trying to reproduce it with 1 or 2 servers, without success. It only happens when I significantly increase the number of worker servers and the overall RPS is higher.
I'm attaching the logs from the following test, I tried not to trigger cpu warnings on any worker:
-u 10000 -r 10 --run-time 45s --expect-workers 360 --stop-timeout 3
FastHttpUser, with connection_timeout = 5 and network_timeout = 10
peak RPS seen on the target server = ~80000
peak RPS on locust = ~7920.7
total requests on locust = 1820750
total requests on the server log = ~2007200
test_start = 1715255664
test_end = 1715255706
runtime = 42s
"average" rps = 1820750 / 42 = 43351
This is with a logging level of INFO, I could do an identical test in DEBUG level and share it if it doesn't contain sensitive data (or just remove it first).
Hi! I don't see anything strange in your logs. Try setting DEBUG log level. If the log becomes overwhelming, try reducing the number of workers. Could it possibly be that workers don't properly report the last requests? (They should, but there could be a bug or something.)
Another thing you might try is installing locust-plugins and using its iteration limit (-i) to see if the numbers match then (https://github.com/SvenskaSpel/locust-plugins/blob/507ff46808d2945f2eea6ed46aa2d69117159ede/examples/cmd_line_examples.sh#L10).
It shouldn't be needed, of course, but it might narrow down where the problem originates.
I enabled DEBUG log level and "reproduced" it again, but I don't see anything wrong in the logs.
I also tried to reproduce it without success locally with a dummy go http server and 18 workers as docker replicas. This test had a peak of 32k RPS and all the numbers matched.
I cannot do many more tests on the real environment; I'm inclined to think that it's related to the network.
Is there a way to "cool down" the test? For example, increase the users for the first X seconds and then decrease them to 0 before ending the run.
You can use a load shape https://docs.locust.io/en/latest/custom-load-shape.html#custom-load-shapes
But --stop-timeout should have been enough (you ran with that when the logs were generated, right?). With that setting, either users should finish their tasks or there would be an error message, saying they had to be interrupted.
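The tick() contract for a custom load shape boils down to stage selection by elapsed time: return (user_count, spawn_rate) while the test should keep running, or None to end it. Here's a minimal sketch of that selection logic as a plain function that a LoadTestShape.tick() implementation could delegate to; the stage boundaries and user counts are made-up illustrative values:

```python
# Illustrative ramp-up / hold / ramp-down stages: (end_time_s, users, spawn_rate)
STAGES = [
    (30, 50, 10),  # until t=30s: ramp to 50 users
    (60, 50, 10),  # until t=60s: hold at 50
    (90, 0, 10),   # until t=90s: ramp back down to 0
]

def tick(run_time, stages=STAGES):
    """Return (user_count, spawn_rate) for the current elapsed time,
    or None to end the run -- the contract LoadTestShape.tick() uses."""
    for end_time, users, spawn_rate in stages:
        if run_time < end_time:
            return users, spawn_rate
    return None
```

Inside a real LoadTestShape subclass you would call self.get_run_time() and return this function's result from tick(), as described in the custom-load-shape docs linked above.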
Ok, I think I found out what was happening, and it's not Locust's fault.
I was checking how the stats are computed: https://github.com/locustio/locust/blob/58515c41e2a07d3aa55772c5cb2b88bbbdf75057/locust/stats.py#L313-L314 And I was messing around with this locustfile in the docker environment to see how the stats reporting worked:
import json
import logging
import time
from collections import Counter
from pathlib import Path

from locust import FastHttpUser, events, task

logger = logging.getLogger(__name__)
reports = Counter()

@events.report_to_master.add_listener
def on_report_to_master(client_id, data, **_):
    reports[client_id] += 1
    data["report_timestamp"] = time.time()
    data["report_num"] = reports[client_id]

@events.worker_report.add_listener
def on_worker_report(client_id, data, **_):
    out = Path(f"/mnt/locust/logs/{data['report_num']}_{client_id}.json")
    out.write_text(json.dumps(data))

@events.quit.add_listener
def on_quit(**_):
    for k, v in reports.items():
        out = Path(f"/mnt/locust/logs/{k}.reports")
        out.write_text(str(v))

class MyUser(FastHttpUser):
    connection_timeout = 5
    network_timeout = 10

    @task
    def hit_webserver(self):
        self.client.get("/")
Since each worker sends its RPS stored in a dict keyed by timestamp, and that timestamp comes from each worker server's own clock, I checked the date on all 18 servers. I found that 4 of them had an offset of more than 1-2 seconds, with one server being 160 seconds behind!
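To illustrate why a skewed worker clock corrupts the aggregate: if the master merges each worker's per-second request counts by timestamp key, a worker whose clock is 160 seconds behind puts its counts into buckets nobody else is writing to, so no single bucket ever shows the combined rate. A toy sketch of such a merge (this data layout is a simplification for illustration, not Locust's actual wire format):

```python
from collections import Counter

def merge_per_second(*worker_counts):
    """Merge per-worker {epoch_second: request_count} dicts by
    summing counts per timestamp key, the way an aggregator would."""
    total = Counter()
    for counts in worker_counts:
        total.update(counts)
    return dict(total)

# Two workers each doing 100 req/s over the same 3 real seconds,
# but worker B's clock is 160 s behind:
worker_a = {1000: 100, 1001: 100, 1002: 100}
worker_b = {840: 100, 841: 100, 842: 100}

merged = merge_per_second(worker_a, worker_b)
# The real combined load was 200 req/s, but every merged bucket
# shows only 100, so the per-second rate appears halved.
```

The total request count still adds up correctly (it isn't keyed by time), which matches the symptom here: totals were right, per-second rates were wrong.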
If this is the case, in theory we should be able to reproduce it in a docker environment using https://github.com/wolfcw/libfaketime, setting an offset for each worker and overriding the date before reporting. I'm not sure if something like this would work.
I've now synced all the servers and ran a couple of tests that previously reported incorrect RPS and now the numbers seem to match. Does it make sense to you? If it does feel free to close this.
Thank you.
There used to be a feature that detected time difference between master and worker, but it had to be disabled (4+ years ago) due to false positives. I’ll have a look at reintroducing it!
Prerequisites
Description
The test is executed across 17 machines with a total of 170 workers. I've done this in the past and the RPS chart was accurate (I think Locust was on 1.x.x back then). All the servers, including the master, have version 2.27.0.
For context, the chart shows a maximum of 14,000 RPS; however, the real-time reporting tools on the webserver receiving all the traffic (and it is only receiving Locust traffic) report 100,000 to 200,000 RPS during the test (a varnish instance, checked with varnishstat). I also confirmed this with the logs: I truncate them before each test, and Locust is sending all the requests but not reporting the RPS correctly.
EDIT: RPS is shown wrong both in the chart and in the statistics (web or terminal)
Command line
locust -f $FILE --master --master-bind-port 5557 --host "https://" -u 80000 -r 30 --tags hit300
Locustfile contents
Python version
3.12.1
Locust version
2.27.0
Operating system
Debian 12