locustio / locust

Request latencies are higher when spawning users faster, even for same number of users / requests/s #2767

Open nioncode opened 3 months ago

nioncode commented 3 months ago

Description

I'm running Locust in distributed mode on 4 VMs with 4 workers each, i.e. 16 workers, and store the stats history via the --csv option to generate our own graphs from it afterwards. While playing around with Locust I noticed that it seems to make a difference how fast users are spawned: when spawning users faster, the response times are higher, even for the same number of requests/s.

E.g. consider these two scenarios:

  1. Running with -r 10 results in a 99% response time of 22ms for 300 users and ~220 requests/s
  2. Running with -r 50 results in a 99% response time of 58ms for 450 users and ~220 requests/s (for 300 users there are only ~103 requests/s and still a 99% response time of 60ms)

If you check the attached CSV files, you can easily see that the response times are far worse for the -r 50 test run across almost every dimension, even at very low user counts / requests/s. When increasing to -r 100 or higher, the problem gets even worse. Am I doing something wrong, or is this expected in some way?

test_stats_history_r10.csv test_stats_history_r50.csv
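
To compare the two attached runs at similar load levels, something along these lines could quantify the gap. This is a minimal sketch (not from the issue), assuming pandas is available and that the stats_history CSVs use Locust's usual columns ("Name", "Requests/s", "99%"), with "Aggregated" rows holding the totals per snapshot:

# compare_runs.py - rough comparison of the two attached stats_history CSVs
# (column names below are assumed to match Locust's standard stats_history output)
import pandas as pd

def load(path):
    df = pd.read_csv(path)
    if "Name" in df.columns:
        # keep only the aggregated per-snapshot rows
        df = df[df["Name"] == "Aggregated"].copy()
    for col in ("Requests/s", "99%"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.dropna(subset=["Requests/s", "99%"])

r10 = load("test_stats_history_r10.csv")
r50 = load("test_stats_history_r50.csv")

# bucket snapshots by throughput so the two runs are compared at similar requests/s
for label, df in (("-r 10", r10), ("-r 50", r50)):
    bucket = (df["Requests/s"] // 50) * 50
    print(f"{label}: median 99th percentile response time per requests/s bucket")
    print(df.groupby(bucket)["99%"].median().to_string(), "\n")

If the -r 50 run shows higher 99th percentiles even in the low requests/s buckets, that supports the observation that spawn rate, not throughput, is the differentiating factor.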

Maybe related question: should there be any difference between the following two scenarios:

I've seen much worse results from the first option compared to the second one (but the first one mimics our real-world use case better). How can I find out what the problem is here? It seems to be caused by either Locust or the network rather than our server, since both result in 5k requests/s. I have a feeling that network connections are routinely dropped and re-established, but I have no idea whether this is correct.
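
One way to narrow this down is to log every request's response time together with a timestamp and the worker-local user count, then check whether the high latencies occur only while users are still being spawned or persist in steady state. A minimal sketch using Locust's request/test_start/test_stop event hooks (the listener names and output file name are made up for illustration; each worker process writes its own file):

# Append to the locustfile: per-request timing log for separating ramp-up from steady state.
import csv
import os
import time

from locust import events

_log_file = None
_writer = None
_env = None

@events.test_start.add_listener
def _open_timing_log(environment, **kwargs):
    global _log_file, _writer, _env
    _env = environment
    _log_file = open(f"request_timings_{os.getpid()}.csv", "w", newline="")
    _writer = csv.writer(_log_file)
    _writer.writerow(["timestamp", "local_user_count", "name", "response_time_ms", "error"])

@events.request.add_listener
def _log_request(request_type, name, response_time, response_length, exception, **kwargs):
    if _writer is None:
        return
    users = _env.runner.user_count if _env and _env.runner else ""
    _writer.writerow([time.time(), users, name, response_time, repr(exception) if exception else ""])

@events.test_stop.add_listener
def _close_timing_log(environment, **kwargs):
    if _log_file is not None:
        _log_file.close()

Plotting response_time_ms against timestamp (and against local_user_count) should show whether the -r 50 penalty is confined to the ramp-up window or remains after all users are running.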

Command line

locust -f test.py --headless --master --expect-workers 16 -u 5000 -r 10 --run-time 60s --csv test -H https://my-host

Locustfile contents

from locust import HttpUser, task, between
import uuid

class Health(HttpUser):
    wait_time = between(0.5, 1.5)  # do a tx roughly every 1s

    @task
    def query_health(self):
        self.client.get(f"/api/v0/health", verify=False)

Python version

3.12.3

Locust version

2.29.0

Operating system

Ubuntu 24.04

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 10 days.

nioncode commented 1 month ago

Still relevant to me.

cyberw commented 1 month ago

Hi @nioncode ! Sorry for not replying, I must have missed it.

This could be caused by any number of issues outside of Locust's control (max connection count on server or worker side, throttling new connections in load balancer etc).

If you can reproduce this in a way that rules out server side issues and I can run myself (like a local nginx instance or something) I'd be happy to take a look.
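
For what it's worth, a self-contained target along these lines might be enough to take the real server and network out of the equation; a minimal sketch (not from the issue) using only the Python standard library, with a hypothetical handler that mimics the /api/v0/health endpoint from the locustfile:

# local_target.py - tiny stand-in server for a local reproduction attempt
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/v0/health":
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging so the stand-in server is not the bottleneck

if __name__ == "__main__":
    # threaded server so concurrent users do not serialize on one connection
    ThreadingHTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()

Running the same locustfile against http://localhost:8080 (without TLS) with -r 10 versus -r 50 would show whether the latency difference survives with no network or server-side variables involved.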

nioncode commented 2 weeks ago

I probably can't set up such an easy to reproduce setup, since I run this distributed across multiple workers + target VMs in Google Cloud Platform (without a load balancer, directly accessing the VMs over their public IP).

The number of connections etc. should all be the same for the same number of users (right?), since each user uses their own connection.

cyberw commented 2 weeks ago

> I probably can't set up such an easy to reproduce setup, since I run this distributed across multiple workers + target VMs in Google Cloud Platform (without a load balancer, directly accessing the VMs over their public IP).

Without a solid way to reproduce this I can't investigate. Most of the time these issues are caused by things outside of Locust's control - often by an actual performance issue in the system you are testing :)

> The number of connections etc. should all be the same for the same number of users (right?), since each user uses their own connection.

Yes.
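
For completeness: if per-user connections (and connection setup during a fast ramp-up) are suspected, Locust also supports sharing a single urllib3 PoolManager across all users by setting the pool_manager class attribute on HttpUser (see the connection-pooling section of the Locust docs). A minimal sketch adapted from that example; the maxsize value is an arbitrary assumption, and TLS-verification details may need adjusting for the https target:

from locust import HttpUser, task, between
from urllib3 import PoolManager

class Health(HttpUser):
    wait_time = between(0.5, 1.5)

    # Share one connection pool across all users of this class instead of one pool per user.
    # maxsize/block are illustrative values, not taken from the issue.
    pool_manager = PoolManager(maxsize=100, block=True)

    @task
    def query_health(self):
        self.client.get("/api/v0/health", verify=False)

If the -r 10 / -r 50 difference disappears with a shared pool, connection churn during ramp-up becomes a more likely explanation than the server itself.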