locustio / locust

Write scalable load tests in plain Python 🚗💨
MIT License

Possible performance regression in the latest versions of locust #2690

Closed morrisonli76 closed 1 month ago

morrisonli76 commented 5 months ago

Prerequisites

Description

I used to use Amazon Linux 2 as the base OS for my load tests. Because the Python available on that OS is 3.7, the latest locust I could get was 2.17.0. With 5 c5n.xlarge EC2 instances (each with 4 vCPUs) as workers, I could spawn 1200 users. The wait_time for the test was set to constant_throughput(1) so that a total load of 1200 rps could be achieved.

Recently, I updated the base OS to Amazon Linux 2023. The Python version became 3.11, so I could use the latest version of locust, 2.26.0. However, the above setup (5 c5n.xlarge EC2 instances) could not provide the desired load: it could only spawn about 830 users in total, and the total rps was only around 330 even though the wait_time was still constant_throughput(1). I noticed that the CPU usage of each worker process was already close to 100%.

The server being tested did not change, and the same locustfile was used for both tests. However, the performance difference between the two locust setups was night and day. This looks like a regression.

Here is the output of pip list in the Python 3.11 environment:

Package Version
blinker 1.7.0
Brotli 1.1.0
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
ConfigArgParse 1.7
Flask 3.0.3
Flask-Cors 4.0.0
Flask-Login 0.6.3
gevent 24.2.1
geventhttpclient 2.2.1
greenlet 3.0.3
idna 3.7
itsdangerous 2.2.0
Jinja2 3.1.3
locust 2.26.0
MarkupSafe 2.1.5
msgpack 1.0.8
pip 22.3.1
psutil 5.9.8
pyzmq 26.0.2
requests 2.31.0
roundrobin 0.0.4
setuptools 65.5.1
urllib3 2.2.1
Werkzeug 3.0.2
zope.event 5.0
zope.interface 6.3

Command line

master side: locust -f /opt/locustfile.py --master
worker side: locust -f - --worker --master-host --processes -1

Locustfile contents

import random

from locust import HttpUser, task, constant_throughput

# Note: generate_event_id() and the custom command-line options (pixel_ids,
# verify_cert, event_name, path) are defined elsewhere in the original
# locustfile and are not shown here.


class QuickstartUser(HttpUser):
    def on_start(self):
        self.pixel_ids = self.environment.parsed_options.pixel_ids.split(",")
        self.client.verify = self.environment.parsed_options.verify_cert.lower() == "true"

    @task
    def cloudbridge(self):
        pixel_id = random.choice(self.pixel_ids)
        event_body = {
            "fb.pixel_id": pixel_id,
            "event_id": generate_event_id(),
            "event_name": self.environment.parsed_options.event_name,
            "conversion_value": {
                "value": "9",
                "currency": "USD",
            },
        }
        self.client.post(self.environment.parsed_options.path, json=event_body, name="event")
        # Closing the session means every task iteration opens a new connection
        # (and performs a new TLS handshake).
        self.client.close()

    wait_time = constant_throughput(2)

Python version

3.11

Locust version

2.26.0

Operating system

Amazon Linux 2023

cyberw commented 5 months ago

Hmm... There IS a known performance regression with OpenSSL 3.x (which usually comes in with Python 3.12, but maybe your Python build is different somehow?), see https://github.com/locustio/locust/issues/2555

The issue hits tests which close/reopen the connection especially hard (as the problem arises during SSL negotiation).

Can you check which SSL version you are running?

python -c "import ssl; print(ssl.OPENSSL_VERSION)"

As a workaround, see if you can run another Python version, or keep connections alive (I know, less realistic, but better than nothing).
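A minimal sketch of the keep-alive variant, assuming the same kind of locustfile as above (the host, path, and payload below are placeholders, not values from the original file):

from locust import HttpUser, task, constant_throughput


class KeepAliveUser(HttpUser):
    host = "https://example.com"        # placeholder target
    wait_time = constant_throughput(1)

    @task
    def send_event(self):
        # No self.client.close() here: the underlying requests session keeps the
        # connection pooled, so each iteration reuses the same TCP/TLS connection
        # instead of paying for a new handshake every time.
        self.client.post("/event", json={"value": "9", "currency": "USD"}, name="event")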

morrisonli76 commented 5 months ago

Hi, I switched to Ubuntu 20.04 on Amazon EC2. I managed to install Python 3.10 and the latest locust.

The CPU usage dropped. However, the throughput did not follow the constant_throughput(1) setting: 1500 users only gave me less than 800 rps.

Here is my python env:

(locust_env) ubuntu@ip-172-31-10-204:~$ locust -V
locust 2.26.0 from /opt/locust_env/lib/python3.10/site-packages/locust (python 3.10.14)
(locust_env) ubuntu@ip-172-31-10-204:~$ python3.10 -m pip list

Package Version
blinker 1.8.1
Brotli 1.1.0
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
ConfigArgParse 1.7
Flask 3.0.3
Flask-Cors 4.0.0
Flask-Login 0.6.3
gevent 24.2.1
geventhttpclient 2.2.1
greenlet 3.0.3
idna 3.7
itsdangerous 2.2.0
Jinja2 3.1.3
locust 2.26.0
MarkupSafe 2.1.5
msgpack 1.0.8
pip 24.0
psutil 5.9.8
pyzmq 26.0.2
requests 2.31.0
roundrobin 0.0.4
setuptools 69.5.1
tomli 2.0.1
urllib3 2.2.1
Werkzeug 3.0.2
wheel 0.43.0
zope.event 5.0
zope.interface 6.3

cyberw commented 4 months ago

Hi! Did you check your ssl version?

python -c "import ssl; print(ssl.OPENSSL_VERSION)"

morrisonli76 commented 4 months ago

Yes, I did. In fact I used Ubuntu 20.04, which uses OpenSSL 1.1.1f, and I also updated Python to 3.10. With this setup the CPU usage was lower; however, even with wait_time = constant_throughput(1) for the test user, 1500 users only gave me less than 800 rps (as I mentioned in my previous reply). I did not see this issue when I used locust 2.17.0.

cyberw commented 4 months ago

What are your response times like? Wait times can only limit throughput, not increase it, so if a task takes more than 1 s to complete you won't get 1 request/user/s.
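As a rough back-of-the-envelope illustration of that point (the numbers below are hypothetical, not from this test):

# Hypothetical numbers: how a slow task caps total throughput
users = 1500
target_rps_per_user = 1.0      # constant_throughput(1) asks for at most 1 req/s per user
avg_response_time_s = 2.0      # suppose a task iteration takes 2 s on average
effective_rps_per_user = min(target_rps_per_user, 1.0 / avg_response_time_s)
print(users * effective_rps_per_user)  # 750.0 total rps, well short of 1500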

morrisonli76 commented 4 months ago

The average response time is less than 700 ms. Also, when I used an older version of locust (e.g. 2.17.0), I did not have this issue.

cyberw commented 4 months ago

Hmm... the only thing I can think of is that Amazon is throttling somehow. What if you skip closing the session/connection? Can you see how many DNS lookups are made (using tcpdump or something else)? If you close the session, then maybe there is a new DNS lookup for each task iteration.
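One generic way to watch for per-request DNS lookups on a worker host (a standard tcpdump invocation offered as a suggestion, not a command taken from this thread):

sudo tcpdump -ni any udp port 53

Each printed query line is one lookup; with a local DNS cache and kept-alive connections you should see far fewer lookups than requests.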

morrisonli76 commented 4 months ago

I can take a look to see whether there are new DNS lookups. However, with the same target server and the same tests, why didn't locust 2.17.0 have this issue? Were there any major changes to the connection logic?

cyberw commented 4 months ago

Not that I can think of :-/ But does 2.17.0 not exhibit this problem on python 3.11/Amazon Linux 2023?

morrisonli76 commented 3 months ago

Just reporting back. I changed my system combination: I am now using Amazon Linux 2 with Python 3.10, and the SSL version is 1.1.1g. I also followed the instructions at https://repost.aws/knowledge-center/dns-resolution-failures-ec2-linux to enable a local DNS cache. With this setup, the latency is much lower and the CPU usage per worker is low as well.

However, even with this setup, the RPS does not hold. I ran a test with 1200 users, each with a constant_throughput(1) request rate, and the RPS was quite far from 1200: it stopped around 800 and then started to drop on its own.

cyberw commented 3 months ago

What are the response times? If a task takes more than the constant_pacing time, you’ll get falling throughput.

morrisonli76 commented 3 months ago

I tried running locust 2.17 on the exact same OS (Amazon Linux 2 with Python 3.10), and it showed the same issue. I think the problem is on the load-test side, because the server being tested is the same. I suspect something in the OS environment is slowing down the connections.

However, one thing I don't understand: once the number of users reaches the target, the rps cannot reach the expected value, starts to drop, and eventually falls to a very low number. It seems like locust loses control of creating new connections.

I have enabled the local DNS cache. Is there anything else you would suggest I try?

Thanks

cyberw commented 3 months ago

The main thing I would like to investigate is the receiving end. Is there some throttling going on? How many locust workers are you using? Are they spread out over multiple machines? Are they passing through a NAT?

However, one thing I don't understand: once the number of users reaches the target, the rps cannot reach the expected value, starts to drop, and eventually falls to a very low number. It seems like locust loses control of creating new connections.

Again I ask: What are your response times? If response times increase enough, you'll get falling RPS. Nothing to do with Locust, it is just math: If you have a certain number of concurrent users and response times go up you'll get falling throughput.
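To put hypothetical numbers on that: 1200 users that each wait for a response before issuing the next request can complete at most 1200 / response_time requests per second, so if the average response time climbs from 0.7 s (a ceiling of roughly 1700 rps) to 3 s, the ceiling falls to 400 rps regardless of the wait_time setting.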

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 10 days.

morrisonli76 commented 1 month ago

I just got the latest locust, 2.31. Everything else was the same, and the above issue was resolved. Were there any major improvements in 2.31?

cyberw commented 1 month ago

There was a performance fix in requests 2.32.0, but it should really only be needed for openssl 3.x, which you didn't have :) https://github.com/psf/requests/releases/tag/v2.32.0

But it's nice that it works for you now :) OK to close?

cyberw commented 1 month ago

Or maybe what you were experiencing was a version of this: https://github.com/locustio/locust/issues/2812 ? That was fixed in Locust 2.31.