locustio / locust

Write scalable load tests in plain Python 🚗💨
https://locust.cloud
MIT License

Possible memory leak ? #2933

Open NicoAdrian opened 2 weeks ago

NicoAdrian commented 2 weeks ago

Description

Memory usage keeps growing during the test, eventually causing my system to run out of RAM. The increase stops once all the users have been spawned (I launch a test with 30k users). Stopping the test doesn't free the memory, and neither does launching another one. Launching another test under the same conditions doesn't increase RAM usage any further, however. I'm using FastHttpClient, FYI.

Command line

/usr/local/bin/locust --processes -1 --class-picker --web-host 0.0.0.0 --logfile=/var/log/locust

Locustfile contents

@task
def req(self):
    t0 = time()
    with self.client.get(self.url, headers={"User-Agent": self.ua}, catch_response=True) as resp:
        response_time = time() - t0
        if response_time > 2:
            resp.failure(f"Request took too long: {response_time:.3f}")
        if resp.status_code >= 400:
            self.errors += 1
            resp.failure(f"Got HTTP {resp.status_code}")
            if self.errors == 5:
                logging.warning("Too many errors, stopping user")
                raise StopUser
        else:
            self.errors = 0

Python version

3.11

Locust version

2.31.8

Operating system

Linux 6.1.97-104.177.amzn2023.x86_64

AdityaS8804 commented 2 days ago

Hey, how about we modify FastHttpSession in fasthttp.py to explicitly call self.client.close() when the test stops? That way, any remaining HTTP connections or data held in memory would be freed.
I'd like to contribute to this issue. Please let me know if this is a valid approach. Your input is much appreciated.

cyberw commented 2 days ago

I think that makes a lot of sense. A user can never come back to life when it is stopped, so it should close its connection and clean up any resources asap.

Not sure exactly where to implement it though. on_stop isn't good, because it could be overridden in a subclass. __del__() could work, but I'm not sure it happens soon enough (and at that time we're probably close to closing the connection anyway). But you're welcome to try it out!
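To illustrate the concern about on_stop, here is a minimal sketch of the general pattern, not Locust's actual implementation. FakeSession, BaseUser, CustomUser and the run() method are hypothetical stand-ins invented for this example. The idea is that cleanup placed in a hook can be lost when a subclass overrides the hook, while cleanup in a try/finally inside a framework-owned method still runs.

```python
# Illustrative stand-ins only: these classes are NOT Locust's real User or
# FastHttpSession. They just mimic the shape of the problem.


class FakeSession:
    """Hypothetical session object that owns a connection pool."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class BaseUser:
    def __init__(self):
        self.client = FakeSession()

    def on_start(self):
        pass

    def on_stop(self):
        # Cleanup placed here is skipped if a subclass overrides on_stop
        # without calling super().on_stop().
        pass

    def run(self):
        """Framework-owned entry point that subclasses are not expected to override."""
        try:
            self.on_start()
        finally:
            try:
                self.on_stop()
            finally:
                # Runs even if on_start/on_stop raise or are overridden.
                self.client.close()


class CustomUser(BaseUser):
    def on_stop(self):
        # Overrides the hook and never calls super().on_stop().
        print("custom teardown")


user = CustomUser()
user.run()
print(user.client.closed)  # True: the session was closed despite the override
```

Whether an equivalent hook point exists in Locust's user lifecycle is exactly the open question in the comment above, so treat this as a shape to aim for rather than a patch.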

NicoAdrian commented 2 days ago

OK, fine, but how can 30k users take ~30 GB of RAM? Has anyone experienced this before? I mean, is this the expected behaviour?
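For scale, a quick back-of-the-envelope check of those figures, assuming decimal gigabytes and that all of the reported memory is attributable to the spawned users (both are assumptions, not measurements):

```python
total_ram_bytes = 30 * 10**9  # ~30 GB as reported (decimal gigabytes assumed)
users = 30_000                # users spawned in the test

per_user_mb = total_ram_bytes / users / 10**6
print(f"{per_user_mb:.1f} MB per user")  # prints "1.0 MB per user"
```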

cyberw commented 2 days ago

Oh no, that's definitely not the expected behaviour. If you can give me a minimal example that reproduces this I'll have a look. Can you see the same behaviour on any other platforms?

If you weren't stopping/starting tons of users, it's unlikely to be resolved by explicitly closing sessions either.

There's one thing you might want to look into. This may create a lot of unique failures, which is bad because they are stored individually (probably not 30GB of data, but still :)

resp.failure(f"Request took too long: {response_time:.3f}")
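A rough sketch of why that message matters, assuming failures are keyed on their message text as the comment above suggests. The response times are made-up data and the sets only stand in for whatever structure actually stores the failures.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Simulated response times (seconds) for 10,000 slow requests. Made-up data.
response_times = [random.uniform(2.0, 10.0) for _ in range(10_000)]

# Message embeds the exact duration: most slow requests produce a new string.
unique_msgs = {f"Request took too long: {rt:.3f}" for rt in response_times}

# Constant message: every slow request collapses into a single entry.
constant_msgs = {"Request took too long (> 2 s)" for _ in response_times}

print(len(unique_msgs), len(constant_msgs))
```

The first count is in the thousands while the second is 1, so keeping the duration out of the failure message (or rounding it coarsely) keeps the number of stored failure entries bounded.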

NicoAdrian commented 2 days ago

> Oh no, that's definitely not the expected behaviour. If you can give me a minimal example that reproduces this I'll have a look. Can you see the same behaviour on any other platforms?
>
> If you weren't stopping/starting tons of users, it's unlikely to be resolved by explicitly closing sessions either.
>
> There's one thing you might want to look into. This may create a lot of unique failures, which is bad because they are stored individually (probably not 30GB of data, but still :)
>
> resp.failure(f"Request took too long: {response_time:.3f}")

I can confirm I don't stop a lot of users (about a dozen out of 30k). I will try commenting out the "Request took too long" line, if that helps. Here is my full locustfile.py (somewhat edited for business reasons):

EDIT: I can't test this on other platforms, just Linux (CentOS).

import datetime
import logging
import re
from random import random
from time import sleep, time
from urllib.parse import quote

from locust import FastHttpUser, constant_pacing, events, task
from locust.exception import StopUser

PROXY_HOST = "someproxy.net"
PROXY_PORT = 8080

@events.init_command_line_parser.add_listener
def on_init_command_line_parser(parser):
    parser.add_argument("--test-id", default=datetime.datetime.now().strftime("%Y-%m-%d %H:%M"), help="Test ID")
    parser.add_argument("--env", default="blue", choices=["blue", "green"], help="Environment (blue/green)")
    parser.add_argument("--use-proxy", action="store_true")

class BaseUser(FastHttpUser):
    abstract = True
    network_timeout = 15

    def __init__(self, environment):
        if environment.parsed_options.use_proxy is True:
            self.proxy_host = PROXY_HOST
            self.proxy_port = PROXY_PORT
        super().__init__(environment)
        self.errors = 0
        self.host = self.host.format(env=environment.parsed_options.env)
        self.ua = f"Mozilla/5.0 Test_perf_test_id_{environment.parsed_options.test_id}__{int(random() * 10**16)}"

    @task
    def req(self):
        t0 = time()
        with self.client.get(self.url, headers={"User-Agent": self.ua}, catch_response=True) as resp:
            response_time = time() - t0
            if response_time > 2:
                resp.failure(f"Request took too long: {response_time:.3f}")
            if resp.status_code >= 400:
                self.errors += 1
                resp.failure(f"Got HTTP {resp.status_code}")
                if self.errors >= 5:
                    logging.warning("Too many errors, stopping user")
                    raise StopUser
            else:
                self.errors = 0

class DashUser(BaseUser):
    abstract = True
    wait_time = constant_pacing(2)

    def on_start(self):
        sleep(random())
        with self.client.get(
            f"{self.host}?{self.query}",
            headers={"User-Agent": self.ua},
            allow_redirects=False,
            catch_response=True,
        ) as resp:
            if resp.status_code != 302:
                resp.failure(f"on_start failed: {resp.status_code}")
                raise StopUser
            else:
                self.url = f"{self.host}/" + resp.headers["Location"]

class SomeUser(DashUser):
    # just strings
    query = "foo=bar"
    pass

cyberw commented 2 days ago

I need you to narrow it down further. Remove everything in that locustfile that isn't needed to reproduce the issue. Does it happen with just a basic FastHttpUser making a single request? If there is no problem then, keep adding things back until you see it again.

I’m assuming there are no errors logged?