nautobot / nautobot-app-golden-config

Golden Configuration App for Nautobot.
https://docs.nautobot.com/projects/golden-config/en/latest/

Potential memory leak / high memory consumption #467

Closed · viktorkertesz closed this issue 1 year ago

viktorkertesz commented 1 year ago

Environment

When running Golden Config tasks on a relatively large number of devices (>1000), we ran out of memory on our worker pods and the task was killed by the OS. I need some advice on where to look in order to solve this. I actually have two problems with the behavior above:

  1. When the task is killed with SIGKILL (signal 9), nothing is left to notify Nautobot via Redis/the database, so the task stays listed as Running.
  2. The logs (console/db) give no indication of the rapid memory consumption, and I don't know how to debug these jobs efficiently (a memory-logging sketch follows this list).
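
For the second point, a cheap way to get some visibility would be to log the worker's peak resident memory at checkpoints inside the job. A minimal sketch using only the standard library; the logger and the checkpoint names are illustrative, and on Linux ru_maxrss is reported in kilobytes:

import logging
import resource

logger = logging.getLogger(__name__)

def log_memory(checkpoint):
    """Log the peak resident set size of the current worker process."""
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logger.info("memory checkpoint %s: peak RSS %.1f MB", checkpoint, peak_kb / 1024)

# Example: call at interesting points of a job run.
log_memory("after inventory build")
log_memory("after template rendering")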

Steps to Reproduce

  1. I select a region with devices (more than 1,000).
  2. I run any Golden Config job (generating intended configurations is the easiest way to reproduce, since it does not touch the network).

When I run the same job for a small number of devices, it works fine.

Expected Behavior

Memory consumption shouldn't be this high; it should scale roughly linearly with the number of targeted devices.

Observed Behavior

I watched CPU and memory consumption while the job was running.

  1. In the initial phase the main worker process collects the inventory. This takes a while, considering the amount of data stored for the devices.
  2. When the Nornir task kicks in, I see threads start working on template rendering and the CPU tops out, no problem with that. Memory consumption starts building up slowly.
  3. After around 200-300 devices are logged as completed, memory consumption starts increasing rapidly, roughly 300 MB/s, until it reaches the allocated memory (8 GB) and the OS kills the worker process along with all its threads (see the attached memory graph, intended_memory, and the tracemalloc sketch below).
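
To see which call sites the growth is coming from, the job could be instrumented with tracemalloc and two snapshots compared; a minimal sketch, with the snapshot placement purely illustrative:

import tracemalloc

tracemalloc.start()

# ... let the job process the first few hundred devices ...
snapshot_before = tracemalloc.take_snapshot()

# ... let the job process the next batch, where the rapid growth is observed ...
snapshot_after = tracemalloc.take_snapshot()

# Print the ten source locations whose allocations grew the most in between.
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:10]:
    print(stat)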

Log on the POD:

[2023-04-12 10:41:18,042: INFO/ForkPoolWorker-2] Executing dispatcher for XXXXXX-XXXXXX (ios)
[2023-04-12 10:41:18,080: INFO/ForkPoolWorker-2] Found driver nornir_nautobot.plugins.tasks.dispatcher.cisco_ios.NautobotNornirDriver
[2023-04-12 10:41:18,103: INFO/ForkPoolWorker-2] Found driver nornir_nautobot.plugins.tasks.dispatcher.cisco_ios.NautobotNornirDriver
[2023-04-12 10:41:18,844: ERROR/MainProcess] Process 'ForkPoolWorker-2' pid:29 exited with 'signal 9 (SIGKILL)'
jeffkala commented 1 year ago

@viktorkertesz it looks like you're using a custom queue for this job. Since the screenshot cuts off the rest of the celery command, I'm curious what other configuration you're using for this celery queue?

viktorkertesz commented 1 year ago

@jeffkala thanks for looking at it. I also tried the default queue worker that runs alongside nautobot-server and hit the same problem. It has even fewer resources, which is why we run this job on our external worker. I haven't seen any additional logs there.

Here is how we run it:

/usr/local/bin/nautobot-server celery worker --loglevel INFO --queues $QUEUES --max-memory-per-child 2000000 --concurrency 2 &

QUEUES is currently set to long, the queue where we intend to run long-running jobs.
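
As a side note, Celery interprets --max-memory-per-child in KiB, so 2000000 is roughly 2 GB, and the pool only replaces a child after the running task finishes, so it does not guard against a single task growing past the container limit. The same limit could also be expressed next to the CELERY_* settings below; a sketch, assuming Nautobot forwards CELERY_-prefixed settings to the worker the same way it does for the other settings shown here:

# Equivalent to `--max-memory-per-child 2000000` on the worker command line.
# Value is in KiB, so this is roughly 2 GB per pool worker child.
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 2000000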

Relevant configuration items from nautobot_config.py:

REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
RQ_QUEUES = {
    "default": {
        "HOST": REDIS_HOST,
        "PORT": 6379,
        "DB": 0,
        "PASSWORD": REDIS_PASSWORD,
        "SSL": False,
        "DEFAULT_TIMEOUT": 300,
    },
    "webhooks": {
        "HOST": REDIS_HOST,
        "PORT": 6379,
        "DB": 0,
        "PASSWORD": REDIS_PASSWORD,
        "SSL": False,
        "DEFAULT_TIMEOUT": 300,
    },
    "check_releases": {
        "HOST": REDIS_HOST,
        "PORT": 6379,
        "DB": 0,
        "PASSWORD": REDIS_PASSWORD,
        "SSL": False,
        "DEFAULT_TIMEOUT": 300,
    },
    "custom_fields": {
        "HOST": REDIS_HOST,
        "PORT": 6379,
        "DB": 0,
        "PASSWORD": REDIS_PASSWORD,
        "SSL": False,
        "DEFAULT_TIMEOUT": 300,
    },
}

# Nautobot uses Cacheops for database query caching. These are the following defaults.
# For detailed configuration see: https://github.com/Suor/django-cacheops#setup

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/1",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
    }
}
CACHEOPS_REDIS = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/1"

CELERY_BROKER_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/0"

CELERY_TASK_SOFT_TIME_LIMIT = 12 * 60 * 60

CELERY_TASK_TIME_LIMIT = 16 * 60 * 60

CELERY_RESULT_BACKEND = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/0"

Please tell me if this is not enough or if you are curious about other parts of the configuration!

Thanks!!

viktorkertesz commented 1 year ago

Hello NTC, it turned out to be an error in one of our own jobs. The problem was that a job unintentionally overwrote the Nornir processor setting with the retry processor from the salt package, which affected all other plugins and jobs. This was a mistake on our part, but it also raises a design consideration for plugin settings: they should be read-only, I think. Correct me if I'm wrong!

I think this issue can be closed, but it might be worth looking at why PLUGIN_SETTINGS is global across all jobs and, moreover, writable globally. Taking a copy of it at each job start would be an easy fix, I think.

Thanks!

Viktor
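
For illustration, a minimal sketch of the copy-on-start idea described above, assuming a hypothetical job that wants a custom retry processor; the "processor" key and its value are illustrative, not the app's actual settings schema:

import copy

from django.conf import settings

# Module-level plugin settings are shared by every job in the worker process.
shared_settings = settings.PLUGINS_CONFIG["nautobot_golden_config"]

# Problematic pattern: mutating the shared dict leaks the change into every
# other plugin and job running in the same worker.
# shared_settings["processor"] = "my_retry_processor"

# Safer pattern: take a private copy at job start and modify only that copy.
job_settings = copy.deepcopy(shared_settings)
job_settings["processor"] = "my_retry_processor"  # hypothetical key/value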

jeffkala commented 1 year ago

Thanks @viktorkertesz, will close out for now. I'm not following why PLUGIN_SETTINGS would be overwritten in general. Was this something one of your plugins was doing?