@viktorkertesz it looks like you're using a custom queue for this job. Since the screenshot cuts off the rest of the celery command, I'm curious what other configuration you're using for this Celery queue?
@jeffkala thanks for looking into it.
I also tried the default queue worker that runs alongside nautobot-server and hit the same problem. Since it has even fewer resources, we run this job on our external worker instead. I haven't seen any additional logs there.
Here is how we run it:
/usr/local/bin/nautobot-server celery worker --loglevel INFO --queues $QUEUES --max-memory-per-child 2000000 --concurrency 2 &
QUEUES is currently set to long, the queue where we intend to run long-running jobs.
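For context, here is a minimal sketch of how a job can be pointed at that queue (assuming a Nautobot release that supports the task_queues Job Meta attribute; the class and queue names are illustrative, not our actual job):

from nautobot.extras.jobs import Job

class LongRunningExampleJob(Job):
    class Meta:
        name = "Long Running Example Job"
        # Illustrative: offer the "long" queue so the job lands on the
        # dedicated worker started with --queues long.
        task_queues = ["long"]

    def run(self, data, commit):
        self.log_info(message="Running on the dedicated long queue.")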
Relevant configuration items from nautobot_config.py:
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")

RQ_QUEUES = {
    "default": {
        "HOST": REDIS_HOST,
        "PORT": 6379,
        "DB": 0,
        "PASSWORD": REDIS_PASSWORD,
        "SSL": False,
        "DEFAULT_TIMEOUT": 300,
    },
    "webhooks": {
        "HOST": REDIS_HOST,
        "PORT": 6379,
        "DB": 0,
        "PASSWORD": REDIS_PASSWORD,
        "SSL": False,
        "DEFAULT_TIMEOUT": 300,
    },
    "check_releases": {
        "HOST": REDIS_HOST,
        "PORT": 6379,
        "DB": 0,
        "PASSWORD": REDIS_PASSWORD,
        "SSL": False,
        "DEFAULT_TIMEOUT": 300,
    },
    "custom_fields": {
        "HOST": REDIS_HOST,
        "PORT": 6379,
        "DB": 0,
        "PASSWORD": REDIS_PASSWORD,
        "SSL": False,
        "DEFAULT_TIMEOUT": 300,
    },
}

# Nautobot uses Cacheops for database query caching. These are the following defaults.
# For detailed configuration see: https://github.com/Suor/django-cacheops#setup
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/1",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
    }
}
CACHEOPS_REDIS = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/1"
CELERY_BROKER_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/0"
CELERY_TASK_SOFT_TIME_LIMIT = 12 * 60 * 60
CELERY_TASK_TIME_LIMIT = 16 * 60 * 60
CELERY_RESULT_BACKEND = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:6379/0"
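For completeness, two related knobs could also live in nautobot_config.py to recycle worker children before they grow too large (a sketch only; I am assuming here that Nautobot forwards CELERY_-prefixed settings to Celery as worker_max_tasks_per_child and worker_max_memory_per_child):

# Assumed settings: recycle a worker child after N tasks, or once it exceeds
# the memory cap in KiB (mirroring the --max-memory-per-child CLI flag above).
CELERY_WORKER_MAX_TASKS_PER_CHILD = 10
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 2000000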
Please tell me if this isn't enough or if you're curious about any other configuration!
Thanks!!
Hello NTC, it turned out to be an error in one of our jobs. The problem was that a job unintentionally overwrote the Nornir processor setting to the retry processor from the salty package, which affected all other plugins and jobs. This was a mistake on our part, but it also raises a design consideration for plugin settings: they should be read-only, I think. Correct me if I'm wrong!
I think this issue can be closed, but it may be worth a look at why PLUGIN_SETTINGS is global across all jobs and, moreover, globally writable. Making a copy of it at each job start would be an easy fix, I think.
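For illustration, a minimal sketch of that defensive-copy idea (the helper name is hypothetical; the point is that each job reads its own deep copy of the plugin settings instead of mutating the shared dict):

import copy

from django.conf import settings

def get_plugin_settings(plugin_name):
    # Hypothetical helper: return a deep copy so an accidental write inside a
    # job cannot leak into other jobs or plugins sharing the same settings.
    return copy.deepcopy(settings.PLUGINS_CONFIG.get(plugin_name, {}))

# e.g. settings_for_this_run = get_plugin_settings("nautobot_golden_config")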
Thanks!
Viktor
Thanks @viktorkertesz, will close this out for now. I'm not following why PLUGIN_SETTINGS would be overwritten in general. Was this something one of your plugins was doing?
Environment
When running Golden Config tasks on a relatively large number of devices (>1000), we run out of memory on our worker pods and the task gets killed by the OS. I need some advice on where to look in order to solve this. Actually, I have two problems with the above behavior:
Running
Steps to Reproduce
When I run the same job for a small number of devices, it works fine.
Expected Behavior
Memory consumption shouldn't be this high; it should scale roughly linearly with the number of targeted devices.
Observed Behavior
I was watching CPU and memory consumption as the job started running.
Log on the pod: