lfdominguez opened 1 year ago
"magically" the CPU is not using the 100% anymore.....
I also have this problem, but on kubernetes.
I also have this problem, on Docker (Compose), with the default yml file provided in the documentation.
What helped me: killing the worker processes; since then everything is fine.
@soerendohmen that's kind of beside the point, as with docker/kubernetes your containers will eventually get restarted. All that automation goes in the bin, as I would have to kill the CPU-hogging process on every authentik version update, on every container restart/recreate, and on every reboot of my host system.
Yes, of course, this is no solution. I didn't want to do this, but there were 3 workers consuming 100% CPU; the alternative would have been shutting down the service. So I tried it, and it worked. Even the "change password" mails, which I had tried to send before, came through at that moment. So something hung, but I wasn't able to figure out what.
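For anyone needing the same stopgap, here is a sketch of that workaround, assuming the documented docker-compose setup with a service named worker (the pkill pattern matches the worker command line shown further down in this thread):

# restart just the worker service
docker compose restart worker

# or, on a plain host, kill the stuck celery worker processes directly
pkill -f 'celery -A authentik.root.celery worker'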
Same here
I am also experiencing this issue, and even when I stop this process, Authentik keeps working.
I'm reasonably sure there's a logic error somewhere that causes the blueprint tasks to recursively re-trigger, which causes the high CPU usage. However, high CPU usage is expected on the first startup, but only for a couple of minutes while all the initial setup is done.
I am experiencing it too, on 2023.6
Same behavior here, on an upgraded 2023.6.1 instance and on a locally tested fresh beta version. It doesn't go away after startup, and trace logging sadly doesn't show anything extraordinary.
Same here. Not constantly at 100% anymore since upgrading 2023.6.0 > 2023.6.1. One process, "/lifecycle/ak worker", hovers around 85% CPU. The rest of the host is at least workable again, but that still seems very high for a single-user test setup...
Notifications are being flooded with:
System task exception
Task notification_transport encountered an error: Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/celery/app/trace.py", line 451, in trace_task
R = retval = fun(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/celery/app/trace.py", line 734, in __protected_call__
return self.run(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/celery/app/autoretry.py", line 54, in run
ret = task.retry(exc=exc, **retry_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/celery/app/task.py", line 717, in retry
raise_with_context(exc)
File "/usr/local/lib/python3.11/site-packages/celery/app/autoretry.py", line 34, in run
return task._orig_run(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/authentik/events/tasks.py", line 129, in notification_transport
raise exc
File "/authentik/events/tasks.py", line 125, in notification_transport
transport.send(notification)
File "/authentik/events/models.py", line 331, in send
return self.send_email(notification)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/authentik/events/models.py", line 472, in send_email
raise NotificationTransportError(exc) from exc
authentik.events.models.NotificationTransportError: [Errno 99] Cannot assign requested address
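For what it's worth, [Errno 99] Cannot assign requested address usually means the worker tried to open a TCP connection to an address it cannot reach, e.g. an unset or localhost SMTP host inside the container. A quick hypothetical check from inside the worker container, assuming the documented AUTHENTIK_EMAIL__HOST / AUTHENTIK_EMAIL__PORT settings (the fallback values below are only placeholders):

# prints 'SMTP reachable' only if the configured mail server accepts a connection
docker compose exec worker python -c "import os, smtplib; smtplib.SMTP(os.environ.get('AUTHENTIK_EMAIL__HOST', 'localhost'), int(os.environ.get('AUTHENTIK_EMAIL__PORT', '25')), timeout=5); print('SMTP reachable')"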
Recreating the redis container made a difference (see the snippet after the logs below). Now I have two processes constantly between 60% and 85%, both running: /usr/local/bin/python /usr/local/bin/gunicorn -c ./lifecycle/gunicorn.conf.py authentik.root.asgi:application
The system is still being built and is never under heavy load. At peak times it processes maybe 5 logins per minute.
The log of the worker container shows an endless stream of:
INF event=Task finished logger=authentik.root.celery pid=171830 state=SUCCESS task_id=cd3af25a-cae2-4caf-9080-56602967ac04 task_name=scim_signal_direct timestamp=2023-07-18T21:26:17.970770
INF event=Task started logger=authentik.root.celery pid=171831 task_id=2b35a65d-1c90-463e-84ff-6b2fa10c5b64 task_name=scim_signal_direct timestamp=2023-07-18T21:26:34.099977
INF event=Task finished logger=authentik.root.celery pid=171831 state=SUCCESS task_id=2b35a65d-1c90-463e-84ff-6b2fa10c5b64 task_name=scim_signal_direct timestamp=2023-07-18T21:26:34.119047
INF event=Task started logger=authentik.root.celery pid=171832 task_id=7a633ade-1496-44f6-964f-309dceb04577 task_name=blueprints_discovery timestamp=2023-07-18T21:26:35.894243
INF event=Task started logger=authentik.root.celery pid=171833 task_id=4daa90db-1caf-453e-b315-9fd071b43c00 task_name=clear_failed_blueprints timestamp=2023-07-18T21:26:35.977922
INF event=Task finished logger=authentik.root.celery pid=171833 state=SUCCESS task_id=4daa90db-1caf-453e-b315-9fd071b43c00 task_name=clear_failed_blueprints timestamp=2023-07-18T21:26:36.017502
INF event=Task finished logger=authentik.root.celery pid=171832 state=SUCCESS task_id=7a633ade-1496-44f6-964f-309dceb04577 task_name=blueprints_discovery timestamp=2023-07-18T21:26:36.163187
There are also 140+k notifications pending.
😒
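For reference, the redis recreation mentioned above can be done without touching the other services; a sketch assuming the documented compose file, where the service is simply named redis:

# recreate only the redis container in place
docker compose up -d --force-recreate redis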
No new ones are being added in the meantime. It looks like they are all the same as mentioned in the previous comment. I can remove them one by one, but the 'Clear All' button kicks off some other process (a DB SELECT) that seems to linger endlessly, eating the other two cores assigned to the host. I've had it running for over a day but it never finished, and the number of notifications remains the same. Could I possibly scrap those in the database or something?
For me, 'python -m manage worker' is pinned at 99% CPU. I just installed authentik yesterday and am the only user thus far. No other application is using nearly as much CPU as this.
Turning on debug logging prints nothing after the initial startup, when no activity should be occurring
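For the record, authentik's verbosity is controlled by the AUTHENTIK_LOG_LEVEL variable; a minimal sketch for raising it to trace, assuming your compose file passes AUTHENTIK_* variables through to both the server and worker services (e.g. from .env):

# in .env, then recreate the containers
AUTHENTIK_LOG_LEVEL=trace

docker compose up -d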
A couple of days later, those processes seem to have cooled down. Notifications stopped adding up around 280k. I haven't been able to clear them: the host has about 7GB of available RAM, but runs out before the SELECT statement on the database finishes... Will retry later with the other containers shut down. Will keep an eye on performance, but at the moment, clear sailing.
Same, mine also cooled off after a few hours
FWIW this just happened to me and all of the alerts appear to be stored in the authentik_events_notification table. I ended up issuing a TRUNCATE TABLE authentik_events_notification; just to nuke every notification and start fresh, but you could also issue an UPDATE and flip the seen flag from f to t.
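For anyone sitting on a six-figure backlog, the same cleanup can be run from the host; a sketch assuming the documented compose setup (service named postgresql, user and database both defaulting to authentik):

# destructive: drop every stored notification, as described above
docker compose exec postgresql psql -U authentik -d authentik -c "TRUNCATE TABLE authentik_events_notification;"

# gentler alternative: mark everything as seen instead
docker compose exec postgresql psql -U authentik -d authentik -c "UPDATE authentik_events_notification SET seen = true;"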
(FYI I went to truncate authentik_events_notification but it's empty so that can't be the problem for everyone).
Did anyone ever get to the bottom of this?
I've tried deploying Authentik several times but keep getting blocked by how poorly it performs, with the worker container pegging whatever CPU we throw at it at 100%.
This problem seems to be widespread, with both open and closed tickets.
I've also started a discussion on Discord, as this isn't getting much traction on GitHub: https://discord.com/channels/809154715984199690/1166120303341084693/1166120303341084693
I also experienced this. I turned off the worker container until the devs fix this issue.
I have a pretty basic setup for my homelab. I had already set up my apps and users, and my workflows seem to work fine without the worker running.
As per https://github.com/goauthentik/authentik/issues/7025#issuecomment-1828894097 this is still a major issue.
Any tips/tricks on how to deal with this on Unraid, running authentik via templates?
Same CPU issue... New deployment 12/18/2023, :latest of all images. It was suggested by @BeryJu that the worker will settle down after a few minutes. Not in my case. I also see a huge increase in disk activity. This makes SSH sessions, as well as ALL other running services, respond so slowly that it effectively takes them all down. I will try deploying without the worker, as the documented and suggested deployment, which includes the worker, is untenable. Where the CPU and disk I/O drop off on the graph below is where I shut the stack down.
Switched off the worker here for a couple of weeks now; waiting for a workable update.
Update: updated to release 2023.10.5. Same issue.
DKing, in the Discord discussion from https://github.com/goauthentik/authentik/issues/5746#issuecomment-1775995668, posted a link to a PR that fixed my high CPU usage: https://github.com/goauthentik/authentik/pull/7762
Reporting back (Unraid, solved): In hindsight I did 3 things, not sure which one solved it: 1) in the Unraid template I added "--ulimit nofile=10240:10240" to the Extra Parameters field as a flag (advanced view); 2) redeployed (removing containers and images) both worker and authentik; 3) added AUTHENTIK_REDIS__DB=1 as a variable in the Unraid template for both worker and authentik. Now everything seems normal.
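For non-Unraid setups, those template changes translate roughly to the following docker run flags; a sketch only, with the image tag as a placeholder and the usual AUTHENTIK_* database/redis variables omitted:

# raise the open-file limit and point the worker at a dedicated redis DB
docker run -d --name authentik-worker \
  --ulimit nofile=10240:10240 \
  -e AUTHENTIK_REDIS__DB=1 \
  ghcr.io/goauthentik/server:2023.10.5 worker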
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I reckon this should not be closed, as it is not solved...
For anyone that still has this issue, please check Jens' latest comment on #7762. We have made some improvements in 2024.2.2 that should prevent this from happening.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I too have this issue. The configuration is exactly the one from the docker compose documentation. 100% celeryd CPU usage. Vanilla and up-to-date Arch Linux install using docker-compose. AUTHENTIK_TAG:-2024.4.2
same issue for me
I have the same issue with the docker image 2024.6.0
Same issue here. AUTHENTIK_TAG=2024.6.1
Seems like there is some weird unexpected behaviour. I was able to make the worker start and to reach the initial setup endpoint after a couple of restarts and waiting about 5-10 min. The log messages did not change, but I was eventually able to start the setup process. I did not change any configs.
Describe the bug: Right after starting up my docker-compose setup, based on the given docker-compose.yml file, the worker container causes high CPU load.
To Reproduce — steps to reproduce the behavior:
1. docker-compose up
2. docker-compose top shows the worker process at high CPU (flags annotated below):
/usr/local/bin/python /usr/local/bin/celery -A authentik.root.celery worker -Ofair --max-tasks-per-child=1 --autoscale 3,1 -E -B -s /tmp/celerybeat-schedule -Q authentik,authentik_scheduled,authentik_events
3. docker-compose stop worker
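The flags on that worker command (step 2) are standard Celery options; annotated for reference:

# -Ofair                       distribute tasks fairly across pool processes
# --max-tasks-per-child=1      replace each pool process after a single task,
#                              so every task pays process-startup cost
# --autoscale 3,1              scale the pool between 1 and 3 processes
# -E                           emit task events for monitoring
# -B                           run the beat scheduler inside the worker
# -s /tmp/celerybeat-schedule  where beat persists its schedule
# -Q authentik,authentik_scheduled,authentik_events   queues consumed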
Expected behavior: I would expect the system tasks not to fire every second or continuously, and not to consume so much CPU.
Screenshots: ![image](https://github.com/goauthentik/authentik/assets/5604131/ba0ecbeb-a465-4e89-a568-f57c5c9e503b)
Logs: docker compose top, docker compose logs worker
Version and Deployment (please complete the following information):
Additional context: I tested some other "fixes" from other (already closed) issues, like downgrading to 2023.2.1, applying a patch of a dev version, etc. Nothing works.