jitsi / jitsi-autoscaler

Jitsi Autoscaler microservice

CPU and memory keep growing #131

Open Arzar opened 1 year ago

Arzar commented 1 year ago

I'm using the latest Jitsi autoscaler commit (acf86ac, 2023/01/13) on Ubuntu 22.04 arm64, on Oracle Cloud (Ampere A1 Flex, 2 CPU / 4 GB memory).

The CPU and memory used by the node process keep climbing slowly. Starting one month ago at almost 0% CPU and 0% memory, it is now at about 105% CPU and 20% memory:

    $ top
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
     455849 ubuntu    20   0 1664176 792724  34056 R 102.7 19.8 23329:20 node
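
As a cross-check independent of top, the same numbers can be sampled from inside the process itself. This is only a minimal diagnostic sketch using Node's process.cpuUsage() and process.memoryUsage(), not code from the autoscaler:

    // Hypothetical diagnostic snippet (not part of the autoscaler): sample the
    // process's own CPU and memory usage at a fixed interval so growth can be
    // charted over days without an external profiler.
    let lastCpu = process.cpuUsage();
    let lastTime = process.hrtime.bigint();

    setInterval(() => {
        const cpu = process.cpuUsage(lastCpu);           // CPU time used since last sample, in microseconds
        const now = process.hrtime.bigint();
        const elapsedUs = Number(now - lastTime) / 1000; // wall-clock nanoseconds -> microseconds
        const cpuPercent = ((cpu.user + cpu.system) / elapsedUs) * 100;
        const mem = process.memoryUsage();

        console.log(
            `cpu=${cpuPercent.toFixed(1)}% rss=${(mem.rss / 1024 / 1024).toFixed(1)}MiB ` +
            `heapUsed=${(mem.heapUsed / 1024 / 1024).toFixed(1)}MiB`,
        );

        lastCpu = process.cpuUsage();
        lastTime = now;
    }, 60_000);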

In our test environment there is just one JVB running and reporting statistics to the test autoscaler without any scale-up or down. In our production environment we do scale up and down but we restart the autoscaler every night because of this CPU/memory issue.

I profiled our test autoscaler with chrome://inspect and got this:

    Self time           Total time          Function
    46492.8 ms  43.13%  97726.0 ms  90.67%  (anonymous) status.js:82
    46492.8 ms  43.13%  97726.0 ms  90.67%    listOnTimeout internal/timers.js:502
    46492.8 ms  43.13%  97726.0 ms  90.67%      processTimers internal/timers.js:482
    43874.9 ms  40.71%  43874.9 ms  40.71%  (anonymous) status.js:96
    43874.9 ms  40.71%  43874.9 ms  40.71%    (anonymous) status.js:94
    43874.9 ms  40.71%  43874.9 ms  40.71%      get stats status.js:93
    43874.9 ms  40.71%  43874.9 ms  40.71%        (anonymous) status.js:82
    43874.9 ms  40.71%  43874.9 ms  40.71%          listOnTimeout internal/timers.js:502
    43874.9 ms  40.71%  43874.9 ms  40.71%            processTimers internal/timers.js:482
    4877.3 ms    4.53%  5978.8 ms    5.55%  (anonymous) status.js:124
    4877.3 ms    4.53%  5978.8 ms    5.55%    get stats status.js:93
    4877.3 ms    4.53%  5978.8 ms    5.55%      (anonymous) status.js:82
    4877.3 ms    4.53%  5978.8 ms    5.55%        listOnTimeout internal/timers.js:502
    4877.3 ms    4.53%  5978.8 ms    5.55%          processTimers internal/timers.js:502

This suggests that the node process is drowning in timer management, but I'm not sure how to debug further. Has the Jitsi team encountered this issue? Do you have any suggestions on how to track down the root cause?
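
For illustration only, not the autoscaler's actual code: a profile dominated by listOnTimeout/processTimers is the typical signature of timers that are created repeatedly but never cleared, so the timer lists the event loop has to walk keep growing. A minimal sketch of that pattern and its fix:

    // Illustrative only -- not code from the autoscaler. Every call below
    // registers a new repeating timer and nothing ever clears it, so the
    // number of live timers (and whatever their closures retain) grows
    // without bound, and the event loop spends ever more time in
    // listOnTimeout/processTimers servicing them.
    const leakedHandles: NodeJS.Timeout[] = [];

    function startStatsPoll(instanceId: string): void {
        const handle = setInterval(() => {
            // pretend to collect/report stats for this instance
            console.debug(`polling stats for ${instanceId}`);
        }, 1000);
        leakedHandles.push(handle); // leaked: clearInterval(handle) is never called
    }

    // The fix for this pattern is to clear the timer when its task goes away:
    function stopStatsPoll(handle: NodeJS.Timeout): void {
        clearInterval(handle);
    }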

aaronkvanmeerten commented 1 year ago

Hi, thanks for raising this issue. We had not been running the latest codebase in our production systems until recently, and now we are experiencing the same issue. We will be working to track it down, and I'll report back here when we figure it out and fix it. In the meantime, if you have any more details about what you saw, please let me know!

aaronkvanmeerten commented 1 year ago

An earlier commit had updated the underlying OCI SDK, and that update seems to have been the culprit. It has now been reverted; the current autoscaler Docker image, jitsi/autoscaler:0.0.19, includes the revert and seems to resolve the behavior.

aaronkvanmeerten commented 1 year ago

In addition, I have an open PR that was used to build the Docker image tagged jitsi/autoscaler:0.0.20, which includes library updates and the code changes required to match: https://github.com/jitsi/jitsi-autoscaler/pull/145

We haven't run this one anywhere except a dev environment, so I can't speak to how much CPU/memory it consumes over a long run, but I'll report back here if it looks OK and gets merged.

aaronkvanmeerten commented 1 year ago

I have merged the updated package dependencies into latest, and would suggest you try either latest master (0.0.20 on Docker Hub) or the commit prior (0.0.19 on Docker Hub), depending on your taste for the novel. I am promoting 0.0.19 to production in our systems now, and I hope you can weigh in eventually to let us know whether one of these candidates solves your issue. Thanks again for your report, and sorry it took so long to address it!

aaronkvanmeerten commented 11 months ago

It seems that 0.0.20 continues to leak. The latest candidate in master is also on Docker Hub as 0.0.22, with the older oci-sdk reverted but all other dependencies updated. Please note that if you were scraping Prometheus metrics from the autoscaler, they have moved to a new port as of 0.0.21.
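
If it helps when upgrading, here is a small sketch for checking which port now serves the Prometheus metrics. The port numbers and the standard /metrics path below are placeholders and assumptions, so substitute the values from your own deployment:

    // Hypothetical check (Node 18+ for global fetch): probe candidate ports
    // for the Prometheus /metrics endpoint after upgrading. Replace the port
    // list with the ports from your own configuration.
    const candidatePorts = [8080, 9090]; // placeholders, not known autoscaler defaults

    async function findMetricsPort(host: string): Promise<number | undefined> {
        for (const port of candidatePorts) {
            try {
                const res = await fetch(`http://${host}:${port}/metrics`);
                if (res.ok) {
                    return port;
                }
            } catch {
                // port closed or unreachable; try the next one
            }
        }
        return undefined;
    }

    findMetricsPort('localhost').then((port) =>
        console.log(port ? `metrics served on port ${port}` : 'metrics endpoint not found'),
    );

Once the right port is confirmed, the Prometheus scrape configuration can be updated to match.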

Arzar commented 11 months ago

Thanks for the follow-up! Our system is now in production, so it's difficult to make any changes, but the next time we do a major update I will try to upgrade jitsi-autoscaler.

aaronkvanmeerten commented 11 months ago

I've confirmed that 0.0.22 does not show the leak, and 0.0.21 does. I have opened an issue with the oci typescript sdk project: https://github.com/oracle/oci-typescript-sdk/issues/247 in case this is of interest to you.