openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0
8.49k stars 4.7k forks source link

Web console creates high CPU load after several hours later without any manipulation #1921

Closed nak3 closed 9 years ago

nak3 commented 9 years ago

Environment

Chorome

$ google-chrome --version
Google Chrome 39.0.2171.95 

OpenShift

# openshift version
openshift v0.4.3.2-125-g1104fae
kubernetes v0.14.1-582-gb12d75d

Reproduce steps

  1. Create project and pod (hello-openshift) followed by this material.
  2. Access to the web console.

    eg. https://ose3-master.example.com:8443/console/project/demo/browse/pods

  3. After 2 hours later without any manipulation of webconsole, process of the Chrome becomes high CPU load.

    Additional information

    • When I checked Task Manager of the Chrome, Javascript memory of OpenShift Webconsole always keep rising.
    • After CPU high load, it will be fixed soon by moving to the other page or the page reload
brenton commented 9 years ago

I've started noticing this in my environments too with Firefox.

liggitt commented 9 years ago

If you enable INFO logging (https://github.com/openshift/origin/tree/master/assets#enable--disable-console-log-output), does anything jump out at you when the browser should be in a steady state after the initial page load?

nak3 commented 9 years ago

@liggitt Sorry my late reply. I changed loglevel and watched the logs. The output was here [1].

This logs are ouptut when the browser process is not highload, but it is output higher cycle and continuously when it is highload.

[1] https://kenjiro.fedorapeople.org/tmp/ose3-highload-console.log

liggitt commented 9 years ago

Wondering if this is related to token expiration with websockets. @nak3 / @brenton, do you know what your access token timeout is set to? If you hit $server/oauth/token/request, it should give you a token and tell you the expiration time.

@spadgett, if you want to recreate sitting on a project page with websockets reopening, with an expired token, you can do the following:

  1. Modify the ReadTimeout in pkg/cmd/server/origin/master.go to something short like 30 * time.Second. rebuild and rerun openshift start master. This will make websocket connections timeout in 30 seconds instead of 5 minutes.
  2. Log into the web console and view a project overview page.
  3. Find the access token (localStorage["LocalStorageUserStore.token"]) and delete it using osc:

    $ osc delete oauthaccesstoken [token]
  4. In 30 seconds, the websocket should timeout and attempt to re-establish and you should see error messages appear in the console. Ideally, those 401 errors would trigger a login (#1773), but they should at least not prompt repeated attempts to establish the websockets and run up CPU
nak3 commented 9 years ago

@liggitt token time out is 86400. I think it is default value. https://github.com/openshift/origin/blob/6773f17b61257b2ad9176b0fd3b08a4521793921/pkg/cmd/server/start/master_args.go#L270

{
   "access_token": "YTg0NzgwOGYtYjJmMy00OWY5LWI5NzktYTg5ZjU3NTgyYzlh",
   "expires_in": 86400,
   "token_type": "bearer"
}
sdodson commented 9 years ago

After 12 hours idling Firefox was oomkilled after chewing through 8GB so I think we're just leaking something. Leave chrome sitting for five minutes on a project overview and the heap utilization goes from 40mb to 70mb after collections.

I tried to follow https://developer.chrome.com/devtools/docs/javascript-memory-profiling#object-allocation-tracker but I got a bit lost trying to figure out what's creating them but I think we're leaking HTMLAnchorElement objects?

spadgett commented 9 years ago

I've identified at least one significant memory leak. The hawtio-core-navigation component we use for the left sidebar leaves JavaScript references to detached DOM nodes (as @sdodson notes, HTML anchor elements). These never get garbage collected and account for a significant amount of the memory usage of the web console. Thousands of these are created.

new_tab

Here is the CPU profile after leaving the console idle for an hour.

openshift_web_console

This function in hawtio-core-navigation appears to be the source of the leaked DOM nodes:

https://github.com/hawtio/hawtio-core-navigation/blob/master/dist/hawtio-core-navigation.js#L644

The temporary anchor element used to resolve relative URIs is never garbage collected.

I've modified hawio-core-navigation locally. Reusing one shared anchor element instead of creating a new one at each function invocation causes the leak to go away. I'm still concerned about how often the isSelected function is called, however.

We might have other memory leaks, but this looks like the most significant.

liggitt commented 9 years ago

@spadgett can you pull in https://github.com/hawtio/hawtio-core-navigation/releases/tag/v2.0.48

spadgett commented 9 years ago

I've discovered another issue where we're not properly closing web sockets when navigating between pages. This results in many web sockets being open for watches on each resource type and might explain some of the CPU load.

spadgett commented 9 years ago

The latest hawtio-core-navigation fixes the memory leak, but causes item.isSelected() to be called constantly. This causes high CPU load anytime the Web Console is open.

liggitt commented 9 years ago

The following three issues have been addressed:

We think these address the main causes.

liggitt commented 9 years ago

Reopening... @spadgett found one more issue related to websockets getting reopening on failure

spadgett commented 9 years ago

@liggitt Here is the error from the server that causes the client to reopen web sockets again and again.

{"type":"ERROR","object":{"kind":"Status","apiVersion":"v1beta3","metadata":{},"status":"Failure","message":"401: The event in requested index is outdated and cleared (the requested history has been cleared [1622/1013]) [2621]"}}

I saw this when I was changing the number of desired replicas.

Interestingly, I noticed the Cannot read property 'triggers' of undefined error from #2725 in the console as well:

TypeError: Cannot read property 'triggers' of undefined
    at associateDeploymentConfigTriggersToBuild (overview.js:286)
    at overview.js:320
    at Object.forEach (angular.js:331)
    at Object.<anonymous> (overview.js:319)
    at fire (jquery.js:3094)
    at Object.self.fireWith (jquery.js:3206)
    at Object.self.fire (jquery.js:3213)
    at data.js:630
    at Scope.$eval (angular.js:14383)
    at Scope.$apply (angular.js:14482)

I wonder if it's related.

liggitt commented 9 years ago

Ah yes. Our watch handler isn't dealing with ERROR responses. In that case we have to relist.

On Jun 6, 2015, at 10:16 AM, Sam Padgett notifications@github.com wrote:

@liggitt Here is the error from the server that causes the client to reopen web sockets again and again.

{"type":"ERROR","object":{"kind":"Status","apiVersion":"v1beta3","metadata":{},"status":"Failure","message":"401: The event in requested index is outdated and cleared (the requested history has been cleared [1622/1013]) [2621]"}}

I saw this when I was changing the number of desired replicas.

Interestingly, I noticed the Cannot read property 'triggers' of undefined error from #2725 in the console as well:

TypeError: Cannot read property 'triggers' of undefined at associateDeploymentConfigTriggersToBuild (overview.js:286) at overview.js:320 at Object.forEach (angular.js:331) at Object. (overview.js:319) at fire (jquery.js:3094) at Object.self.fireWith (jquery.js:3206) at Object.self.fire (jquery.js:3213) at data.js:630 at Scope.$eval (angular.js:14383) at Scope.$apply (angular.js:14482) I wonder if it's related.

— Reply to this email directly or view it on GitHub.

liggitt commented 9 years ago

I'm working on a fix for the relist

spadgett commented 9 years ago

Web Console memory and CPU usage are now dramatically lower.

screen shot 2015-06-09 at 4 57 41 pm

liggitt commented 9 years ago

this makes me so happy