redhat-developer / rh-che

Eclipse Che hosted by Red Hat
https://che.openshift.io/
Eclipse Public License 2.0

Che 6: "Workspace is not running" popup creating a workspace from codebase UI #629

Closed: l0rd closed this issue 6 years ago

l0rd commented 6 years ago

Steps to reproduce:

1- From "Codebase", click on "Create Workspace" 2- IDE is loading and once loaded displayed to the user

Current output: the "Workspace is not running" popup is shown.

Expected output: the workspace starts and the IDE opens without the popup.

Workaround

See screenshot (2018-04-11 at 14:30).

The issue gets more problematic when the user starts a second workspace (screenshot from 2018-04-11 at 14:33).

ibuziuk commented 6 years ago

The issue gets more problematic when the user starts a second workspace

This part is super weird ^ - Che treats the first workspace as "Started / Starting" even though it is stopped...

ibuziuk commented 6 years ago

Seems to be some race condition issue: on the second attempt I was able to start a new workspace without the pop-up being shown.

ibuziuk commented 6 years ago

@rhopp @ScrewTSW could you please confirm that the issue is intermittent? That would explain why it was not caught by the functional / che-starter integration tests.

ibuziuk commented 6 years ago

@rhopp on the che-starter side, integration tests that start workspaces were disabled [1] some time ago due to https://github.com/openshiftio/openshift.io/issues/2273. Against which OSO cluster is the CI test account configured now?

[1] https://github.com/redhat-developer/che-starter/blob/master/src/test/java/io/fabric8/che/starter/client/WorkspaceClientTest.java#L106

l0rd commented 6 years ago

We need to consider that this issue was reproducible right after the Che 6 announcement (>500 requests/hour to che-starter according to unleash).

I have spotted one error and a couple of warnings in che-starter logs:

Error

Unexpected error occurred invoking async method 'public io.fabric8.che.starter.model.workspace.Workspace io.fabric8.che.starter.client.WorkspaceClient.startWorkspaceAsync(java.lang.String,java.lang.String) throws io.fabric8.che.starter.exception.WorkspaceNotFound'

Warnings

'name' field is blank, using 'login' osioide-bot as committer name in 'Git' preferences
Could not deserialize che server response, possibly v6

ibuziuk commented 6 years ago

Some interesting test results with the workspace status API. There were integration tests in che-starter for starting a workspace / starting a workspace when there is already one running; they were disabled due to https://github.com/openshiftio/openshift.io/issues/2273. Now our CI test account is no longer provisioned against free-stg, so I decided to enable these tests again [1]. Basically, there are 2 commits in the PR:

  1. Just enabling the tests. This fails with the following error:
Failed tests: 
  WorkspaceClientTest.createAndStartWorkspace:95 expected:<[START]ING> but was:<[RUNN]ING>
  WorkspaceClientTest.createAndStartWorkspaceWhenThereIsAlreadyOneRunning:125 expected:<[START]ING> but was:<[RUNN]ING>

This is very strange, since the workspace status check happens after waitUntilWorkspaceIsRunning and there is a message in the logs that the workspace is actually RUNNING:

{"time":"2018-04-11T15:33:24.049+01:00","msg":"Workspace 'ifpgm' is running","logger":"io.fabric8.che.starter.client.WorkspaceClient","thread":"main","level":"INFO","HOSTNAME":"n18.dusty.ci.centos.org"}

  2. After removing the status check in the second commit, the tests start to pass against Che 6.

Basically, in Che 6 the endpoint for obtaining status was removed [2], and the status is now fetched directly from the workspace definition, which for some reason is flapping in the test ("RUNNING" / "STARTING").
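
For context, a minimal sketch of what reading the status from the workspace definition can look like (the class and helper names here are hypothetical; the real logic lives in io.fabric8.che.starter.client.WorkspaceClient):

```java
import java.util.Map;

import org.springframework.web.client.RestTemplate;

// Sketch only: in Che 6 there is no dedicated status endpoint, so the status
// is read from the workspace definition returned by GET /api/workspace/{id}.
public class WorkspaceStatusSketch {

    private final RestTemplate template = new RestTemplate();

    public String getWorkspaceStatus(String cheServerUrl, String workspaceId) {
        Map<?, ?> workspace = template.getForObject(
                cheServerUrl + "/api/workspace/" + workspaceId, Map.class);
        // The "status" field can briefly still read STARTING after the workspace
        // reported RUNNING, which would explain the flapping assertions above.
        return (String) workspace.get("status");
    }
}
```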

[1] https://github.com/redhat-developer/che-starter/pull/289
[2] https://github.com/redhat-developer/che-starter/commit/86a16eab987e07c1597d5ef94caf60ca9b37e871

ibuziuk commented 6 years ago

We need to consider that this issue was reproducible right after the Che 6 announcement (>500 requests/hour to che-starter according to unleash).

Does not look like a high load ^

@l0rd the warnings are OK:

Could not deserialize che server response, possibly v6

It just states that the response does not have the Che 5 format and is likely Che 6.

'name' field is blank, using 'login' osioide-bot as committer name in 'Git' preferences

The user's GitHub name is not publicly available, so we use the GitHub login in Git preferences.

The error looks strange:

Unexpected error occurred invoking async method 'public io.fabric8.che.starter.model.workspace.Workspace io.fabric8.che.starter.client.WorkspaceClient.startWorkspaceAsync(java.lang.String,java.lang.String) throws io.fabric8.che.starter.exception.WorkspaceNotFound'

It is not expected that the workspace we are trying to start does not exist. In any case, those warnings / errors do not seem to be related to this issue.
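
As a side note, that generic message is what Spring's default handler logs for uncaught exceptions in @Async methods. A custom handler along these lines (the class name is hypothetical, not che-starter's code) would surface the underlying cause in the logs:

```java
import org.slf4j.LoggerFactory;
import org.springframework.aop.interceptor.AsyncUncaughtExceptionHandler;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.AsyncConfigurerSupport;
import org.springframework.scheduling.annotation.EnableAsync;

@Configuration
@EnableAsync
public class AsyncExceptionHandlingSketch extends AsyncConfigurerSupport {

    @Override
    public AsyncUncaughtExceptionHandler getAsyncUncaughtExceptionHandler() {
        // Log the failing method and the real exception instead of the generic
        // "Unexpected error occurred invoking async method ..." line.
        return (ex, method, params) ->
                LoggerFactory.getLogger(method.getDeclaringClass())
                        .error("Async method '{}' failed", method.getName(), ex);
    }
}
```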

ibuziuk commented 6 years ago

I suspect there is some problem / race condition with workspace startup upstream - I was able to reproduce the problem against the https://rhche.openshift.io/dashboard/#/ dashboard:

  1. Create a workspace against the vert.x stack
  2. Press the "Open in IDE" button (screenshot)
  3. ERROR: the workspace is not starting (screenshot)
  4. Press "Start" -> the workspace is successfully started

ibuziuk commented 6 years ago

Created an issue upstream for clarifications https://github.com/eclipse/che/issues/9381

slemeur commented 6 years ago

Just a heads-up that I just hit the error - with the flow in the description (not the flow from rhche.openshift.io).

ibuziuk commented 6 years ago

Some related openshift.io issues:

ibuziuk commented 6 years ago

@slemeur @gorkem @l0rd I have started to doubt whether che.workspace.auto_start=false is required for Che 6. Basically, I cannot come up with any meaningful use-case for it. The auto-start property was introduced in Che 5 because there was no way to limit the number of running workspaces, and once a user navigated to a workspace by URL (e.g. a bookmark) we had to prevent startup, since no proper explanation of why startup failed (already one running workspace / quota limit) would be provided from the che-server side. However, the situation changed with CHE_LIMITS_USER_WORKSPACES_RUN_COUNT [1], and now the user will get a meaningful message from the che server ("You are not allowed to start more workspaces"). IMO by removing che.workspace.auto_start we would have only pros and no cons (see the sketch below):
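
For illustration, the two settings in question in properties form (names as discussed in this thread; the values shown are the proposed ones, not necessarily what is deployed today):

```properties
# Let the IDE auto-start a stopped workspace when the user opens it
che.workspace.auto_start=true
# Per-user limit of running workspaces (exposed as CHE_LIMITS_USER_WORKSPACES_RUN_COUNT);
# with this in place the che server itself explains why a second start is refused
che.limits.user.workspaces.run.count=1
```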

WDYT? Maybe I'm missing something.

[1] https://github.com/redhat-developer/rh-che/issues/226

l0rd commented 6 years ago

@ibuziuk I agree with you: we should set che.workspace.auto_start to true now. But I would be cautious about doing it right away. Let me try to explain why:

I have created another issue (an enhancement rather than a bug) to address the auto_start property: #644

ibuziuk commented 6 years ago

OK, it seems that @rhopp was right about the race condition. So, what actually happens:


What I would propose:

  • set che.workspace.auto_start to true for prod-preview
  • update che-starter and remove async startup (che-starter will only stop all running workspaces, which would be synchronous)
  • get QA approval that all works fine before updating prod

@l0rd @rhopp WDYT?

rhopp commented 6 years ago

What I would propose:

  • set che.workspace.auto_start to true for prod-preview
  • update che-starter and remove async startup (che-starter will only stop all running workspaces, which would be synchronous)
  • get QA approval that all works fine before updating prod

I think this could work. One question, though: will Codebases-ui open the workspace after the running workspaces are successfully stopped, or independently of that?

garagatyi commented 6 years ago

Looks like what is happening in the IDE:

  1. Send request to start WS
  2. Open IDE
  3. IDE is opened and loaded
  4. IDE checks WS status which is STOPPED for the time being
  5. IDE shows that WS is stopped

Because of the current code in upstream Che and the RH-Che customizations, the request that starts the WS may take some time, whereas it is usually expected to be fully async and instantaneous. There is an issue to rework that, but it is complicated - not sure when it will be fixed.

As far as I understand, setting che.workspace.auto_start to true would show an error when a user clicks the start workspace button inside the IDE in the OSIO flow (because the WS was started by che-starter in the meantime). If there is no error like this, we probably have another bug. @ibuziuk can you check whether we get an error in the IDE in this case?

To fix this situation properly we should change the behavior of the IDE. Now:

  1. IDE checks whether WS is STOPPED
  2. If it is, IDE shows a pop-up.

Would be (see the sketch after this list):

  1. IDE checks whether WS is STOPPED
  2. IDE subscribes to WS events
  3. IDE checks whether WS is STOPPED again (kind of like double-checked locking)
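
A minimal sketch of that check / subscribe / re-check flow; all names below are assumptions, not the actual Che IDE API:

```java
import java.util.function.Consumer;

public abstract class IdeOpenFlowSketch {

    enum WorkspaceStatus { STOPPED, STARTING, RUNNING }

    abstract WorkspaceStatus statusOf(String workspaceId);
    abstract void subscribeToStatusEvents(String workspaceId, Consumer<WorkspaceStatus> listener);
    abstract void attachIde(String workspaceId);

    public void openWorkspace(String workspaceId) {
        if (statusOf(workspaceId) != WorkspaceStatus.STOPPED) {
            attachIde(workspaceId);           // already starting or running
            return;
        }
        // Subscribe before re-checking: a RUNNING event that fires between the
        // first check and the subscription can no longer be missed.
        subscribeToStatusEvents(workspaceId, status -> {
            if (status == WorkspaceStatus.RUNNING) {
                attachIde(workspaceId);
            }
        });
        if (statusOf(workspaceId) != WorkspaceStatus.STOPPED) {
            attachIde(workspaceId);           // it started while we were subscribing
        }
    }
}
```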

ibuziuk commented 6 years ago

I think this could work. One question, though: will Codebases-ui open the workspace after the running workspaces are successfully stopped, or independently of that?

@rhopp the codebase UI will get the URL for opening a new workspace only once all workspaces are stopped. Assuming that the grace termination period for pods is '0', which allows a workspace to be terminated almost instantly, and that the max number of running workspaces on OSIO is 1, this should not take long.
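
For reference, the pod-level setting being referred to is standard Kubernetes/OpenShift configuration; a sketch of the relevant fragment of a workspace pod spec:

```yaml
# A zero grace period makes OpenShift kill the workspace pod immediately on
# delete instead of waiting the default 30 seconds for graceful shutdown.
spec:
  terminationGracePeriodSeconds: 0
```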

ibuziuk commented 6 years ago

@garagatyi if the workspace is started by che-starter and the user presses the "start workspace" button, the following error is shown:

(screenshot of the error message)

ibuziuk commented 6 years ago

Created an issue upstream for IDE update on workspace events https://github.com/eclipse/che/issues/9496

ibuziuk commented 6 years ago

It seems that the only thing we are missing is "why/when does stopping workspaces & starting a new workspace sometimes take longer than expected". Continuing to investigate...

slemeur commented 6 years ago

Thanks @ibuziuk, @rhopp and @garagatyi for chasing the issue. Indeed, it seems the error is in the IDE, which is not updated on workspace events.

why/when does stopping workspaces & starting a new workspace sometimes take longer than expected

Agree, we need to get an understanding of this.

ibuziuk commented 6 years ago

why/when does stopping workspaces & starting a new workspace sometimes take longer than expected

There were some assumptions about why stopping workspaces & starting a new workspace could sometimes take longer than expected (> 5 seconds):

  1. Number of user workspaces:

This is not the case, since the startup time of a workspace does not depend on the number of existing workspaces. Basically, it takes ~1 second to create & start a workspace when there are 30 workspaces already created (the same figure applies when the user has no workspaces at all):

{"time":"2018-04-23T11:33:46.566+00:00","msg":"Async startup took (nano seconds): 936720630","logger":"io.fabric8.che.starter.client.WorkspaceClient","thread":"ThreadPoolTaskExecutor-1","level":"INFO","HOSTNAME":"che-starter-79-7jb24","req_id":"6a5a5c7d-cd33-49d8-a590-3a24465afb26","identity_id":"266debf2-570c-467e-98c9-f53b90172cbd"}
--
  | {"time":"2018-04-23T11:33:46.566+00:00","msg":"Async startup took (seconds): 0","logger":"io.fabric8.che.starter.client.WorkspaceClient","thread":"ThreadPoolTaskExecutor-1","level":"INFO","HOSTNAME":"che-starter-79-7jb24","req_id":"6a5a5c7d-cd33-49d8-a590-3a24465afb26","identity_id":"266debf2-570c-467e-98c9-f53b90172cbd"}
  2. Stopping of the previously running workspace takes longer than expected:

Basically, when there is already one workspace running and the user tries to create another one, che-starter stops the first one before starting the new one. This takes longer, since polling of the workspace status is in place, but it still happens quite fast (1-3 seconds) and cannot explain why we were able to reproduce the issue so consistently (a sketch of this stop-then-start polling follows the log line below):

{"time":"2018-04-23T11:41:28.465+00:00","msg":"Async startup took (nano seconds): 1419382584","logger":"io.fabric8.che.starter.client.WorkspaceClient","thread":"ThreadPoolTaskExecutor-1","level":"INFO","HOSTNAME":"che-starter-79-7jb24","req_id":"e7995c32-8297-49c5-bbbd-6109a1bb7111","identity_id":"266debf2-570c-467e-98c9-f53b90172cbd"}

  3. Delay of async execution due to the ThreadPoolTaskExecutor configuration:

I have created a project with a synthetic test where an async service is called n times during processing (the same async executor config as in che-starter) [1]. There were no delays spotted for 1000 simultaneous async requests. According to zabbix, the max number of concurrent workspaces over the last month is 15, and even if all those 15 were started at the same time there would be no issues with async processing on the che-starter side. A sketch of this kind of executor configuration is below.
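
A minimal sketch of that kind of Spring async executor setup (pool sizes are illustrative, not che-starter's actual values):

```java
import java.util.concurrent.Executor;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncExecutorSketch {

    @Bean
    public Executor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(5);    // threads kept ready for async workspace startups
        executor.setMaxPoolSize(10);    // hard cap under load
        executor.setQueueCapacity(25);  // startups beyond the cap wait in this queue
        executor.initialize();
        return executor;
    }
}
```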

Basically, none of those 3 can explain the failures. Continuing to investigate...

[1] https://github.com/ibuziuk/spring-boot-async-demo

ibuziuk commented 6 years ago

I have done more integration tests based on @rhopp's scripts [1]. One issue that I have spotted so far is that when multiple requests for workspace creation come from a single user simultaneously (create and start 10 workspaces), workspaces end up starting and stopping each other unpredictably, e.g. workspace 1 is starting / workspace 2 is created and its start waits until 1 is stopped / workspace 3 waits until 2 is stopped, etc. Since this is an async process, it may result in long-running async request processing:

{"time":"2018-04-23T16:09:26.959+00:00","msg":"Async startup took (seconds): 23","logger":"io.fabric8.che.starter.client.WorkspaceClient","thread":"ThreadPoolTaskExecutor-1","level":"INFO","HOSTNAME":"che-starter-79-7jb24","req_id":"wjxtgsczfbt1r7wh","identity_id":"8b469293-4f57-4815-8d7d-dd5bbad27a50"}

I was not able to reproduce any issues when 1 workspace is created & started by 10 users simultaneously - startup still takes ~1 second:

{"time":"2018-04-23T16:20:01.203+00:00","msg":"Async startup took (seconds): 1","logger":"io.fabric8.che.starter.client.WorkspaceClient","thread":"ThreadPoolTaskExecutor-1","level":"INFO","HOSTNAME":"che-starter-79-7jb24","req_id":"jt8smtn0anpupndr","identity_id":"8b469293-4f57-4815-8d7d-dd5bbad27a50"}

One more observation / speculation: when there is a volume mount issue, it also results in the "Workspace is not running" pop-up, which I was able to reproduce:

(screenshot of the pop-up)

I am wondering whether the volume mount issue could potentially be aggravated under load, when multiple users try to mount multiple volumes. This could technically lead to multiple "Unable to mount volumes for pod" issues, and workspaces would fail to start - the only thing shown in the UI being "workspace is not running". If the volume was mounted before the user hits refresh, then the workspace would start successfully. A similar problem was reported by @ldimaggi [2], but no openshift events were provided and it does not seem to be reproducible anymore.

[1] https://gist.github.com/rhopp/e4fd70a0494fd740e33c4c243b302ef5
[2] https://github.com/openshiftio/openshift.io/issues/3135#issuecomment-381781221

ibuziuk commented 6 years ago

Today we emulated a click party during the standup and @rhopp was able to reproduce the problem. Some insights:

Some info from logs:

April 24th 2018, 15:37:43.444  Async startup took (seconds): 1
April 24th 2018, 15:37:03.637  Async startup took (seconds): 1
April 24th 2018, 15:36:46.612  Async startup took (seconds): 1
April 24th 2018, 15:36:22.278  Async startup took (seconds): 1
April 24th 2018, 15:34:32.304  Async startup took (seconds): 8
April 24th 2018, 15:33:44.956  Async startup took (seconds): 0

The second workspace startup took 8 seconds, and this led to "Workspace is not running". By req_id I was able to identify that stopping the previously running workspace took those 8 seconds:

April 24th 2018, 15:34:32.107 Polling Che server for workspace 'hdshj' status...
April 24th 2018, 15:34:31.076 Polling Che server for workspace 'hdshj' status...
April 24th 2018, 15:34:31.076 Polling Che server for workspace 'hdshj' status...
April 24th 2018, 15:34:30.044 Polling Che server for workspace 'hdshj' status...
April 24th 2018, 15:34:29.021 Polling Che server for workspace 'hdshj' status...
April 24th 2018, 15:34:27.942 Polling Che server for workspace 'hdshj' status...
April 24th 2018, 15:34:26.889 Polling Che server for workspace 'hdshj' status...
April 24th 2018, 15:34:25.831 Polling Che server for workspace 'hdshj' status...
April 24th 2018, 15:34:24.800 Polling Che server for workspace 'hdshj' status...

I was provided with the openshift events and there were a few failed volume mount problems like the ones below, but those do not seem to be related to the startup of this particular workspace (workspacenmba4t4gkgg27lr1):

Error: cannot find volume \"claim-che-workspace\" to mount into container \"container\""

Unable to mount volumes for pod \"workspaces6c4bt2uivk938nq.dockerimage_rhopp-preview-che(88c12e98-47c4-11e8-91b0-029a5a55534e)\": timeout expired waiting for volumes to attach/mount for pod \"rhopp-preview-che\"/\"workspaces6c4bt2uivk938nq.dockerimage\". list of unattached/unmounted volumes=[claim-che-workspace default-token-2n4wf]

ibuziuk commented 6 years ago

@rhopp the PR with the removal of the startup logic is available on prod-preview; workspace auto-start is also set to true on prod-preview OSD. Could you please verify?

vparfonov commented 6 years ago

Is this still relevant?

ibuziuk commented 6 years ago

@vparfonov it is already fixed on prod, closing.