redhat-developer / rh-che

Eclipse Che hosted by Red Hat
https://che.openshift.io/
Eclipse Public License 2.0

Tool for monitoring untracked workspace deployments #1690

Closed skabashnyuk closed 4 years ago

skabashnyuk commented 4 years ago

Issue problem: While testing https://github.com/eclipse/che/issues/15006 on che.openshift.io I ran into a situation where an exception happened during workspace stop/delete. That exception left the workspace deployment behind on the cluster, untracked by Che.

Red Hat Che version:

version: (help/about menu)

Reproduction Steps:

Describe how to reproduce the problem

Runtime:

runtime used:

skabashnyuk commented 4 years ago

I think https://github.com/redhat-developer/rh-che/issues/1691 can cause untracked workspace deployments too.

ibuziuk commented 4 years ago

Untracked deployments are a side effect of https://github.com/eclipse/che/issues/15006. I will proceed with cleanup once we have a fix on production.

skabashnyuk commented 4 years ago

This is not the first time we have had such a thing, and it will not be the last. Is it possible to have some general solution to report and clean up such things?

amisevsk commented 4 years ago

> Is it possible to have some general solution to report and clean up such things?

It may be possible, but it would be hard to envision as a Che feature; how do we check users and their (unique) namespaces in a way that scales to thousands of users?

amisevsk commented 4 years ago

Before sinking significant time into implementing something along these lines, I think some more design discussion is required.

As I see it, there are three options for implementing this functionality, in terms of where and how this service would run.

  1. The workspace tracker runs separately from all clusters, potentially as a cron job or similar.

    Pros:

    • Fairly easy to implement -- could be as simple as, effectively (see the sketch at the end of this comment):
      pods = oc get po -l che.workspace_id
      for each pod in pods:
          check if the workspace is RUNNING in the DB
          remove its resources if it isn't

    Cons:

    • Not clear how permissions would be managed (would need cluster-write access to the tenant clusters, plus the database secret and cluster access for the dsaas cluster).
    • Would have to manage multiple clusters and connections (get pods from tenant clusters, get running workspaces from dsaas database).
    • Managing deployment/running would require SD input
  2. Runs as a separate service in the dsaas/preview cluster

    Pros:

    • Same deployment strategy as other services (e.g. k8s-image-puller)
    • Could reuse existing functionality (DB connection, rhche secret) to automatically look into tenant clusters and get configuration that Che uses.

    Cons:

    • Another service, image, and CI to maintain.
    • Would have to share rhche secrets and be another service that uses the rhche SA token
  3. Service is a scheduled job of Che server.

    Pros:

    • Could be implemented fairly easily since e.g. database communication logic is available
    • Would be easier to upstream, and simplify deployments if we chose to upstream it

    Cons:

    • Could be a long-running job that bogs down normal Che functionality just to check for a few stray workspaces out of thousands
    • Upstream utility is not clear
    • If downstream, would need to be potentially maintained for each Che version.
    • Upstreaming might not be meaningful as namespaces are handled differently.

Personally, I'm leaning towards option 2, but that's because we know how to deploy and update such a service. I don't think it's suitable to plan this sort of thing for upstream, since the assumptions are very different.
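
For what it's worth, here is a minimal sketch of what the option 1 check could look like (referenced from the pros above), assuming oc access to the tenant clusters; is_running_in_db() is a hypothetical placeholder for however we would query the dsaas database:

```python
"""Cron-job style sweep: list workspace pods, then remove resources for
workspaces that the Che database no longer considers running."""
import json
import subprocess

LABEL = "che.workspace_id"

def workspace_pods():
    # Equivalent of `oc get po --all-namespaces -l che.workspace_id -o json`.
    out = subprocess.run(
        ["oc", "get", "pods", "--all-namespaces", "-l", LABEL, "-o", "json"],
        check=True, capture_output=True, text=True).stdout
    return json.loads(out)["items"]

def is_running_in_db(workspace_id):
    # Hypothetical: check the dsaas database (or Che API) for a RUNNING entry.
    raise NotImplementedError

for pod in workspace_pods():
    ws_id = pod["metadata"]["labels"][LABEL]
    ns = pod["metadata"]["namespace"]
    if not is_running_in_db(ws_id):
        # Remove everything labelled with this workspace id in that namespace.
        subprocess.run(["oc", "delete", "all", "-n", ns,
                        "-l", f"{LABEL}={ws_id}"], check=True)
```

The permissions listed in the cons are exactly what this needs: read access to tenant pods, delete access to the labelled resources, and a way to reach the dsaas database.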

ibuziuk commented 4 years ago

@amisevsk IMO we should stick to a solution that could be reused upstream (not necessarily embedded in che-server; it could be an auxiliary deployment like k8s-image-puller). Speaking about option 3 (Service is a scheduled job of Che server): isn't that something @sleshchenko already implemented upstream, so that we just need to make sure it works properly on the Hosted Che side?

ibuziuk commented 4 years ago

Talked with @sleshchenko, and what we currently have upstream is RuntimeHangingDetector - https://github.com/eclipse/che/blob/master/infrastructures/kubernetes/src/main/java/org/eclipse/che/workspace/infrastructure/kubernetes/RuntimeHangingDetector.java

It tracks STARTING/STOPPING runtimes and forcibly stops them if they do not change status before a timeout is reached.

amisevsk commented 4 years ago

Yeah, the RuntimeHangingDetector is a different case, since we can look for STARTING and STOPPING workspaces in the Che db and track those; for untracked deployments, the workspace may not even be in the database as STOPPED, if e.g. the user has deleted it.

The flow I would follow for this would be:

  1. For each user, get everything with label che.workspace_id in their <username>-che namespace
  2. For each workspace id there, check if the workspace has a RUNNING entry in the database
  3. If it doesn't, remove all resources labelled che.workspace_id=<workspaceId> in that user's namespace (a rough sketch follows below).

The worry comes in when we have to scale to thousands of users.
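
A rough sketch of that flow, assuming the <username>-che namespace convention and the che.workspace_id label from this thread; get_workspace_status() is a hypothetical stand-in for the database lookup:

```python
import json
import subprocess

LABEL = "che.workspace_id"

def run_oc(*args):
    # Thin wrapper around the oc CLI that returns the parsed JSON item list.
    out = subprocess.run(["oc", *args, "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)["items"]

def get_workspace_status(workspace_id):
    # Hypothetical: look the workspace up in the Che database.
    raise NotImplementedError

# Step 1: per-user namespaces follow the <username>-che convention.
for ns in run_oc("get", "namespaces"):
    name = ns["metadata"]["name"]
    if not name.endswith("-che"):
        continue
    ws_ids = {item["metadata"]["labels"][LABEL]
              for item in run_oc("get", "all", "-n", name, "-l", LABEL)}
    for ws_id in ws_ids:
        # Steps 2-3: remove the labelled resources if the workspace is not RUNNING.
        if get_workspace_status(ws_id) != "RUNNING":
            subprocess.run(["oc", "delete", "all", "-n", name,
                            "-l", f"{LABEL}={ws_id}"], check=True)
```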

ibuziuk commented 4 years ago

> The worry comes in when we have to scale to thousands of users.

Could we not go the other way round:

  1. get all workspace pods on the cluster, e.g. oc get pod --all-namespaces -o wide --selector che.workspace_id
  2. (optional) identify those which have been running for more than n hours (e.g. 24h); see the sketch below for this filter.
  3. check if the workspace dedicated to the pod (derived from the pod name) exists and check its state (we might need to use an admin account for that)
  4. If the workspace does not exist or its status is STOPPED, remove all resources labelled che.workspace_id=<workspaceId> in that user's namespace.
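
A sketch of the age filter in step 2, under the same oc access assumptions; the 24h threshold is just the example value from the list:

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)

def old_workspace_pods():
    # Step 1: all workspace pods across namespaces, selected by label.
    out = subprocess.run(
        ["oc", "get", "pods", "--all-namespaces", "-o", "json",
         "--selector", "che.workspace_id"],
        check=True, capture_output=True, text=True).stdout
    pods = json.loads(out)["items"]
    now = datetime.now(timezone.utc)
    old = []
    for pod in pods:
        # Step 2: keep only pods that have been running longer than MAX_AGE.
        started = pod["status"].get("startTime")
        if not started:
            continue
        started = datetime.fromisoformat(started.replace("Z", "+00:00"))
        if now - started > MAX_AGE:
            old.append(pod)
    return old
```

Steps 3 and 4 would then check each pod's workspace with an admin account and delete the labelled resources, as in the earlier sketches.
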
amisevsk commented 4 years ago

@ibuziuk The issue is an access problem:

> get all workspace pods on the cluster, e.g. oc get pod --all-namespaces -o wide --selector che.workspace_id

This requires admin access to the tenant clusters, so we're no longer talking about something that runs in dsaas without a new config (I don't know if SD supports this flow).

ibuziuk commented 4 years ago

Closing; untracked deployments are currently expected to be tracked manually.