redhat-developer / rh-che

Eclipse Che hosted by Red Hat
https://che.openshift.io/
Eclipse Public License 2.0

Tool for monitoring untracked workspace deployments #1690

Closed skabashnyuk closed 4 years ago

skabashnyuk commented 4 years ago

Issue problem: While testing https://github.com/eclipse/che/issues/15006 on che.openshift.io I ran into a situation where an exception happened during workspace stop/delete. That exception left the workspace deployment behind on the cluster, untracked by Che.

Red Hat Che version:

version: (help/about menu)

Reproduction Steps:

Describe how to reproduce the problem

Runtime:

runtime used:

skabashnyuk commented 4 years ago

I think https://github.com/redhat-developer/rh-che/issues/1691 can cause untracked workspace deployments too.

ibuziuk commented 4 years ago

Untracked deployments are a side effect of https://github.com/eclipse/che/issues/15006. I will proceed with cleanup once we have a fix on production.

skabashnyuk commented 4 years ago

This is not the first time we have had such a thing, and it will not be the last. Is it possible to have some general solution to report and clean up such things?

amisevsk commented 4 years ago

> Is it possible to have some general solution to report and clean up such things?

It may be possible, but it would be hard to envision as a Che feature; how do we check users and their (unique) namespaces in a way that scales to thousands of users?

amisevsk commented 4 years ago

Before sinking significant time into implementing something along these lines, I think some more design discussion is required.

As I see it, there are three options for implementing this functionality, in terms of where and how this service would run.

  1. The workspace tracker runs separately from all clusters, potentially as a cron job or similar.

    Pros:

    • Fairly easy to implement -- could be as simple as, effectively (see the sketch at the end of this comment):
      pods = oc get po -l che.workspace_id
      for each pod in pods:
          check if the workspace is RUNNING in the DB
          remove its resources if it isn't

    Cons:

    • Not clear how permissions would be managed (would need cluster-write access to the tenant clusters, plus the database secret and cluster access for the dsaas cluster).
    • Would have to manage multiple clusters and connections (get pods from tenant clusters, get running workspaces from dsaas database).
    • Managing deployment/running would require SD input
  2. Runs as a separate service in the dsaas/preview cluster

    Pros:

    • Same deployment strategy as other services (e.g. k8s-image-puller)
    • Could reuse existing functionality (DB connection, rhche secret) to automatically look into tenant clusters and get configuration that Che uses.

    Cons:

    • Another service, image, and CI to maintain.
    • Would have to share rhche secrets and be another service that uses the rhche SA token
  3. Service is a scheduled job of Che server.

    Pros:

    • Could be implemented fairly easily since e.g. database communication logic is available
    • Would be easier to upstream, and simplify deployments if we chose to upstream it

    Cons:

    • Could be a long-running job that bogs down normal Che functionality just to check for a few stray workspaces out of thousands
    • Upstream utility is not clear
    • If downstream, would need to be potentially maintained for each Che version.
    • Upstreaming might not be meaningful as namespaces are handled differently.

Personally, I'm leaning towards option 2, but that's because we know how to deploy and update such a service. I don't think it's suitable to plan this sort of thing for upstream, since the assumptions are very different.
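
For what it's worth, here is a minimal sketch of what the option 1 check could look like (referenced from the pros above), assuming oc access to the tenant clusters; is_running_in_db() is a hypothetical placeholder for however we would query the dsaas database:

```python
"""Cron-job style sweep: list workspace pods, then remove resources for
workspaces that the Che database no longer considers running."""
import json
import subprocess

LABEL = "che.workspace_id"

def workspace_pods():
    # Equivalent of `oc get po --all-namespaces -l che.workspace_id -o json`.
    out = subprocess.run(
        ["oc", "get", "pods", "--all-namespaces", "-l", LABEL, "-o", "json"],
        check=True, capture_output=True, text=True).stdout
    return json.loads(out)["items"]

def is_running_in_db(workspace_id):
    # Hypothetical: check the dsaas database (or Che API) for a RUNNING entry.
    raise NotImplementedError

for pod in workspace_pods():
    ws_id = pod["metadata"]["labels"][LABEL]
    ns = pod["metadata"]["namespace"]
    if not is_running_in_db(ws_id):
        # Remove everything labelled with this workspace id in that namespace.
        subprocess.run(["oc", "delete", "all", "-n", ns,
                        "-l", f"{LABEL}={ws_id}"], check=True)
```

The permissions listed in the cons are exactly what this needs: read access to tenant pods, delete access to the labelled resources, and a way to reach the dsaas database.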

ibuziuk commented 4 years ago

@amisevsk IMO we should stick to a solution that could be reused upstream (not necessarily embedded in che-server; it could be an auxiliary deployment like k8s-image-puller). Speaking about option 3 (Service is a scheduled job of Che server): isn't that something @sleshchenko already implemented upstream, so that we just need to make sure it works properly on the Hosted Che side?

ibuziuk commented 4 years ago

Talked with @sleshchenko, and what we currently have upstream is RuntimeHangingDetector - https://github.com/eclipse/che/blob/master/infrastructures/kubernetes/src/main/java/org/eclipse/che/workspace/infrastructure/kubernetes/RuntimeHangingDetector.java

It tracks STARTING/STOPPING runtimes and forcibly stops them if they do not change status before a timeout is reached.

amisevsk commented 4 years ago

Yeah, the RuntimeHangingDetector is a different case, since we can look for STARTING and STOPPING workspaces in the Che db and track those; for untracked deployments, the workspace may not even be in the database as STOPPED, if e.g. the user has deleted it.

The flow I would follow for this would be:

  1. For each user, get everything with label che.workspace_id in their <username>-che namespace
  2. For each workspace id there, check if the workspace has a RUNNING entry in the database
  3. If it doesn't, remove all resources labelled che.workspace_id=<workspaceId> in that user's namespace (a rough sketch follows below).

The worry comes in when we have to scale to thousands of users.
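
A rough sketch of that flow, assuming the <username>-che namespace convention and the che.workspace_id label from this thread; get_workspace_status() is a hypothetical stand-in for the database lookup:

```python
import json
import subprocess

LABEL = "che.workspace_id"

def run_oc(*args):
    # Thin wrapper around the oc CLI that returns the parsed JSON item list.
    out = subprocess.run(["oc", *args, "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)["items"]

def get_workspace_status(workspace_id):
    # Hypothetical: look the workspace up in the Che database.
    raise NotImplementedError

# Step 1: per-user namespaces follow the <username>-che convention.
for ns in run_oc("get", "namespaces"):
    name = ns["metadata"]["name"]
    if not name.endswith("-che"):
        continue
    ws_ids = {item["metadata"]["labels"][LABEL]
              for item in run_oc("get", "all", "-n", name, "-l", LABEL)}
    for ws_id in ws_ids:
        # Steps 2-3: remove the labelled resources if the workspace is not RUNNING.
        if get_workspace_status(ws_id) != "RUNNING":
            subprocess.run(["oc", "delete", "all", "-n", name,
                            "-l", f"{LABEL}={ws_id}"], check=True)
```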

ibuziuk commented 4 years ago

> The worry comes in when we have to scale to thousands of users.

Could we not go the other way round:

  1. get all workspace pods on the cluster, e.g. oc get pod --all-namespaces -o wide --selector che.workspace_id
  2. (optional) identify those which have been running for more than n hours (e.g. 24h); see the sketch below for this filter.
  3. check if the workspace dedicated to the pod (derived from the pod name) exists and check its state (we might need to use an admin account for that)
  4. If the workspace does not exist or its status is STOPPED, remove all resources labelled che.workspace_id=<workspaceId> in that user's namespace.
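
A sketch of the age filter in step 2, under the same oc access assumptions; the 24h threshold is just the example value from the list:

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)

def old_workspace_pods():
    # Step 1: all workspace pods across namespaces, selected by label.
    out = subprocess.run(
        ["oc", "get", "pods", "--all-namespaces", "-o", "json",
         "--selector", "che.workspace_id"],
        check=True, capture_output=True, text=True).stdout
    pods = json.loads(out)["items"]
    now = datetime.now(timezone.utc)
    old = []
    for pod in pods:
        # Step 2: keep only pods that have been running longer than MAX_AGE.
        started = pod["status"].get("startTime")
        if not started:
            continue
        started = datetime.fromisoformat(started.replace("Z", "+00:00"))
        if now - started > MAX_AGE:
            old.append(pod)
    return old
```

Steps 3 and 4 would then check each pod's workspace with an admin account and delete the labelled resources, as in the earlier sketches.
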
amisevsk commented 4 years ago

@ibuziuk The issue is an access problem:

> get all workspace pods on the cluster, e.g. oc get pod --all-namespaces -o wide --selector che.workspace_id

This requires admin access to the tenant clusters, so we're no longer talking about something that runs in dsaas without a new config (I don't know if SD supports this flow).

ibuziuk commented 4 years ago

Closing; untracked deployments are currently expected to be tracked manually.