Closed: skabashnyuk closed this issue 4 years ago.
I think https://github.com/redhat-developer/rh-che/issues/1691 this can cause Untracked workspace deployments too.
Untracked deployments are a side effect of https://github.com/eclipse/che/issues/15006. I will proceed with cleanup once we have a fix in production.
This is not the first time we have hit such a thing, and it won't be the last. Is it possible to have some general solution to report and clean up such things?
It may be possible, but this would be hard to envision as a Che feature; how do we check users and their (unique) namespaces in a way that scales to thousands of users?
Before sinking significant time into implementing something along these lines, I think some more design discussion is required.
As I see it, there are three options for implementing this functionality, in terms of where and how this service would run.

1. The workspace tracker runs separately from all clusters, potentially as a cron job or similar:

   ```
   pods = oc get po -l che.workspace.id
   for each pod in pods:
       check if workspace is running in DB
       remove resources if it isn't
   ```

   Pros:
   Cons:

2. The tracker runs as a separate service in the dsaas/preview cluster.

   Pros:
   Cons:

3. The service is a scheduled job of Che server.

   Pros:
   Cons:
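The check-and-clean loop behind option 1 can be sketched as follows. This is a minimal, hypothetical sketch: the pod list would really come from `oc get po -l che.workspace.id` and the status check from the Che database, but both are stubbed here with plain data so the logic is visible; all function and variable names are mine, not part of Che.

```python
# Hypothetical sketch of the cleanup loop from option 1.
# In practice `pods` comes from the cluster API and `db_running_ids`
# from a query against the Che workspace database.

def find_stale_workspace_pods(pods, db_running_ids):
    """Return names of pods whose workspace is not RUNNING in the DB."""
    stale = []
    for pod in pods:
        workspace_id = pod["labels"].get("che.workspace.id")
        if workspace_id and workspace_id not in db_running_ids:
            # The real job would delete the pod's resources here.
            stale.append(pod["name"])
    return stale

# Simulated input: two workspace pods on the cluster, only one RUNNING in the DB.
pods = [
    {"name": "workspace1-pod", "labels": {"che.workspace.id": "ws-1"}},
    {"name": "workspace2-pod", "labels": {"che.workspace.id": "ws-2"}},
]
running_in_db = {"ws-1"}

print(find_stale_workspace_pods(pods, running_in_db))  # ['workspace2-pod']
```

The real implementation would replace the stubs with an `oc`/Kubernetes API call and a database query, and delete rather than merely report the stale resources.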
Personally, I'm leaning towards option 2, but that's because we know how to deploy and update such a service. I don't think it's suitable to plan this sort of thing for upstream, since the assumptions are very different.
@amisevsk IMO we should stick to a solution that could be re-used in the upstream (not necessarily embedded in che-server, it could be auxiliary deployment like k8s-image-puller).
Speaking about option 3 (the service is a scheduled job of Che server):
- isn't it something that @sleshchenko already implemented upstream, and we just need to make sure it works properly on the Hosted Che side?
Talked with @sleshchenko; what we currently have in the upstream is RuntimeHangingDetector - https://github.com/eclipse/che/blob/master/infrastructures/kubernetes/src/main/java/org/eclipse/che/workspace/infrastructure/kubernetes/RuntimeHangingDetector.java
It tracks `STARTING`/`STOPPING` runtimes and forcibly stops them if they do not change status before a timeout is reached.
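The idea behind the detector can be sketched in a few lines. The real implementation is the Java class linked above; this is only an illustrative Python sketch with hypothetical names, showing the core mechanism: remember when a runtime entered `STARTING`/`STOPPING`, and flag it once it has stayed there past a timeout.

```python
import time

# Hypothetical sketch of the RuntimeHangingDetector idea (the actual
# implementation is Java code in Che; names and structure here are mine).

class HangingDetector:
    def __init__(self, timeout_sec):
        self.timeout_sec = timeout_sec
        self.tracked = {}  # workspace_id -> (status, time it entered that status)

    def track(self, workspace_id, status, now=None):
        if status in ("STARTING", "STOPPING"):
            self.tracked[workspace_id] = (status, now if now is not None else time.time())
        else:
            # Reached a stable state (RUNNING/STOPPED): stop tracking it.
            self.tracked.pop(workspace_id, None)

    def hanging(self, now=None):
        """Workspaces stuck in a transitional state longer than the timeout."""
        now = now if now is not None else time.time()
        return [ws for ws, (_, entered) in self.tracked.items()
                if now - entered > self.timeout_sec]

detector = HangingDetector(timeout_sec=600)
detector.track("ws-1", "STARTING", now=0)
detector.track("ws-2", "STARTING", now=0)
detector.track("ws-2", "RUNNING", now=30)   # ws-2 started normally
print(detector.hanging(now=700))  # ['ws-1']
```

A periodic job would call `hanging()` and forcibly stop whatever it returns, which is roughly what the upstream detector does.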
Yeah, the RuntimeHangingDetector is a different case, since we can look for `STARTING` and `STOPPING` workspaces in the Che db and track those; for untracked deployments, the workspace may not even be in the database as `STOPPED`, if e.g. the user has deleted it.
The flow I would follow for this would be:

1. For each user, get all pods labelled `che.workspace_id` in their `<username>-che` namespace.
2. For each workspace id found, check whether there is a `RUNNING` entry in the database.
3. If there isn't, remove all resources labelled `che.workspace_id=<workspaceId>` in that user's namespace.

The worry comes in where we have to scale to thousands of users.
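The per-user flow above can be sketched as follows, with the `oc` calls stubbed out. Everything here is hypothetical: `list_pods` stands in for `oc get pod -l che.workspace_id -n <username>-che`, `db_status` for a Che database lookup, and `delete_resources` for the corresponding `oc delete` by label selector.

```python
# Hypothetical sketch of the per-user cleanup flow; the three callables
# stand in for the real `oc` and database operations.

def cleanup_user_namespace(username, list_pods, db_status, delete_resources):
    namespace = f"{username}-che"
    for pod in list_pods(namespace, selector="che.workspace_id"):
        workspace_id = pod["labels"]["che.workspace_id"]
        if db_status(workspace_id) != "RUNNING":
            # Workspace is stopped, deleted, or unknown: remove its resources.
            delete_resources(namespace, f"che.workspace_id={workspace_id}")

# Simulated cluster: one tracked (RUNNING) and one untracked workspace pod.
deleted = []
fake_pods = {"alice-che": [
    {"labels": {"che.workspace_id": "ws-a"}},
    {"labels": {"che.workspace_id": "ws-b"}},
]}
cleanup_user_namespace(
    "alice",
    list_pods=lambda ns, selector: fake_pods[ns],
    db_status=lambda ws: "RUNNING" if ws == "ws-a" else "STOPPED",
    delete_resources=lambda ns, sel: deleted.append((ns, sel)),
)
print(deleted)  # [('alice-che', 'che.workspace_id=ws-b')]
```

The scaling concern is visible in the shape of this loop: it has to be run once per user namespace, so thousands of users means thousands of API calls per sweep.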
Could we not go the other way round:

```
oc get pod --all-namespaces -o wide --selector che.workspace_id
```
@ibuziuk The issue is an access problem. Getting all workspace pods on the cluster, e.g.

```
oc get pod --all-namespaces -o wide --selector che.workspace_id
```

requires admin access to the tenant clusters, so we're no longer talking about something that runs in dsaas without a new config (I don't know if SD supports this flow); `oc get pod --all-namespaces` is not available to us there, AFAIK. Even if it were allowed, we would have to do something hacky like we do for the k8s-image-puller to check all the tenant clusters (we use four test accounts that we know proxy into the desired clusters).

Closing; untracked deployments are currently expected to be tracked manually.
Issue problem: During testing https://github.com/eclipse/che/issues/15006 on che.openshift.io I ran into a situation where an exception happened during workspace stop/delete. That exception caused the workspace deployment to become untracked.
Red Hat Che version:
version: (help/about menu)

Reproduction Steps:
Describe how to reproduce the problem

Runtime:
runtime used: (`minishift version` / `oc version`)