ZhuozhaoLi opened this issue 4 years ago
This seems like an oversight in our k8s provider. We need the interchange to correctly scale down the pod of the lost manager, rather than let the manager kill itself, since k8s will automatically try to restart it.
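A minimal sketch of what that scale-down could look like with the official `kubernetes` Python client, assuming the interchange knows the lost manager's pod name (the function name, namespace, and grace period below are hypothetical, not the actual funcX code):

```python
# Sketch only: when the interchange declares a manager lost, delete that
# manager's pod directly instead of relying on the manager process to exit
# (which k8s would otherwise restart in place).
from kubernetes import client, config


def scale_down_lost_manager(pod_name: str, namespace: str = "parsl") -> None:
    """Delete the pod backing a lost manager so k8s does not restart it in place."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    core_v1 = client.CoreV1Api()
    core_v1.delete_namespaced_pod(
        name=pod_name,
        namespace=namespace,
        body=client.V1DeleteOptions(grace_period_seconds=0),
    )
```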
Maybe migrate the Kubernetes manager to use native Kubernetes job management?
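For reference, a hedged sketch of what launching a manager as a native Job might look like with the Python client; the image, command, and names are illustrative placeholders, not the actual provider code on the branch:

```python
# Sketch only: run a manager as a Kubernetes Job rather than a bare pod, so a
# failed/lost manager is not restarted in place.
from kubernetes import client, config


def submit_manager_job(job_name: str, namespace: str = "parsl") -> None:
    config.load_kube_config()
    container = client.V1Container(
        name="funcx-manager",
        image="funcx/kube-endpoint:latest",  # placeholder image
        command=["process_worker_pool.py"],  # placeholder command
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=job_name),
        spec=client.V1JobSpec(
            backoff_limit=0,  # do not retry a failed manager
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",  # Jobs require Never or OnFailure
                    containers=[container],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```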
@BenGalewsky I tried a first version of the migration to a Job on the branch k8s-manager-no-restart, but it shows the following error on River. I think that is because of the River permission problem?
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 25 May 2021 00:36:28 GMT', 'Content-Length': '325'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:parsl:funcx-test-funcx-endpoint\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"parsl\"","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}
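That 403 looks like an RBAC gap rather than a bug in the branch: presumably the Role bound to `system:serviceaccount:parsl:funcx-test-funcx-endpoint` only covers pods, not `jobs.batch`. A rough sketch of the extra Role/RoleBinding that would be needed, expressed as dict manifests via the Python client (the namespace and service-account name are taken from the error message; the Role/RoleBinding names and verb list are guesses, and the same manifests could just as well be applied with kubectl wherever the endpoint's RBAC is defined):

```python
# Sketch only: grant the endpoint's service account permission to manage Jobs
# in the "parsl" namespace.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "funcx-endpoint-jobs", "namespace": "parsl"},
    "rules": [{
        "apiGroups": ["batch"],
        "resources": ["jobs"],
        "verbs": ["create", "get", "list", "watch", "delete"],
    }],
}
binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "funcx-endpoint-jobs", "namespace": "parsl"},
    "subjects": [{
        "kind": "ServiceAccount",
        "name": "funcx-test-funcx-endpoint",
        "namespace": "parsl",
    }],
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "Role",
        "name": "funcx-endpoint-jobs",
    },
}

rbac.create_namespaced_role(namespace="parsl", body=role)
rbac.create_namespaced_role_binding(namespace="parsl", body=binding)
```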
Cross-referencing Parsl PR https://github.com/Parsl/parsl/pull/2171, which addresses a similar problem in Parsl with HTCondor.
Currently the interchange does not force a manager on k8s to shut down when a ManagerLost error occurs, so the manager's pod keeps crash-looping and stays there. Relevant to #254