globus / globus-compute

Globus Compute: High Performance Function Serving for Science
https://www.globus.org/compute
Apache License 2.0

Properly kill manager when a ManagerLost problem happens on k8s #255

Open ZhuozhaoLi opened 4 years ago

ZhuozhaoLi commented 4 years ago

Currently, the interchange does not force-kill a manager on k8s when a ManagerLost error occurs, so the manager keeps crash-looping and its pod stays around. Related to #254.

ryanchard commented 4 years ago

This seems like an oversight in our k8s provider. We need the interchange to correctly scale down the pod of the lost manager rather than let the manager kill itself, because k8s will automatically try to restart it.
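A minimal sketch of what "scale down the pod of the lost manager" could look like, assuming each manager runs in its own pod and the interchange knows that pod's name. The function name `remove_lost_manager_pod` and its parameters are hypothetical, not part of the funcX/Globus Compute code base; `core_api` is assumed to be a `kubernetes.client.CoreV1Api` instance, whose `delete_namespaced_pod` call is the only Kubernetes API used here.

```python
def remove_lost_manager_pod(core_api, pod_name, namespace, grace_period_seconds=0):
    """Delete the pod of a lost manager so k8s stops restarting it.

    Hypothetical helper: `core_api` is assumed to behave like
    kubernetes.client.CoreV1Api. Returns True if the delete request was
    accepted, False if the pod was already gone.
    """
    try:
        core_api.delete_namespaced_pod(
            name=pod_name,
            namespace=namespace,
            grace_period_seconds=grace_period_seconds,
        )
    except Exception as exc:
        # kubernetes.client.exceptions.ApiException carries the HTTP status
        # in `.status`; a 404 means the pod no longer exists.
        if getattr(exc, "status", None) == 404:
            return False
        raise
    return True
```

Deleting the pod from the interchange side, instead of letting the manager process exit, avoids the restart loop only if nothing else (for example a Deployment or ReplicaSet) recreates the pod; that is an assumption about how the provider launches managers.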

BenGalewsky commented 3 years ago

Maybe migrate the Kubernetes manager to use native Kubernetes job management?
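As an illustration of this suggestion, here is a hedged sketch of a Job manifest built as a plain Python dict. The helper name `manager_job_manifest` and the example image and command are hypothetical; the relevant parts are `restartPolicy: Never` and `backoffLimit: 0`, which tell Kubernetes not to restart or retry a manager that has exited. A dict like this could be passed as the `body` of `kubernetes.client.BatchV1Api.create_namespaced_job`.

```python
def manager_job_manifest(name, image, command):
    """Build a batch/v1 Job manifest for a single manager (illustrative only)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Do not create replacement pods after a failure.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Jobs only allow "Never" or "OnFailure"; "Never" means a
                    # crashed manager container is not restarted in place.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": list(command)}
                    ],
                }
            },
        },
    }


# Hypothetical image and command, for illustration only.
manifest = manager_job_manifest(
    "manager-example", "example/manager:latest", ["run-manager"]
)
```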

ZhuozhaoLi commented 3 years ago

@BenGalewsky I tried a first version that migrates it to a Job on the branch k8s-manager-no-restart, but it shows the following error on River. I think that is because of a permission problem on River?

HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 25 May 2021 00:36:28 GMT', 'Content-Length': '325'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:serviceaccount:parsl:funcx-test-funcx-endpoint\" cannot create resource \"jobs\" in API group \"batch\" in the namespace \"parsl\"","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}
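
The 403 above says the service account `funcx-test-funcx-endpoint` in namespace `parsl` may not create `jobs` in the `batch` API group. Assuming the cause is a missing RBAC rule, a cluster administrator would need to grant something like the following Role and RoleBinding. This is a sketch built as Python dicts: the helper name, the role name `job-manager`, and the verb list are hypothetical choices, while the service account, namespace, resource, and API group are taken from the error message.

```python
def job_rbac_manifests(namespace, service_account, role_name="job-manager"):
    """Return (Role, RoleBinding) manifests allowing a service account to manage Jobs."""
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["batch"],
                "resources": ["jobs"],
                # "create" is what the 403 complains about; the other verbs
                # are an assumption about what managing Jobs would need.
                "verbs": ["create", "get", "list", "watch", "delete"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": role_name, "namespace": namespace},
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }
    return role, binding


role, binding = job_rbac_manifests("parsl", "funcx-test-funcx-endpoint")
```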

benclifford commented 2 years ago

Cross-reference: Parsl PR https://github.com/Parsl/parsl/pull/2171, which deals with a similar problem in Parsl with HTCondor.