Closed · lucferbux closed this 3 years ago
LGTM. You'll need to add the actual leader-election image to the pod, though: https://github.com/Xaenalt/jupyterhub-odh-ha/blob/abad9fe9519c18866767deebb2821289d12d3dd1/jupyterhub-dc.yaml#L101-L103
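For reference, a rough sketch of what wiring in that sidecar could look like (the image, election name, and port here are placeholders; the linked jupyterhub-dc.yaml lines are the source of truth):

```sh
# Hypothetical: append a leader-elector sidecar to the DeploymentConfig.
# Image, election name, and port are placeholders, not the real values.
oc patch dc/jupyterhub --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/containers/-", "value": {
    "name": "leader-elector",
    "image": "gcr.io/google_containers/leader-elector:0.5",
    "args": ["--election=jupyterhub", "--http=0.0.0.0:4040"],
    "ports": [{"containerPort": 4040}]
  }}
]'
```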
Also, is that s2i run script executed before any of the others in the pod? I added it originally to /opt/app-root/builder/run, since that's what the JH pod invokes.
https://docs.openshift.com/container-platform/4.7/openshift_images/using_images/customizing-s2i-images.html#images-using-customizing-s2i-images-scripts-embedded_customizing-s2i-images

One thing we would want to note: if you scroll to the bottom of the above page, there is this warning:

> When wrapping the run script, you must use exec for invoking it to ensure signals are handled properly. The use of exec also precludes the ability to run additional commands after invoking the default image run script.

Does this mean it won't invoke the actual jupyterhub run script?
@Xaenalt This is the associated PR to odh-manifests with the sidecar: https://github.com/opendatahub-io/odh-manifests/pull/460
> When wrapping the run script, you must use exec for invoking it to ensure signals are handled properly. The use of exec also precludes the ability to run additional commands after invoking the default image run script.
>
> Does this mean it won't invoke the actual jupyterhub run script?
Yeah, I totally forgot to implement the exec call. Right now the run step just notifies when it's time to start running the JupyterHub notebook, so instead of the echo "Became leader! Start the program!" we should call the script.
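A minimal sketch of that change, assuming the sidecar's stock defaults (an HTTP endpoint on port 4040 returning a JSON body naming the leader) and the run script path mentioned above; whichever leadership check we end up with, the important part is the final exec:

```sh
#!/bin/bash
# Wrapper s2i run script (sketch): poll the leader-election sidecar until
# this pod is reported as the leader, then hand the process over to the
# real JupyterHub run script with exec so signals reach JupyterHub directly.
until curl -s http://localhost:4040/ | grep -q "\"name\":\"${HOSTNAME}\""; do
  sleep 5
done
# Nothing can run after this line: exec replaces the shell entirely.
exec /opt/app-root/builder/run
```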
> When wrapping the run script, you must use exec for invoking it to ensure signals are handled properly. The use of exec also precludes the ability to run additional commands after invoking the default image run script.
>
> Does this mean it won't invoke the actual jupyterhub run script?
I think it's done now; the script will only be called by the leader, just as the echo ... was only executed by the leader. I'm gonna try this approach to check if it works.
As we are still discussing the sleep time, I'm committing the fixes in separate commits; once it's sorted out I'll squash them.
Oh, it didn't put these comments out there until I hit unresolve, my bad
Ok thanks, I would then resolve @vpavlin's comments. Sorry, I haven't been able to test it because my cluster is still down.
Yeah, as nice as it'd be if the -http method worked: if you can find what's wrong and fix it, go for it.

To reproduce: use the -http method with a ReplicaSet and delete the current leader. The leader will never be updated to the new leader pod with the -http method, but it updates properly using the GET Endpoints method.
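A rough repro sketch under those assumptions (the election/service names and port are placeholders; the leader identity lives in the control-plane.alpha.kubernetes.io/leader annotation on the election's Endpoints object):

```sh
# 1. Ask the sidecar's -http endpoint who it thinks the leader is:
oc rsh dc/jupyterhub curl -s http://localhost:4040/
# 2. Compare with the identity recorded on the election Endpoints object:
oc get endpoints jupyterhub -o yaml | grep control-plane.alpha.kubernetes.io/leader
# 3. Delete the current leader pod and repeat steps 1-2: in the failing
#    -http mode the HTTP response keeps naming the deleted pod, while the
#    Endpoints annotation moves on to the new leader.
oc delete pod <current-leader-pod>
```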
> Yeah, as nice as it'd be if the -http method worked: if you can find what's wrong and fix it, go for it.
ok perfect 👍
Also sometimes I get this error:
Refreshing the page multiple times takes care of it though, which is odd. Another thing that is odd is:
```
/user/mshah@redhat.com/public
08:25:51.480 [ConfigProxy] info: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] info: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] info: 201 POST /api/routes/user/mshah@redhat.com/public
08:25:56.486 [ConfigProxy] info: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:25:56.490 [ConfigProxy] info: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] info: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] info: 201 POST /api/routes/user/mshah@redhat.com/public
08:26:01.495 [ConfigProxy] info: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:26:01.499 [ConfigProxy] info: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] info: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] info: 201 POST /api/routes/user/mshah@redhat.com/public
```
It keeps trying to add a route for a pod whose route already exists.
Ok, I've implemented your solution @maulikjs. As you stated, the service was load balancing between the 3 pods even though two of them were not ready, hence the API route error. I'm adding a readinessProbe to check whether the service is up, so the pod only starts accepting traffic once it is. Changes are live here: https://github.com/opendatahub-io/odh-manifests/pull/460
I can confirm this is due to the pods being reported as missing and the proxy routing requests to pods which are waiting for the leader and aren't running JH yet. We need to add a readinessProbe which won't report the pod as ready until it is actually the leader.
> We need to add a readinessProbe which won't report the pod as ready until it is actually the leader.
Will that result in failing/restarting pods when the readinessProbe check potentially never gets satisfied?
> We need to add a readinessProbe which won't report the pod as ready until it is actually the leader.
>
> Will that result in failing/restarting pods when the readinessProbe check potentially never gets satisfied?
Nope. We either do not add a livenessProbe, or add an empty livenessProbe which always evaluates to true. Based on how @lucferbux configured the deployment, the readinessProbe will check for a service running on port 8081, which is started when JupyterHub actually starts, and thanks to the leader election JupyterHub only starts when the pod is the leader. So we will always have exactly one active pod with both containers running, and 2 standby pods waiting in a not-ready state to get the lock and become the leader.
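As a sketch, the readiness check boils down to something like this exec-style probe command (shown for illustration only; the manifest in the linked PR may well use a tcpSocket probe on 8081 instead):

```sh
# Succeeds only once something is listening on 8081 (bash's /dev/tcp),
# i.e. once this pod has won the election and actually started JupyterHub.
# Standby pods stay not-ready, so the Service sends them no traffic.
(exec 3<>/dev/tcp/localhost/8081) 2>/dev/null
```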
Once we add traefik into the mix, this would mean user pods will be accessible no matter what, and JupyterHub will have at most ~12 seconds of downtime: 6 seconds for the standby to go through the leadership check and approximately 6 more (as per my testing on OSD) for the standby pod to start running and become the leader.
I haven't tested this with traefik yet, but the only thing we need to look into is the ConfigProxy behaviour above, where the same route keeps being re-added every few seconds:

```
08:25:56.490 [ConfigProxy] info: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] info: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] info: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] info: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
```

Why is it adding the route again and again, and will this behaviour continue once we have traefik in the mix?
> Why is it adding the route again and again, and will this behaviour continue once we have traefik in the mix?

That's from the publish service, which we are removing, so don't worry about it :)
perfect!
/lgtm
Awesome job
Related Issues and Dependencies
RHODS-767
This introduces a breaking change

This Pull Request implements

Added an s2i run step to check whether the image is the leader: by checking the port opened by the sidecar leader container, only the leader starts the execution.

Description

Based on the work of Sean, this adds the check method for the Kubernetes leader election pattern.

This PR is linked to the "Added sidecar leader election on JH" PR.
Testing
This image could be built with the following command:
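For example (registry, image name, and tag are placeholders):

```sh
# Hypothetical build invocation; substitute your own registry and tag.
podman build -t quay.io/<your-org>/jupyterhub-ha:latest .
```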
For further info, you can check this test log.