opendatahub-io / kubeflow

Machine Learning Toolkit for Kubernetes
Apache License 2.0
10 stars 32 forks source link

Long Running Notebook Testing Support #151

Open harshad16 opened 1 year ago

harshad16 commented 1 year ago

/kind feature

Why you need this feature:

With Notebook-controller changes, we need to be careful with making changes to the reconciliation loop, the change would result in restarting all the Notebooks already running in a cluster. As a user of the notebook, might be working/not saving work when the upgrade happens, so this makes it a reason for concern.

Describe the solution you'd like:

Include Tests for checking long-running notebooks to catch changes that would cause such issues.

Anything else you would like to add:

Previously, we have fallen into the trap as we were missing this kind of testing. Example: Original inclusion: https://github.com/opendatahub-io/kubeflow/commit/41baf497ee3559890a6a5fa1f4ba4eb4a69c61a2 reverted due to the restart of notebooks: https://github.com/opendatahub-io/kubeflow/commit/588142611a3c6cbbe72a5ad7ca66f2aedab3dc22

shalberd commented 1 year ago

the change would result in restarting all the Notebooks already running in a cluster

It depends on which change you mean.

Regarding the new image change trigger annotation in the notebook CR metadata:

@atheo89 @harshad16 No, it would not. Or, to be more precise: the coming change for auto-fill in of image field value when a Notebook and StatefulSet has the image content change trigger annotation is not critical here. The image change trigger annotation is only added on newly-created notebooks, not to old notebooks.

The controller only changes reconcile logic when annotation is set https://github.com/opendatahub-io/kubeflow/pull/133/files#diff-a436986c5e93996f9035eee6798d064ab8659957e8443eb44b3a4d3db289bde4R207

About the annotation: Done in odh-dashboard code for newly-created notebook CRs via odh-dashboard PR-800. Now, if someone were to deliberately modify an existing notebook via ODH dashboard, setting a different underlying imagestream name and tag, only then would the odh-dashboard annotation be added in a notebook CR. But I don't think that is your use case with long-running notebooks and the notebook controller.

Regarding PodSpec changes done by odh-notebook-controller:

Regarding any changes in the PodSpec, you are correct, though. When for example any argument of the ose-oauth container is modified via odh-notebook-controller, that does have an impact on notebook restarts, independent of the image change trigger annotation mechanism. You could exclude the ose-oauth-section PodSpec from DeepCompare in kubeflow-notebook-controller as well, but I don't think that is a good idea. Better way would be to only change redirectUrl on newly-created notebooks, not on already existing ones (probably odh-notebook-controller code).

@harshad16 @atheo89 regarding long-running notebooks: The fact that a container restarts when any of its parts are modified in the PodSpec is not possible to avoid, I think, in Kubernetes and Openshift terms. You'd have to selectively set the Pod restartPolicy to Never, which opens another can of worms. https://stackoverflow.com/questions/53262133/kubernetes-stop-containers-from-restarting-automatically. The previous change to add shared memory specific volumes and volumemounts to the podSpec was such a case, of course that led to Pod / container restarts.

It really depends on what the team working with long-running notebooks wants and how deep its integration with ODH dashboard is wished to be. When ODH dashboard assembles a notebook CR, it does not set any specific value for the podTemplate restartPolicy, meaning behavior is "Always" https://github.com/opendatahub-io/odh-dashboard/blob/10818c8cd977f46909a29809474504b5e942e6a4/frontend/src/api/k8s/notebooks.ts#L119

harshad16 commented 1 year ago

@shalberd this issue is for a general test support, for each PR. The long running notebook get restarted when there is any change to statefulsets or PodSpec upgrade. As noted in comment as well.

As pointed in the description , this was opened in response to the some changes , we made which caused issues with the long running notebook and we had to revert. and is not related to changes in annotations.