kubeflow / kfctl

kfctl is a CLI for deploying and managing Kubeflow
Apache License 2.0

possible race/stall condition in operator when namespace terminating #465

Open thoraxe opened 4 years ago

thoraxe commented 4 years ago

Scenario:

1. Create namespace ns1
2. Create a kfdef in ns1
3. Delete namespace ns1, which puts it into Terminating status
4. Create namespace ns2
5. Create a kfdef in ns2

Result: The operator stalls (or races) on resources and keeps retrying against the terminating namespace. For example:

time="2020-12-02T14:34:11Z" level=warning msg="Encountered error applying application jupyterhub:  (kubeflow.error): Code 500 with message: Apply.Run : [error when creating \"/tmp/kout449770606\": secrets \"jupyterhub\" is forbidden: unable to create new content in namespace workflowsz because it is being terminated, error when creating \"/tmp/kout449770606\": configmaps \"parameters\" is forbidden: unable to create new content in namespace workflowsz because it is being terminated, error when creating \"/tmp/kout449770606\": configmaps \"spark-cluster-template\" is forbidden: unable to create new content in namespace workflowsz because it is being terminated, error when creating \"/tmp/kout449770606\": configmaps \"jupyter-singleuser-profiles\" is forbidden: unable to create new content in namespace workflowsz because it is being terminated, error when creating \"/tmp/kout449770606\": configmaps \"jupyterhub-cfg\" is forbidden: unable to create new content in namespace workflowsz because it is being termi...
time="2020-12-02T14:34:11Z" level=warning msg="Will retry in 19 seconds."
time="2020-12-02T14:34:31Z" level=warning msg="Encountered error applying application jupyterhub:  (kubeflow.error): Code 500 with message: Apply.Run : [error when creating \"/tmp/kout516991221\": secrets \"jupyterhub\" is forbidden: unable to create new content in namespace workflowsz because it is being terminated, error when creating \"/tmp/kout516991221\": configmaps \"parameters\" is forbidden: unable to create new content in namespace workflowsz because it is being terminated, error when creating \"/tmp/kout516991221\": configmaps \"spark-cluster-template\" is forbidden: unable to create new content in namespace workflowsz because it is being terminated, error when creating \"/tmp/kout516991221\": configmaps \"jupyter-singleuser-profiles\" is forbidden: unable to create new content in namespace workflowsz because it is being terminated, error when creating \"/tmp/kout516991221\": configmaps \"jupyterhub-cfg\" is forbidden: unable to create new content in namespace workflowsz because it is being termi...
time="2020-12-02T14:34:31Z" level=warning msg="Will retry in 38 seconds."

For whatever reason, the kfdef hangs around in ns1. As soon as I deleted the kfdef from ns1, the operator got unblocked, happily deleted the resources, and then finally created the resources in ns2.

In other words, the operator was stuck waiting for ns1/kfdef to be happy before it would go on to create the resources in ns2.

Suggestion: If the operator notices that a namespace is in the Terminating state, it should delete the kfdef in that namespace, which would then cause the operator to delete the associated resources, and so on.
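
To make the first suggestion concrete, here is a minimal sketch of the check, written as a stdlib-only Python simulation rather than against the real operator code. `Cluster`, `Namespace`, and `reconcile` are hypothetical names invented for illustration; the only detail taken from Kubernetes is that a namespace being deleted reports the phase `Terminating`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Namespace:
    name: str
    phase: str = "Active"  # Kubernetes reports "Active" or "Terminating"


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for the API server."""
    namespaces: Dict[str, Namespace] = field(default_factory=dict)
    # (namespace, kfdef name) -> spec
    kfdefs: Dict[Tuple[str, str], dict] = field(default_factory=dict)
    applied: List[Tuple[str, str]] = field(default_factory=list)


def reconcile(cluster: Cluster, namespace: str, name: str) -> str:
    """Reconcile one kfdef and report what was done."""
    ns = cluster.namespaces.get(namespace)
    if ns is None or ns.phase == "Terminating":
        # Creating content here would be rejected ("unable to create new
        # content in namespace ... because it is being terminated"), so
        # drop the kfdef instead of retrying the apply.
        cluster.kfdefs.pop((namespace, name), None)
        return "deleted"
    cluster.applied.append((namespace, name))
    return "applied"


# Usage: ns1 is terminating, ns2 is healthy.
cluster = Cluster()
cluster.namespaces["ns1"] = Namespace("ns1", phase="Terminating")
cluster.namespaces["ns2"] = Namespace("ns2")
cluster.kfdefs[("ns1", "kf")] = {}
cluster.kfdefs[("ns2", "kf")] = {}

results = [reconcile(cluster, ns, name) for ns, name in list(cluster.kfdefs)]
print(results)
```

In this sketch the kfdef in ns1 is removed without any apply attempt, and the kfdef in ns2 is applied in the same pass.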

OR

Don't let errors in one namespace block the operator from working on other namespaces.
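
The second option could look roughly like the sketch below: each kfdef gets its own retry schedule, so a failing one is pushed back in time while the others are still processed. This is a simulation under stated assumptions, not the operator's actual queueing code. `run_queue` and `fake_apply` are hypothetical names, time is simulated rather than slept, and the doubling delay starting at 19 seconds only mirrors the "Will retry in 19 seconds" / "Will retry in 38 seconds" lines in the log above.

```python
import heapq
from typing import Callable, List, Tuple


def run_queue(
    keys: List[str],
    apply: Callable[[str], None],
    base_delay: float = 19.0,
    max_attempts: int = 3,
) -> List[Tuple[float, str, str]]:
    """Process keys in simulated time.

    A key whose apply fails is rescheduled with a doubled delay; it never
    blocks the other keys waiting in the queue.
    """
    # Heap entries: (ready_time, sequence, key, attempt)
    heap = [(0.0, i, key, 1) for i, key in enumerate(keys)]
    heapq.heapify(heap)
    seq = len(heap)
    events = []
    while heap:
        now, _, key, attempt = heapq.heappop(heap)
        try:
            apply(key)
            events.append((now, key, "ok"))
        except Exception:
            if attempt >= max_attempts:
                events.append((now, key, "gave up"))
                continue
            delay = base_delay * 2 ** (attempt - 1)
            events.append((now, key, f"retry in {delay:g}s"))
            heapq.heappush(heap, (now + delay, seq, key, attempt + 1))
            seq += 1
    return events


def fake_apply(key: str) -> None:
    # Hypothetical failure: anything in ns1 is rejected, as in the log.
    if key.startswith("ns1/"):
        raise RuntimeError("namespace is being terminated")


events = run_queue(["ns1/kf", "ns2/kf"], fake_apply)
for event in events:
    print(event)
```

Here `ns2/kf` succeeds at simulated time 0 even though `ns1/kf` fails first and keeps failing on its own schedule.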