radanalyticsio / spark-operator

Operator for managing the Spark clusters on Kubernetes and OpenShift.
Apache License 2.0

Spark operator crashes if it fails to hit spark-cluster service #284

Open erikerlandson opened 4 years ago

erikerlandson commented 4 years ago

I made the mistake of manually deleting some Spark cluster pods that had been spun up by ODH via this operator. The Spark operator got a REST call failure when it tried to hit the corresponding Spark master service endpoint during its reconciliation pass, and the entire operator crashed. Once that happened, it also failed to respond to new cluster requests coming from ODH, as if it never made it to the "new cluster" part of the reconciliation pass.

Anyway, we should make the various operations in the reconciliation pass error tolerant: if the operator hits an exception on one operation, it can isolate that in a try/catch and move on to gracefully reconcile all the other Spark clusters it is managing.
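
To make the suggestion concrete, here is a minimal sketch of what per-cluster isolation could look like. It is not taken from the operator's code: `ReconcileSketch`, `reconcileAll`, and the cluster names are all hypothetical, and the `Consumer<String>` just stands in for whatever the operator really does for one cluster (including the REST call to the master service).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names come from the spark-operator code base. */
final class ReconcileSketch {

    /**
     * Runs reconcileOne for every cluster name. A failure on one cluster is
     * caught and recorded, so the loop still reaches the remaining clusters.
     * Returns the failures keyed by cluster name, in iteration order.
     */
    static Map<String, Exception> reconcileAll(List<String> clusterNames,
                                               Consumer<String> reconcileOne) {
        Map<String, Exception> failures = new LinkedHashMap<>();
        for (String name : clusterNames) {
            try {
                reconcileOne.accept(name);
            } catch (Exception e) {
                // Isolate the failure: remember it and carry on with the next cluster.
                failures.put(name, e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> reconciled = new ArrayList<>();

        // Stand-in for the real per-cluster work; one cluster simulates the
        // failed REST call against a master service whose pods were deleted.
        Consumer<String> reconcileOne = name -> {
            if (name.equals("deleted-cluster")) {
                throw new IllegalStateException("REST call to master service failed");
            }
            reconciled.add(name);
        };

        Map<String, Exception> failures = reconcileAll(
                Arrays.asList("cluster-a", "deleted-cluster", "new-cluster"),
                reconcileOne);

        System.out.println("reconciled: " + reconciled);
        System.out.println("failed: " + failures.keySet());
    }
}
```

With this shape, the cluster whose pods were deleted ends up in the failure map, while the clusters after it in the list, including a newly requested one, are still processed in the same pass. Returning the failures instead of only logging them would also let the caller decide whether to retry or report them.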