Open RoelantStegmann opened 6 years ago
I think the issue here might be that the cluster is under load and the Livy session is not starting because the cluster doesn't have enough resources to allocate to it. One way to verify is to open the YARN UI and check resource usage when this fails: https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-job-debugging#track-an-application-in-the-yarn-ui
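If you'd rather check this from the automation script than by eye, something like the rough sketch below could work. It reads the YARN ResourceManager REST endpoint `/ws/v1/cluster/metrics` and checks for free memory and vcores. The `base_url`, the credentials, basic auth through the cluster gateway, and the `needed_mb` / `needed_vcores` thresholds are all assumptions on my side, so adjust them to however your cluster exposes YARN.

```python
import base64
import json
import urllib.request


def fetch_cluster_metrics(base_url, user, password, timeout=30):
    """Return the `clusterMetrics` dict from the YARN ResourceManager REST API.

    `base_url` is a placeholder for wherever your cluster exposes the
    ResourceManager (assumption: reachable over HTTPS with basic auth).
    """
    request = urllib.request.Request(base_url.rstrip("/") + "/ws/v1/cluster/metrics")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    request.add_header("Accept", "application/json")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["clusterMetrics"]


def has_headroom(metrics, needed_mb, needed_vcores):
    """True if YARN reports enough free memory and vcores and no queued apps.

    The thresholds are hypothetical; pick values that match the driver and
    executor sizes in your sparkmagic session config.
    """
    return (
        metrics.get("availableMB", 0) >= needed_mb
        and metrics.get("availableVirtualCores", 0) >= needed_vcores
        and metrics.get("appsPending", 0) == 0
    )
```

Calling `has_headroom(fetch_cluster_metrics(...), 4096, 2)` right before the notebook starts, and logging the metrics when it returns `False`, would show whether the failures line up with a lack of resources.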
Hmm, this task is started right after the cluster is deployed (the process even sleeps 10 minutes in between to really make sure the cluster is ready). No other tasks run on this cluster...
Now when the notebook does start, it runs into this error at some point:
That error means that the cluster failed to respond to the requests the notebook was making. Either Livy is failing to respond, the cluster has been deleted, or there is a network connectivity problem of some kind.
When the session fails to start, look at the YARN logs for the Livy/Spark application to see why it's failing. They should show what's happening and any errors that were encountered.
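For an unattended run, a sketch along these lines could grab the session state and the tail of the log through the Livy REST API (`GET /sessions/{id}/state` and `GET /sessions/{id}/log`) as soon as the start fails. The `base_url`, `headers`, and the helper names are made up for illustration, and I'm assuming you know the session id that sparkmagic created.

```python
import json
import urllib.request

# Livy session states that mean the session will not become usable.
FAILED_STATES = {"error", "dead", "killed", "shutting_down"}
# States in which the session has started and accepts statements.
STARTED_STATES = {"idle", "busy"}


def classify_state(state):
    """Map a Livy session state to 'started', 'failed', or 'starting'."""
    if state in STARTED_STATES:
        return "started"
    if state in FAILED_STATES:
        return "failed"
    return "starting"  # e.g. not_started, starting


def livy_get(base_url, path, headers=None, timeout=30):
    """GET a JSON document from the Livy endpoint (base_url is a placeholder)."""
    request = urllib.request.Request(base_url.rstrip("/") + path, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def diagnose_session(base_url, session_id, headers=None):
    """Return (verdict, state, log_lines); log lines are fetched only on failure."""
    state = livy_get(base_url, f"/sessions/{session_id}/state", headers)["state"]
    verdict = classify_state(state)
    log_lines = []
    if verdict == "failed":
        log = livy_get(base_url, f"/sessions/{session_id}/log?from=0&size=100", headers)
        log_lines = log.get("log", [])
    return verdict, state, log_lines
```

The Livy log usually includes the YARN application id, which you can then use to pull the full application logs from YARN.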
I made a process using sparkmagic to communicate with an Azure cluster - very happy with this.
I then used nbrun (https://github.com/tritemio/nbrun) to automate the process. The main process spins up a cluster, adds some parameters at the beginning of the notebook, runs it, and finally shuts the cluster down.
This almost always works. Occasionally, though, the notebook doesn't run at all because the kernel didn't start up correctly. See the error below. What could be the cause of this error?