jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

sparkmagic sometimes fails #415

Open RoelantStegmann opened 6 years ago

RoelantStegmann commented 6 years ago

I made a process using sparkmagic to communicate with an Azure cluster - very happy with this.

I then used nbrun (https://github.com/tritemio/nbrun) to automate the process. The main process spins up a cluster, adds some parameters at the beginning of the notebook, runs it, and finally shuts the cluster down.

This almost always works. Occasionally, though, the notebook doesn't run at all because the kernel fails to start correctly. See the error below. What could be the cause of this error?

[image: sparkmagic]

aggFTW commented 6 years ago

I think the issue here might be that the cluster is under load, and that the Livy session is not starting up because the cluster doesn't have enough resources to allocate to it. One way to verify this is to open the YARN UI and check resource usage when the failure occurs: https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-job-debugging#track-an-application-in-the-yarn-ui
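Since the process here is automated, the same check could in principle be scripted rather than done in the UI. The following is only a rough sketch: it assumes the YARN ResourceManager REST endpoint `/ws/v1/cluster/metrics` is reachable at some base URL (`rm_url` is a placeholder), it leaves out the authentication an HDInsight gateway would require, and the helper names and the memory/vcore thresholds are made up for illustration.

```python
import json
import urllib.request


def fetch_cluster_metrics(rm_url, opener=urllib.request.urlopen):
    """Return the 'clusterMetrics' object from the ResourceManager REST API.

    rm_url is a placeholder base URL; authentication is deliberately omitted.
    """
    url = f"{rm_url.rstrip('/')}/ws/v1/cluster/metrics"
    with opener(url, timeout=30) as resp:
        return json.load(resp)["clusterMetrics"]


def has_capacity(metrics, needed_mb, needed_vcores):
    """True if YARN reports at least the requested free memory and vcores."""
    return (
        metrics["availableMB"] >= needed_mb
        and metrics["availableVirtualCores"] >= needed_vcores
    )


def describe(metrics):
    """One-line summary suitable for logging right before the notebook runs."""
    return (
        f"free: {metrics['availableMB']} MB, "
        f"{metrics['availableVirtualCores']} vcores, "
        f"pending apps: {metrics['appsPending']}"
    )
```

Logging `describe(fetch_cluster_metrics(rm_url))` just before the notebook is launched would show whether the failed runs coincide with low free resources or pending applications.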

RoelantStegmann commented 6 years ago

Hmm, this task is started right after the cluster is deployed (the process even sleeps 10 minutes in between to really make sure the cluster is ready). No other tasks run on this cluster...

Now when the notebook does start, it runs into this error at some point:

[image]

aggFTW commented 6 years ago

That error means the cluster failed to respond to the requests the notebook was making. Either Livy is failing to respond, the cluster has been deleted, or there is a network connectivity problem (or something similar).
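To tell these cases apart from the automation script, a small probe against the Livy endpoint might help. This is a sketch under assumptions, not part of sparkmagic: `probe_livy` is a hypothetical helper, the URL is whatever Livy address the notebook is configured with, and no credentials are sent, so a gateway that requires authentication would answer with an HTTP error status (which still shows that something is listening).

```python
import socket
import urllib.error
import urllib.request


def probe_livy(url, opener=urllib.request.urlopen, timeout=10):
    """Classify what happens when the given Livy URL is requested once."""
    try:
        with opener(url, timeout=timeout) as resp:
            return f"reachable (HTTP {resp.status})"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 401 or 404).
        return f"server answered with HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # DNS failure, refused connection, deleted cluster, and similar.
        return f"network error: {exc.reason}"
    except (socket.timeout, TimeoutError):
        # The connection was made but no response arrived in time.
        return "timed out"
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would hide the status code.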

aggFTW commented 6 years ago

When the session fails to start, look at the YARN logs for the Livy/Spark application to see why it's failing. They should tell you what's happening and whether any errors were encountered.
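Part of this can be captured automatically through Livy's REST API, which exposes `GET /sessions/{id}/state` and `GET /sessions/{id}/log`. The sketch below polls a session until it becomes idle or fails and, on failure, returns the log lines Livy has for it. The function names, polling interval, and the set of states treated as failures are assumptions chosen for illustration; `get_json` sends no credentials, which a real HDInsight endpoint would need. The full YARN application logs still have to be inspected separately.

```python
import json
import time
import urllib.request

# Session states treated as "did not start" in this sketch.
FAILED_STATES = {"error", "dead", "killed", "shutting_down"}


def get_json(url):
    """Plain unauthenticated GET returning parsed JSON (assumption: no auth)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_for_session(livy_url, session_id, fetch=get_json,
                     poll_seconds=5, max_polls=60, sleep=time.sleep):
    """Poll a Livy session; return (state, log_lines).

    log_lines is non-empty only when the session ended in a failed state.
    """
    base = f"{livy_url.rstrip('/')}/sessions/{session_id}"
    state = "unknown"
    for _ in range(max_polls):
        state = fetch(f"{base}/state")["state"]
        if state == "idle":
            return state, []
        if state in FAILED_STATES:
            # Ask Livy for the first 100 log lines of the failed session.
            log = fetch(f"{base}/log?from=0&size=100")["log"]
            return state, log
        sleep(poll_seconds)
    # Still starting after max_polls attempts; report the last state seen.
    return state, []
```

Writing the returned log lines to the automation's own log on every failed run would make the intermittent failures easier to compare afterwards.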