jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

Sparkmagic cannot detect session from YARN cluster. #511

Open rodmaz opened 5 years ago

rodmaz commented 5 years ago

This is a weird issue.

We are running an AWS EMR (5.20.0) cluster with Hadoop, Spark, Livy and JupyterHub. The cluster is working fine, and Livy is also working fine (we can submit and query jobs without authentication).

However, whenever we start a notebook using the PySpark3 kernel, the following error occurs:

[screenshot: error message]

However, Livy does start the job on the cluster, as the Livy session log shows:

19/02/06 16:56:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(livy, rodrigo); groups with view permissions: Set(); users  with modify permissions: Set(livy, rodrigo); groups with modify permissions: Set()
19/02/06 16:56:18 INFO Client: Submitting application application_1549469807970_0001 to ResourceManager
19/02/06 16:56:18 INFO YarnClientImpl: Submitted application application_1549469807970_0001
19/02/06 16:56:18 INFO Client: Application report for application_1549469807970_0001 (state: ACCEPTED)
19/02/06 16:56:18 INFO Client: 
     client token: N/A
     diagnostics: [Wed Feb 06 16:56:18 +0000 2019] Application is Activated, waiting for resources to be assigned for AM.  Details : AM Partition = CORE ; Partition Resource = <memory:4096, vCores:8> ; Queue's Absolute capacity = 100.0 % ; Queue's Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; 
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1549472178421
     final status: UNDEFINED
     tracking URL: http://<redacted>:20888/proxy/application_1549469807970_0001/
     user: rodrigo
19/02/06 16:56:18 INFO ShutdownHookManager: Shutdown hook called
19/02/06 16:56:18 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-3fd2fc39-be1e-47c4-8001-89ba5f09f35e
19/02/06 16:56:18 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dee10424-9993-45ec-98f1-af35fa4b57f7

The application is also running on Hadoop YARN/Spark, as we can see:

[screenshot: Hadoop YARN ResourceManager showing the running application]

Also, looking inside the Docker container running JupyterHub, we see no errors in the log:

[D 2019-02-06 17:19:54.432 SingleUserNotebookApp kernelmanager:392] activity on ae2e785d-0803-49fc-bea9-fcd553ed9d4c: status
[D 2019-02-06 17:19:54.437 SingleUserNotebookApp kernelmanager:392] activity on ae2e785d-0803-49fc-bea9-fcd553ed9d4c: execute_input
[D 2019-02-06 17:19:54.516 SingleUserNotebookApp kernelmanager:392] activity on ae2e785d-0803-49fc-bea9-fcd553ed9d4c: stream
[D 2019-02-06 17:19:55.573 SingleUserNotebookApp auth:778] Allowing whitelisted Hub user rodrigo
[D 2019-02-06 17:19:55.600 SingleUserNotebookApp log:158] 200 GET /user/rodrigo/static/base/images/favicon-busy-1.ico (rodrigo@::ffff:<IP-redacted>) 29.40ms
[D 2019-02-06 17:20:45.719 SingleUserNotebookApp auth:778] Allowing whitelisted Hub user rodrigo
[D 2019-02-06 17:20:45.721 SingleUserNotebookApp genericmanager:72] S3contents.GenericManager.get] path('/TestApp.ipynb') type(None) format(None)
[D 2019-02-06 17:20:45.721 SingleUserNotebookApp genericmanager:93] S3contents.GenericManager.get_notebook: path('TestApp.ipynb') type(0) format(None)
[D 2019-02-06 17:20:45.797 SingleUserNotebookApp s3_fs:92] S3contents.S3FS: `<bucket-redacted>/jupyter/rodrigo/TestApp.ipynb` is a file: True
[I 2019-02-06 17:20:45.821 SingleUserNotebookApp log:158] 200 GET /user/rodrigo/api/contents/TestApp.ipynb?content=0&_=1549472083068 (rodrigo@::ffff:172.50.107.145) 103.17ms
[D 2019-02-06 17:20:45.982 SingleUserNotebookApp auth:778] Allowing whitelisted Hub user rodrigo
[D 2019-02-06 17:20:45.983 SingleUserNotebookApp genericmanager:62] S3contents.GenericManager.file_exists: ('/TestApp.ipynb')
[D 2019-02-06 17:20:46.058 SingleUserNotebookApp s3_fs:92] S3contents.S3FS: `<bucket-redacted>/jupyter/rodrigo/TestApp.ipynb` is a file: True
[I 2019-02-06 17:20:46.059 SingleUserNotebookApp handlers:164] Saving file at /TestApp.ipynb
[D 2019-02-06 17:20:46.059 SingleUserNotebookApp genericmanager:182] S3contents.GenericManager: save {'type': 'notebook', 'content': {'cells': [{'metadata': {'trusted': True, 'scrolled': True}, 'cell_type': 'code', 'source': 'print("test")', 'execution_count': None, 'outputs': [{'output_type': 'stream', 'text': 'Starting Spark application\n', 'name': 'stdout'}]}, {'metadata': {'trusted': True}, 'cell_type': 'code', 'source': '', 'execution_count': None, 'outputs': []}], 'metadata': {'kernelspec': {'name': 'pyspark3kernel', 'display_name': 'PySpark3', 'language': ''}, 'language_info': {'name': 'pyspark3', 'mimetype': 'text/x-python', 'codemirror_mode': {'name': 'python', 'version': 3}, 'pygments_lexer': 'python3'}}, 'nbformat': 4, 'nbformat_minor': 2}}: '/TestApp.ipynb'
[D 2019-02-06 17:20:46.134 SingleUserNotebookApp s3_fs:185] S3contents.S3FS: Writing notebook: `<bucket-redacted>/jupyter/rodrigo/TestApp.ipynb`
[D 2019-02-06 17:20:46.186 SingleUserNotebookApp genericmanager:72] S3contents.GenericManager.get] path('/TestApp.ipynb') type(notebook) format(None)
[D 2019-02-06 17:20:46.186 SingleUserNotebookApp genericmanager:93] S3contents.GenericManager.get_notebook: path('TestApp.ipynb') type(False) format(None)
[D 2019-02-06 17:20:46.254 SingleUserNotebookApp s3_fs:92] S3contents.S3FS: `com-useaurea--sa-east-1-emr-apps-development/jupyter/rodrigo/TestApp.ipynb` is a file: True
[I 2019-02-06 17:20:46.275 SingleUserNotebookApp log:158] 200 PUT /user/rodrigo/api/contents/TestApp.ipynb (rodrigo@::ffff:<IP-redacted>) 294.00ms
[D 2019-02-06 17:20:56.417 SingleUserNotebookApp kernelmanager:392] activity on ae2e785d-0803-49fc-bea9-fcd553ed9d4c: stream
[D 2019-02-06 17:20:56.426 SingleUserNotebookApp kernelmanager:392] activity on ae2e785d-0803-49fc-bea9-fcd553ed9d4c: status

Any ideas why JupyterHub and Sparkmagic cannot detect the Spark session that was created successfully? This problem makes it impossible to run Jupyter notebooks on our cluster. Thanks.

apetresc commented 5 years ago

Hmm, my first thought is that this is a problem on Livy's end. Could you try creating an empty session directly through the Livy REST API's /sessions endpoint, using curl or something similar? If the same thing happens there, I think you'll have to file the bug against them.
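
For example, something along these lines (a minimal sketch, assuming Livy is reachable on its default port 8998 with authentication disabled, and <livy-host> standing in for your master node):

import time
import requests

LIVY_URL = "http://<livy-host>:8998"  # assumption: default Livy port, no auth

# Ask Livy to create an empty PySpark session directly, bypassing sparkmagic
resp = requests.post(LIVY_URL + "/sessions", json={"kind": "pyspark"})
resp.raise_for_status()
session_id = resp.json()["id"]

# Poll the session until it leaves the 'starting' state
while True:
    state = requests.get(LIVY_URL + "/sessions/" + str(session_id)).json()["state"]
    print(state)
    if state != "starting":
        break
    time.sleep(5)

If the session reaches idle here but sparkmagic still fails, the problem is more likely on the sparkmagic/JupyterHub side than on Livy's.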

ericdill commented 5 years ago

Can you post your container logs? Those are usually pretty instructive as to what's going on when things don't go as expected. Another thing to try: can you start a pyspark shell or spark-shell locally on the instance where your Livy server is running?

Tagar commented 5 years ago

Perhaps it took more than 60 seconds to create the Spark session / YARN application in your cluster?

Sparkmagic waits for just 60 seconds by default, which may not be enough on very busy clusters where YARN has to wait for resource preemption to reclaim resources from other queues.

On the sparkmagic side:

import sparkmagic.utils.configuration as livy_conf
# Raise the session startup timeout from the default 60 seconds to 300 seconds
livy_conf.override(livy_conf.livy_session_startup_timeout_seconds.__name__, 300)
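
A quick way to confirm the override took effect (a sketch, assuming the usual sparkmagic behaviour where configuration entries are callables returning the currently effective value):

import sparkmagic.utils.configuration as livy_conf

# Should print 300 once the override above has been applied (assumed behaviour)
print(livy_conf.livy_session_startup_timeout_seconds())

Note that this needs to run in the process where sparkmagic itself executes (for example a %%local cell in a wrapper kernel), not in code that gets shipped to the remote Spark session.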

On the Livy side, add the following to livy.conf:

livy.server.yarn.app-lookup-timeout = 300s

sunayansaikia commented 1 year ago

@rodmaz -- were you able to find a solution to this?