Closed: dnks23 closed this issue 4 years ago
Hi @dnks23. I believe the "bloated" image you're referring to is `enterprise-gateway-demo`, which is used for demonstration and integration tests and, yes, contains a complete Hadoop/Spark server. The image that would apply to this scenario is `enterprise-gateway`.
This is similar to #814 but with fewer restrictions. It's similar in that the launched kernels must communicate their connection information back to EG on the response address sent to them during launch. Today, each kernel launch responds to a different port hosted by EG. In EG 3.0 I hope to have a single response address to use so that it can be "published" in these kinds of environments.
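The handshake being described (the launched kernel connects back to EG's per-launch response address and sends its connection information) can be sketched minimally as follows. This is illustrative only: the payload keys are hypothetical and EG's real exchange additionally encrypts the payload.

```python
import json
import socket
import threading

def listen_for_connection_info(sock):
    """'EG' side: accept one connection and read the kernel's connection info."""
    conn, _ = sock.accept()
    with conn:
        data = conn.recv(4096)
    return json.loads(data.decode())

# EG opens a per-launch response socket on an ephemeral port.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
response_address = f"127.0.0.1:{server.getsockname()[1]}"

result = {}
t = threading.Thread(target=lambda: result.update(listen_for_connection_info(server)))
t.start()

# Remote-launcher side: connect back to the response address and send the info.
host, port = response_address.split(":")
info = {"kernel_id": "410c5e3b", "shell_port": 52000, "ip": "10.0.0.7"}
with socket.create_connection((host, int(port))) as c:
    c.sendall(json.dumps(info).encode())

t.join()
server.close()
print(result["kernel_id"])  # the connection info made it back to "EG"
```

If the launcher cannot reach the response address (firewalling, docker-internal IPs), the listening side simply waits until the launch timeout expires, which is the failure mode discussed in this thread.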
I suspect you could circumvent this issue by setting up host networking so that the response-address port is accessible from the regular network. You may need to set the `EG_PROHIBITED_LOCAL_IPS` environment variable so that the internal docker IP (`192.*`) doesn't get produced in the response-address calculation. We also have an `EG_RESPONSE_IP` env that could be used to pin the IP to which the remote kernels send their connection information.
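A rough sketch of the selection logic just described (an illustration of the idea only, not EG's actual implementation; the helper name is made up):

```python
import fnmatch
import os

def select_response_ip(candidate_ips, env=os.environ):
    """Pick the IP used in the response address: an explicit EG_RESPONSE_IP
    wins; otherwise take the first candidate that does not match any
    prohibited pattern (assumed comma-separated wildcards like '192.*')."""
    pinned = env.get("EG_RESPONSE_IP")
    if pinned:
        return pinned
    prohibited = env.get("EG_PROHIBITED_LOCAL_IPS", "").split(",")
    for ip in candidate_ips:
        if not any(fnmatch.fnmatch(ip, pat) for pat in prohibited if pat):
            return ip
    return None

# With the internal docker IP prohibited, the host-reachable IP is chosen.
ips = ["192.168.100.2", "10.132.0.25"]
print(select_response_ip(ips, {"EG_PROHIBITED_LOCAL_IPS": "192.*"}))
```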
Please give it a try and let us know how you progress. Thanks for your interest.
Thanks for the good reply @kevin-bates.
So I spent some days trying to get this to work. It looks like it kind of does now by passing `EG_RESPONSE_IP` in the kernel.json. BTW: is this the right place to set this? Can it also be done on a global basis rather than at the kernel level?
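For reference, the kernel-level placement being described would look something like this in the kernelspec's kernel.json (a trimmed, hypothetical fragment; only the `env` stanza matters for this question):

```json
{
  "language": "python",
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "metadata": {
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
    }
  },
  "env": {
    "EG_RESPONSE_IP": "10.132.0.25"
  }
}
```

Since `EG_`-prefixed variables are read by the Enterprise Gateway process, exporting the variable in the gateway's own environment (e.g., the container environment) should also work and would apply globally rather than per kernelspec.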
However, I am still facing one issue:
I connect with JupyterLab (from my local machine) to the gateway using the `--gateway-url` option. I can select the kernel to start, the application gets scheduled in YARN and goes to the `ACCEPTED` state, and the logs look good, but then suddenly I get:
```
[W 2020-09-11 14:19:10.303 EnterpriseGatewayApp] Termination of application 'application_1599816470122_0023' failed with exception: 'Response finished with status: 405. Details: '. Continuing...
[D 2020-09-11 14:19:15.358 EnterpriseGatewayApp] YarnClusterProcessProxy.kill, application ID: application_1599816470122_0023, kernel ID: 410c5e3b-60ae-4a29-90bb-abe4659cec4f, state: ACCEPTED, result: <coroutine object BaseProcessProxyABC.kill at 0x7ff3481fe200>
[D 2020-09-11 14:19:15.358 EnterpriseGatewayApp] response socket still open, close it
[E 2020-09-11 14:19:15.359 EnterpriseGatewayApp] KernelID: '410c5e3b-60ae-4a29-90bb-abe4659cec4f' launch timeout due to: YARN resources unavailable after 46.0 seconds for app application_1599816470122_0023, launch timeout: 40.0! Check YARN configuration.
```
The notebook throws the error on kernel startup, but what I see in the YARN UI is that the application is running!
I played a bit with `EG_KERNEL_LAUNCH_TIMEOUT` but couldn't get this to work. Any hints on this maybe?
Thanks in advance!
Generally speaking, all envs prefixed with `EG_` are read and processed by the Enterprise Gateway, not by kernels. When `EG_RESPONSE_IP` is set, rather than using the local IP determined from within EG, it "blindly" uses that IP to formulate the value for the `--RemoteProcessProxy.response-address` option used during the launch of each remote kernel. Because your EG is running within a container (and not on the local host), you will likely need to set it to the host's IP, since what is determined programmatically will likely be a docker IP. If your current response address is `192.*`, you'll likely need to add that wildcarded value to `EG_PROHIBITED_LOCAL_IPS`.
What is likely happening is that the launched kernel is unable to respond back to EG with its connection information on this response address. Since the YARN application is essentially "on its own" at this point (due to other issues) and not terminated (as should have been the case), it will eventually go to a running state. However, since EG has lost "sight" of the kernel, you'll need to terminate the YARN application manually until you get things properly configured. At least this tells us that the launch aspect of things is working, just not the required communication back to EG.
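Until things are configured properly, orphaned applications can be cleaned up by hand, either with the CLI (`yarn application -kill <app-id>`) or via the ResourceManager's REST API. A small sketch of the REST route, using the RM address and application ID from this thread (building the request only; actually sending it requires access to the cluster):

```python
import json

def build_kill_request(rm_address, application_id):
    """Build the YARN ResourceManager REST call that transitions an
    application to the KILLED state (PUT /ws/v1/cluster/apps/<id>/state)."""
    url = f"{rm_address}/ws/v1/cluster/apps/{application_id}/state"
    payload = {"state": "KILLED"}
    return url, json.dumps(payload)

url, body = build_kill_request("http://yarn-m:8088", "application_1599816470122_0023")
print(url)
# Sending it (e.g., with requests, from a host that can reach the RM):
#   requests.put(url, data=body, headers={"Content-Type": "application/json"})
```

Incidentally, the `405` in the termination warning above is an HTTP "method not allowed" response from this kind of RM call, which is consistent with some version or endpoint mismatch between client and cluster.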
What should have happened, even in a failure state, is that EG should have noticed the faulty communication and issued a kill request against the YARN API. But that attempt failed here:
```
[W 2020-09-11 14:19:10.303 EnterpriseGatewayApp] Termination of application 'application_1599816470122_0023' failed with exception: 'Response finished with status: 405. Details: '. Continuing...
```
and this coroutine portion doesn't look correct and implies version-related issues of some sort...
```
[D 2020-09-11 14:19:15.358 EnterpriseGatewayApp] YarnClusterProcessProxy.kill, application ID: application_1599816470122_0023, kernel ID: 410c5e3b-60ae-4a29-90bb-abe4659cec4f, state: ACCEPTED, result: <coroutine object BaseProcessProxyABC.kill at 0x7ff3481fe200>
```
leading me to the following questions (including a request for the output of `pip freeze`).

Thanks for the detailed explanations! Let me try to answer your questions:
This is the output of `pip freeze` in the docker container:
adal @ file:///home/conda/feedstock_root/build_artifacts/adal_1591523462562/work
appdirs==1.4.3
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1596629848788/work
asn1crypto @ file:///home/conda/feedstock_root/build_artifacts/asn1crypto_1595949944546/work
async-generator==1.10
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1599308529326/work
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backports.functools-lru-cache==1.6.1
bcrypt @ file:///home/conda/feedstock_root/build_artifacts/bcrypt_1597603498735/work
bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1588608214987/work
blinker==1.4
brotlipy==0.7.0
cachetools @ file:///home/conda/feedstock_root/build_artifacts/cachetools_1593420445823/work
certifi==2020.6.20
cffi @ file:///tmp/build/80754af9/cffi_1598370813909/work
chardet==3.0.4
conda==4.8.4
conda-package-handling==1.7.0
cryptography @ file:///tmp/build/80754af9/cryptography_1598892037289/work
decorator==4.4.2
defusedxml==0.6.0
docker @ file:///home/conda/feedstock_root/build_artifacts/docker-py_1598002850473/work
docker-pycreds==0.4.0
entrypoints==0.3
future==0.18.2
google-auth @ file:///home/conda/feedstock_root/build_artifacts/google-auth_1599161935305/work
idna @ file:///tmp/build/80754af9/idna_1593446292537/work
importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1593211369179/work
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1595446871027/work/dist/ipykernel-5.3.4-py3-none-any.whl
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1598749946943/work
ipython-genutils==0.2.0
jedi==0.15.2
Jinja2==2.11.2
jsonschema==3.2.0
jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1598486169312/work
jupyter-core==4.6.3
jupyter-enterprise-gateway @ file:///home/conda/feedstock_root/build_artifacts/jupyter_enterprise_gateway_1599583981662/work
jupyterlab-pygments==0.1.1
kubernetes @ file:///home/conda/feedstock_root/build_artifacts/python-kubernetes_1588618980778/work
MarkupSafe==1.1.1
mistune==0.8.4
nbclient @ file:///home/conda/feedstock_root/build_artifacts/nbclient_1598558657104/work
nbconvert @ file:///home/conda/feedstock_root/build_artifacts/nbconvert_1599968913603/work
nbformat @ file:///home/conda/feedstock_root/build_artifacts/nbformat_1594060262917/work
nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1594996608835/work
notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1599742225943/work
oauthlib==3.0.1
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1589925210001/work
pandocfilters==1.4.2
paramiko @ file:///home/conda/feedstock_root/build_artifacts/paramiko_1598988093657/work
parso==0.5.2
pexpect==4.8.0
pickleshare==0.7.5
prometheus-client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1590412252446/work
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1598885455507/work
ptyprocess==0.6.0
pyasn1==0.4.8
pyasn1-modules==0.2.7
pycosat==0.6.3
pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work
pycryptodomex @ file:///home/conda/feedstock_root/build_artifacts/pycryptodomex_1593612858917/work
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1599933597045/work
PyJWT==1.7.1
pykerberos==1.2.1
PyNaCl==1.3.0
pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1594392929924/work
pyparsing==2.4.7
pyrsistent @ file:///home/conda/feedstock_root/build_artifacts/pyrsistent_1599988183291/work
PySocks @ file:///tmp/build/80754af9/pysocks_1594394576006/work
python-dateutil==2.8.1
PyYAML==5.3.1
pyzmq==19.0.2
requests @ file:///tmp/build/80754af9/requests_1592841827918/work
requests-kerberos @ file:///home/conda/feedstock_root/build_artifacts/requests-kerberos_1591363821346/work
requests-oauthlib @ file:///home/conda/feedstock_root/build_artifacts/requests-oauthlib_1595492159598/work
rsa @ file:///home/conda/feedstock_root/build_artifacts/rsa_1591996208734/work
ruamel-yaml==0.15.87
Send2Trash==1.5.0
six==1.15.0
terminado==0.8.3
testpath==0.4.4
tornado==6.0.4
tqdm @ file:///tmp/build/80754af9/tqdm_1596810128862/work
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1599471676085/work
urllib3 @ file:///tmp/build/80754af9/urllib3_1597086586889/work
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1595859607677/work
webencodings==0.5.1
websocket-client @ file:///home/conda/feedstock_root/build_artifacts/websocket-client_1598373690950/work
yarn-api-client==1.0.2
zipp==3.1.0
As I am not sure whether that is sufficient, and since I am running EG with conda, I'll also provide the output of `conda list` from the docker container:
# Name Version Build Channel
_libgcc_mutex 0.1 main
adal 1.2.4 pyh9f0ad1d_0 conda-forge
appdirs 1.4.3 py_1 conda-forge
argon2-cffi 20.1.0 py37h8f50634_1 conda-forge
asn1crypto 1.4.0 pyh9f0ad1d_0 conda-forge
async_generator 1.10 py_0 conda-forge
attrs 20.2.0 pyh9f0ad1d_0 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 py_2 conda-forge
backports.functools_lru_cache 1.6.1 py_0 conda-forge
bcrypt 3.2.0 py37h8f50634_0 conda-forge
bleach 3.1.5 pyh9f0ad1d_0 conda-forge
blinker 1.4 py_1 conda-forge
brotlipy 0.7.0 py37h7b6447c_1000
ca-certificates 2020.6.20 hecda079_0 conda-forge
cachetools 4.1.1 py_0 conda-forge
certifi 2020.6.20 py37hc8dfbb8_0 conda-forge
cffi 1.14.2 py37he30daa8_0
chardet 3.0.4 py37_1003
conda 4.8.4 py37hc8dfbb8_2 conda-forge
conda-package-handling 1.6.1 py37h7b6447c_0
cryptography 3.1 py37h1ba5d50_0
decorator 4.4.2 py_0 conda-forge
defusedxml 0.6.0 py_0 conda-forge
docker-py 4.3.1 py37hc8dfbb8_0 conda-forge
docker-pycreds 0.4.0 py_0 conda-forge
entrypoints 0.3 py37hc8dfbb8_1001 conda-forge
future 0.18.2 py37hc8dfbb8_1 conda-forge
gmp 6.2.0 he1b5a44_2 conda-forge
google-auth 1.21.1 py_0 conda-forge
idna 2.10 py_0
importlib-metadata 1.7.0 py37hc8dfbb8_0 conda-forge
importlib_metadata 1.7.0 0 conda-forge
ipykernel 5.3.4 py37h43977f1_0 conda-forge
ipython 7.18.1 py37hc6149b9_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.15.2 py37_0 conda-forge
jinja2 2.11.2 pyh9f0ad1d_0 conda-forge
jsonschema 3.2.0 py37hc8dfbb8_1 conda-forge
jupyter_client 6.1.7 py_0 conda-forge
jupyter_core 4.6.3 py37hc8dfbb8_1 conda-forge
jupyter_enterprise_gateway 2.2.0 py_0 conda-forge
jupyterlab_pygments 0.1.1 pyh9f0ad1d_0 conda-forge
krb5 1.17.1 hfafb76e_3 conda-forge
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libsodium 1.0.18 h516909a_0 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
markupsafe 1.1.1 py37h8f50634_1 conda-forge
mistune 0.8.4 py37h8f50634_1001 conda-forge
nbclient 0.5.0 py_0 conda-forge
nbconvert 6.0.2 py37hc8dfbb8_0 conda-forge
nbformat 5.0.7 py_0 conda-forge
ncurses 6.2 he6710b0_1
nest-asyncio 1.4.0 py_0 conda-forge
notebook 6.1.4 py37hc8dfbb8_0 conda-forge
oauthlib 3.0.1 py_0 conda-forge
openssl 1.1.1g h516909a_1 conda-forge
packaging 20.4 pyh9f0ad1d_0 conda-forge
pandoc 2.10.1 h516909a_0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
paramiko 2.7.2 pyh9f0ad1d_0 conda-forge
parso 0.5.2 py_0
pexpect 4.8.0 py37hc8dfbb8_1 conda-forge
pickleshare 0.7.5 py37hc8dfbb8_1001 conda-forge
pip 20.2.2 py37_0
prometheus_client 0.8.0 pyh9f0ad1d_0 conda-forge
prompt-toolkit 3.0.7 py_0 conda-forge
ptyprocess 0.6.0 py_1001 conda-forge
pyasn1 0.4.8 py_0 conda-forge
pyasn1-modules 0.2.7 py_0 conda-forge
pycosat 0.6.3 py37h7b6447c_0
pycparser 2.20 py_2
pycryptodomex 3.9.8 py37h8f50634_0 conda-forge
pygments 2.7.0 py_0 conda-forge
pyjwt 1.7.1 py_0 conda-forge
pykerberos 1.2.1 py37h74a8448_2 conda-forge
pynacl 1.3.0 py37h516909a_1001 conda-forge
pyopenssl 19.1.0 py_1
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
pyrsistent 0.17.3 py37h8f50634_0 conda-forge
pysocks 1.7.1 py37_1
python 3.7.9 h7579374_0
python-dateutil 2.8.1 py_0 conda-forge
python-kubernetes 11.0.0 py37hc8dfbb8_0 conda-forge
python_abi 3.7 1_cp37m conda-forge
pyyaml 5.3.1 py37h8f50634_0 conda-forge
pyzmq 19.0.2 py37hac76be4_0 conda-forge
readline 8.0 h7b6447c_0
requests 2.24.0 py_0
requests-kerberos 0.12.0 py37hc8dfbb8_1 conda-forge
requests-oauthlib 1.3.0 pyh9f0ad1d_0 conda-forge
rsa 4.6 pyh9f0ad1d_0 conda-forge
ruamel_yaml 0.15.87 py37h7b6447c_1
send2trash 1.5.0 py_0 conda-forge
setuptools 49.6.0 py37_0
six 1.15.0 py_0
sqlite 3.33.0 h62c20be_0
terminado 0.8.3 py37hc8dfbb8_1 conda-forge
testpath 0.4.4 py_0 conda-forge
tk 8.6.10 hbc83047_0
tornado 6.0.4 py37h8f50634_1 conda-forge
tqdm 4.48.2 py_0
traitlets 5.0.4 py_0 conda-forge
urllib3 1.25.10 py_0
wcwidth 0.2.5 pyh9f0ad1d_1 conda-forge
webencodings 0.5.1 py_1 conda-forge
websocket-client 0.57.0 py37hc8dfbb8_2 conda-forge
wheel 0.35.1 py_0
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
yarn-api-client 1.0.2 py_2 conda-forge
zeromq 4.3.2 he1b5a44_3 conda-forge
zipp 3.1.0 py_0 conda-forge
zlib 1.2.11 h7b6447c_3
Hadoop/YARN version of Docker host: 2.10
Hadoop/YARN version of container: 2.7.7
Spark version: 2.4.6
`--conf spark.yarn.dist.archives=/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster.zip#spark_python_yarn_cluster` to the workers.

[D 2020-09-14 07:52:06.430 EnterpriseGatewayApp] Starting kernel (async): ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '--RemoteProcessProxy.kernel-id', '4798307d-dc70-452e-9d0c-0c4e87cb3356', '--RemoteProcessProxy.response-address', '10.132.0.25:39735', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2020-09-14 07:52:06.430 EnterpriseGatewayApp] Launching kernel: spark-python-yarn-cluster with command: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '--RemoteProcessProxy.kernel-id', '4798307d-dc70-452e-9d0c-0c4e87cb3356', '--RemoteProcessProxy.response-address', '10.132.0.25:39735', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2020-09-14 07:52:06.430 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'PATH': '/usr/lib/jvm/java-1.8.0-openjdk-amd64//bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//jre:/opt/spark/bin:/opt/spark/sbin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'KERNEL_USERNAME': 'guest', 'KERNEL_LAUNCH_TIMEOUT': '40', 'KERNEL_WORKING_DIR': '/Users/user/path/work_tmp', 'EG_KERNEL_LAUNCH_TIMEOUT': '600', 'HADOOP_CONF_DIR': '/opt/hadoop/etc/hadoop/', 'SPARK_HOME': '/opt/spark', 'SPARK_CONF_DIR': '/opt/spark/conf', 'PYSPARK_PYTHON': '/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python', 'PYTHONPATH': '/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python', 'SPARK_OPTS': '--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python --conf spark.yarn.appMasterEnv.PATH=. spark_python_yarn_cluster/spark_python_yarn_cluster/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//jre:/opt/spark/bin:/opt/spark/sbin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ${KERNEL_EXTRA_SPARK_OPTS} --conf spark.yarn.dist.archives=/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster.zip#spark_python_yarn_cluster ', 'LAUNCH_OPTS': '', 'KERNEL_GATEWAY': '1', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'KERNEL_ID': '4798307d-dc70-452e-9d0c-0c4e87cb3356', 'KERNEL_LANGUAGE': 'python', 'EG_IMPERSONATION_ENABLED': 'False'}
[D 2020-09-14 07:52:06.438 EnterpriseGatewayApp] Yarn cluster kernel launched using YARN RM address: http://yarn-m:8088, pid: 11, Kernel ID: 4798307d-dc70-452e-9d0c-0c4e87cb3356, cmd: '['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '--RemoteProcessProxy.kernel-id', '4798307d-dc70-452e-9d0c-0c4e87cb3356', '--RemoteProcessProxy.response-address', '10.132.0.25:39735', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']'
Starting IPython kernel for Spark in Yarn Cluster mode on behalf of user guest
+ eval exec /opt/spark/bin/spark-submit '--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python --conf spark.yarn.appMasterEnv.PATH=.spark_python_yarn_cluster/spark_python_yarn_cluster/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//jre:/opt/spark/bin:/opt/spark/sbin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ${KERNEL_EXTRA_SPARK_OPTS} --conf spark.yarn.dist.archives=/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster.zip#spark_python_yarn_cluster ' '' /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py '' --RemoteProcessProxy.kernel-id 4798307d-dc70-452e-9d0c-0c4e87cb3356 --RemoteProcessProxy.response-address 10.132.0.25:39735 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
++ exec /opt/spark/bin/spark-submit --master yarn --deploy-mode cluster --name 4798307d-dc70-452e-9d0c-0c4e87cb3356 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= spark_python_yarn_cluster/spark_python_yarn_cluster/bin/python --conf spark.yarn.appMasterEnv.PATH=.spark_python_yarn_cluster/spark_python_yarn_cluster/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64//jre:/opt/spark/bin:/opt/spark/sbin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin --conf spark.yarn.dist.archives=/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster.zip#spark_python_yarn_cluster /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py --RemoteProcessProxy.kernel-id 4798307d-dc70-452e-9d0c-0c4e87cb3356 --RemoteProcessProxy.response-address 10.132.0.25:39735 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
[D 2020-09-14 07:52:06.975 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:07.484 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:07.991 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:08.499 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:09.012 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:09.521 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:10.029 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:10.537 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:11.048 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:11.567 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:12.073 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[D 2020-09-14 07:52:12.580 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:13.088 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:13.597 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:14 INFO client.RMProxy: Connecting to ResourceManager at yarn-m/10.132.0.25:8032
[D 2020-09-14 07:52:14.117 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:14 INFO client.AHSProxy: Connecting to Application History server at yarn-m/10.132.0.25:10200
20/09/14 07:52:14 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
[D 2020-09-14 07:52:14.625 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:14 INFO conf.Configuration: resource-types.xml not found
20/09/14 07:52:14 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/09/14 07:52:14 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
20/09/14 07:52:14 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
20/09/14 07:52:14 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (6144 MB per container)
20/09/14 07:52:14 INFO yarn.Client: Will allocate AM container, with 2304 MB memory including 384 MB overhead
20/09/14 07:52:14 INFO yarn.Client: Setting up container launch context for our AM
20/09/14 07:52:14 INFO yarn.Client: Setting up the launch environment for our AM container
20/09/14 07:52:14 INFO yarn.Client: Preparing resources for our AM container
20/09/14 07:52:14 INFO yarn.Client: Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/spark_python_yarn_cluster#spark_python_yarn_cluster -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/spark_python_yarn_cluster
[D 2020-09-14 07:52:15.136 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:15.642 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:16.151 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:16.659 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:17.165 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:17.672 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:18.204 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:18.720 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:19.229 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:19.738 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:20.245 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:20.755 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:21.261 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:21.772 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:22.283 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:22.791 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:22 INFO yarn.Client: Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/launch_ipykernel.py
20/09/14 07:52:22 INFO yarn.Client: Uploading resource file:/opt/spark/python/lib/pyspark.zip -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/pyspark.zip
20/09/14 07:52:22 INFO yarn.Client: Uploading resource file:/opt/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/py4j-0.10.7-src.zip
[D 2020-09-14 07:52:23.307 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:23.814 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:23 INFO yarn.Client: Uploading resource file:/hadoop/spark/tmp/spark-c39f9d45-c74c-45d8-8685-f03442796cd4/__spark_conf__5311454277149128232.zip -> hdfs://yarn-m/user/root/.sparkStaging/application_1600064376909_0001/__spark_conf__.zip
20/09/14 07:52:23 INFO spark.SecurityManager: Changing view acls to: root
20/09/14 07:52:23 INFO spark.SecurityManager: Changing modify acls to: root
20/09/14 07:52:23 INFO spark.SecurityManager: Changing view acls groups to:
20/09/14 07:52:23 INFO spark.SecurityManager: Changing modify acls groups to:
20/09/14 07:52:23 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
[D 2020-09-14 07:52:24.321 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:24.833 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:25.342 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
[D 2020-09-14 07:52:25.856 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' - retrying...
20/09/14 07:52:25 INFO yarn.Client: Submitting application application_1600064376909_0001 to ResourceManager
[I 2020-09-14 07:52:26.374 EnterpriseGatewayApp] ApplicationID: 'application_1600064376909_0001' assigned for KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', state: SUBMITTED, 20.0 seconds after starting.
[D 2020-09-14 07:52:26.379 EnterpriseGatewayApp] 39: State: 'SUBMITTED', Host: '', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
20/09/14 07:52:26 INFO impl.YarnClientImpl: Submitted application application_1600064376909_0001
20/09/14 07:52:26 INFO yarn.Client: Application report for application_1600064376909_0001 (state: ACCEPTED)
20/09/14 07:52:26 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1600069946170
final status: UNDEFINED
tracking URL: http://yarn-m:8088/proxy/application_1600064376909_0001/
user: root
20/09/14 07:52:26 INFO util.ShutdownHookManager: Shutdown hook called
20/09/14 07:52:26 INFO util.ShutdownHookManager: Deleting directory /hadoop/spark/tmp/spark-c39f9d45-c74c-45d8-8685-f03442796cd4
20/09/14 07:52:26 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5d60fe82-44b5-43af-9b93-3f5528dc9421
[D 2020-09-14 07:52:26.978 EnterpriseGatewayApp] 40: State: 'ACCEPTED', Host: 'yarn-w-0.c.internal', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
[D 2020-09-14 07:52:31.984 EnterpriseGatewayApp] Waiting for KernelID '4798307d-dc70-452e-9d0c-0c4e87cb3356' to send connection info from host 'yarn-w-0.c.internal' - retrying...
[D 2020-09-14 07:52:32.493 EnterpriseGatewayApp] 41: State: 'ACCEPTED', Host: 'yarn-w-0.c.internal', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
[D 2020-09-14 07:52:37.497 EnterpriseGatewayApp] Waiting for KernelID '4798307d-dc70-452e-9d0c-0c4e87cb3356' to send connection info from host 'yarn-w-0.c.internal' - retrying...
[D 2020-09-14 07:52:38.009 EnterpriseGatewayApp] 42: State: 'ACCEPTED', Host: 'yarn-w-0.c.internal', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
[D 2020-09-14 07:52:43.012 EnterpriseGatewayApp] Waiting for KernelID '4798307d-dc70-452e-9d0c-0c4e87cb3356' to send connection info from host 'yarn-w-0.c.internal' - retrying...
[D 2020-09-14 07:52:43.522 EnterpriseGatewayApp] 43: State: 'ACCEPTED', Host: 'yarn-w-0.c.internal', KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356', ApplicationID: 'application_1600064376909_0001'
[D 2020-09-14 07:52:48.527 EnterpriseGatewayApp] Waiting for KernelID '4798307d-dc70-452e-9d0c-0c4e87cb3356' to send connection info from host 'yarn-w-0.c.internal' - retrying...
[W 2020-09-14 07:52:49.056 EnterpriseGatewayApp] Termination of application 'application_1600064376909_0001' failed with exception: 'Response finished with status: 405. Details: '. Continuing...
[D 2020-09-14 07:52:54.123 EnterpriseGatewayApp] YarnClusterProcessProxy.kill, application ID: application_1600064376909_0001, kernel ID: 4798307d-dc70-452e-9d0c-0c4e87cb3356, state: ACCEPTED, result: <coroutine object BaseProcessProxyABC.kill at 0x7f5aeeb863b0>
/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/yarn.py:329: RuntimeWarning: coroutine 'BaseProcessProxyABC.kill' was never awaited
self.kill()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
[D 2020-09-14 07:52:54.202 EnterpriseGatewayApp] response socket still open, close it
[E 2020-09-14 07:52:54.203 EnterpriseGatewayApp] KernelID: '4798307d-dc70-452e-9d0c-0c4e87cb3356' launch timeout due to: YARN resources unavailable after 43.0 seconds for app application_1600064376909_0001, launch timeout: 40.0! Check YARN configuration.
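The `coroutine 'BaseProcessProxyABC.kill' was never awaited` warning in the log above is a general asyncio symptom: calling an `async def` method without awaiting it only creates a coroutine object and runs nothing. A minimal reproduction with hypothetical class names:

```python
import asyncio

class ProcessProxy:
    async def kill(self):
        self.killed = True
        return "killed"

proxy = ProcessProxy()

result = proxy.kill()               # BUG: just creates a coroutine object
print(asyncio.iscoroutine(result))  # True -- nothing has executed yet
result.close()                      # avoid the "never awaited" RuntimeWarning

outcome = asyncio.run(proxy.kill())  # correct: actually runs the method
print(outcome)                       # killed
```

This matches the log line where `result` is printed as `<coroutine object BaseProcessProxyABC.kill ...>` rather than a kill outcome, consistent with the version-mismatch suspicion raised earlier in the thread.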
Hi @dnks23 - thanks for the information. Given that you see the kernel eventually reach RUNNING state, I think one of two things is going on: either the launch timeout is simply too small for your cluster, or the kernel's response cannot reach EG at `10.132.0.25:39735`.
When a remote kernel is launched via EG, EG will start listening for a response for the kernel's connection information once it discovers (via the native resource manager's API) a host has been assigned. In the output above, this occurs at 07:52:26.978
. Here you see EG enter a 5-second wait cycle for that particular launch (the wait is asynchronous so simultaneous kernel start requests will be interleaved here as well) which, after the overall 40 second period has expired, it gives up.
If your YARN cluster is heavily loaded, this could be an issue and you should increase the kernel launch timeout. The kernel launch timeout is closely tied to the request timeout, since kernel starts are essentially a POST request. If you're using Notebook 6.1+, both of these values are synchronized, so setting one appropriately adjusts the other. The kernel launch timeout can be configured on the client by setting the env `KERNEL_LAUNCH_TIMEOUT` to the number of seconds you want to wait; I would suggest increasing it to 120 for now. The request timeout can be set either via an env (`JUPYTER_GATEWAY_REQUEST_TIMEOUT`) or a config/command-line option (`--GatewayClient.request_timeout`). Once configured, restart the notebook server on the client.
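For example, on the client (the value 120 follows the suggestion above; the variable names are the ones mentioned in this thread):

```shell
# Raise both client-side timeouts before starting the notebook server.
# Kernel starts are a POST, so the request timeout should cover the
# launch timeout; Notebook 6.1+ keeps the two values synchronized.
export KERNEL_LAUNCH_TIMEOUT=120
export JUPYTER_GATEWAY_REQUEST_TIMEOUT=120

echo "launch timeout: ${KERNEL_LAUNCH_TIMEOUT}s"
# prints: launch timeout: 120s
```

The request timeout can instead be passed as `--GatewayClient.request_timeout=120` on the `jupyter lab` command line; either way, restart the notebook server afterwards.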
To determine if the response is the issue, you should probably take a look at the kernel launcher output, which should be available in the application's stdout log (check stderr as well, but I tend to find things in stdout). In this case, it would be associated with `application_1600064376909_0001`. If it cannot send its connection information back to EG, there should be some indication there. We print the payload being returned and its encrypted form.
`10.132.0.25` is an internal IP. Does the YARN cluster have access to that IP? Is this value being configured via `EG_RESPONSE_IP`, or is this the IP that EG determined is a local IP address?
Unfortunately, we associate a different port with each kernel launch, so the port number will vary every time (we're currently working on a single-response-address capability that should be available in EG 3.0), so another issue might be that this port isn't accessible from outside the EG container. You might want to look into using EG's port-range capabilities: configure `EG_PORT_RANGE` (or `--EnterpriseGatewayApp.port_range`) and then start the EG container with that range published via `-p`.
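A sketch of that combination, assuming the standard `elyra/enterprise-gateway` image and an arbitrary 100-port range (the specific range is an assumption; pick one that is free in your environment):

```shell
# EG_PORT_RANGE uses "lower..upper"; docker's -p uses "lower-upper".
# Pinning kernel ports to a known range lets us publish exactly that range.
docker run -d \
  -e EG_PORT_RANGE=45000..45099 \
  -p 45000-45099:45000-45099 \
  -p 8888:8888 \
  elyra/enterprise-gateway
```

Note that with host networking (`--network host`) the publishing step becomes unnecessary, which is another reason host networking sidesteps the problem.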
Upon receipt of the kernel's connection information, the logs should show a kernel-info-request, followed by a kernel-info-reply and WebSocket establishment. Once that completes, you should be able to use your kernel from the notebook.
Hey @kevin-bates, thanks again for the detailed explanations and your pointers. I actually started with setting `EG_KERNEL_LAUNCH_TIMEOUT=60` on the client side - and that already did the trick! Kernels start up as expected and are available from the client notebook now. Logs look fine now too!
Thanks for assisting in getting this basic EG-Docker-YARN setup to work - the single response socket issue seems to be bypassed by using the host network, although I guess (and as you already mentioned) a proper implementation in one of the next releases would be nice!
Awesome - glad to hear you're moving forward!
Do you mind summarizing what additional items you needed to configure? E.g., EG_RESPONSE_IP
, EG_PROHIBITED_LOCAL_IPS
, etc?
This may prove useful to others - thank you.
I did not need to set any other configuration. Only increasing `EG_KERNEL_LAUNCH_TIMEOUT` and using the Docker host network did the trick. But I assume that all the other items you mentioned may need attention and are worth inspecting in case someone else runs into issues with this setup.
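For anyone landing here later, a sketch of the minimal working setup described in this thread (the image name and port 8888 are assumptions based on the standard EG Docker image, not taken from this thread):

```shell
# On the YARN master (which is also the Docker host): run EG with host
# networking so kernel response traffic bypasses Docker's internal IPs.
docker run -d --network host elyra/enterprise-gateway

# On the client: raise the launch timeout, then point Jupyter at the gateway.
export EG_KERNEL_LAUNCH_TIMEOUT=60
jupyter lab --gateway-url=http://<yarn-master-host>:8888
```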
Description
Hi, I would like to create a basic Docker image that I can easily set up and replicate on any existing YARN cluster. The idea would be to run the Docker container on the YARN master, with the gateway targeting that YARN master, which is also the Docker host in this case. I've seen all the images provided in this repo, and there are also some from the Jupyter stack, but those seem quite bloated, as they contain their own installations of Hadoop/Spark/etc.
I wonder whether this is really necessary and what a minimal setup would look like that lets the gateway running in the container target the Hadoop/Spark already provided by the YARN master/Docker host.
Environment