[BUG] creation of the cluster occasionally fails when deploying on yarn

smartguo commented 3 years ago

Describe the bug: When deploying on YARN, cluster creation occasionally fails. Even when the cluster is created successfully, subsequent tasks, such as operations on a DataFrame, also fail with a similar error, and the cluster is destroyed.

To Reproduce: To help us reproduce this bug, please provide the information below:

Your Python version: 3.7.9
The version of Mars you use: pymars[distributed]==0.7.0rc2
Versions of crucial packages, such as numpy, scipy and protobuf: numpy==1.19.4, scipy==1.5.4, protobuf==3.14.0, pyarrow==2.0.0
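For anyone filling in the same fields, the versions can be collected programmatically rather than copied by hand. The following is a minimal sketch that runs on Python 3.7; `collect_versions` is a hypothetical helper written for this illustration and is not part of Mars. It reads each module's `__version__` attribute, which the packages listed above expose.

```python
import importlib
import platform


def collect_versions(module_names):
    """Return a dict mapping each importable module name to its reported version."""
    versions = {"python": platform.python_version()}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is missing from this environment.
            versions[name] = "not installed"
        else:
            # Not every module defines __version__, so fall back to a placeholder.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

Calling `collect_versions(["mars", "numpy", "scipy", "google.protobuf", "pyarrow"])` inside the same conda environment that is shipped to YARN should produce the values requested above.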

Minimized code and full stack of the error.


```python
import os
from mars.deploy.yarn import new_cluster
import mars.tensor as mt

os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_191-amd64'
os.environ['HADOOP_HOME'] = "/usr/bin/hadoop"
os.environ['ARROW_LIBHDFS_DIR'] = "/opt/cloudera/parcels/CDH/lib64/"

cluster = new_cluster(
    environment='python:///opt/anaconda3/envs/pymodel/bin/python',
    supervisor_num=1,
    supervisor_cpu=1,
    supervisor_mem='4g',
    web_num=1,
    app_name="mars-yarn-test",
    worker_num=4,
    worker_cpu=8,
    worker_mem='16g',
    min_worker_num=2,
    timeout=6000,
    supervisor_extra_args='--log-level DEBUG',
    worker_extra_env={
        "ARROW_LIBHDFS_DIR": "/opt/cloudera/parcels/CDH/lib64/",
        "MARS_USE_PROCESS_STAT": "1",
        'HADOOP_HOME': "/opt/cloudera/parcels/CDH/lib/hadoop/"
    },
    supervisor_extra_env={
        "MARS_USE_PROCESS_STAT": "1",
        "ARROW_LIBHDFS_DIR": "/opt/cloudera/parcels/CDH/lib64/"
    },
    worker_cache_mem='3g')
print(cluster.endpoint)

a = mt.random.rand(10, 10)
print(a.dot(a.T).execute())
```

```bash
21/07/10 23:38:34 INFO client.RMProxy: Connecting to ResourceManager at 172.16.1.139/172.16.1.139:8032
21/07/10 23:38:35 INFO skein.Driver: Driver started, listening on 53602
21/07/10 23:38:35 INFO hdfs.DFSClient: Created token for smartguo: HDFS_DELEGATION_TOKEN owner=smartguo@HADOOP.COM, renewer=yarn, realUser=, issueDate=1625931515729, maxDate=1783611515729, sequenceNumber=133636, masterKeyId=838 on 172.16.1.139:8020
21/07/10 23:38:35 INFO security.TokenCache: Got dt for hdfs://172.16.1.139:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 172.16.1.139:8020, Ident: (token for smartguo: HDFS_DELEGATION_TOKEN owner=smartguo@HADOOP.COM, renewer=yarn, realUser=, issueDate=1625931515729, maxDate=1783611515729, sequenceNumber=133636, masterKeyId=838)
21/07/10 23:38:35 INFO skein.Driver: Uploading application resources to hdfs://172.16.1.139:8020/user/smartguo/.skein/application_1614664426291_1326
21/07/10 23:38:36 INFO skein.Driver: Submitting application...
21/07/10 23:38:36 INFO impl.YarnClientImpl: Submitted application application_1614664426291_1326
21/07/10 23:38:56 INFO client.RMProxy: Connecting to ResourceManager at 172.16.1.139/172.16.1.139:8032
21/07/10 23:38:56 INFO skein.Driver: Driver started, listening on 59098
21/07/10 23:38:57 INFO impl.YarnClientImpl: Killed application application_1614664426291_1326
Traceback (most recent call last):
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/aiohttp/connector.py", line 969, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs)  # type: ignore  # noqa
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/asyncio/base_events.py", line 962, in create_connection
    raise exceptions[0]
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/asyncio/base_events.py", line 949, in create_connection
    await self.sock_connect(sock, address)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/asyncio/selector_events.py", line 473, in sock_connect
    return await fut
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/asyncio/selector_events.py", line 503, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('0.0.0.0', 54712)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_yarn.py", line 35, in <module>
    worker_cache_mem='3g')
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/yarn/client.py", line 143, in new_cluster
    is_client_managed=is_client_managed)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/yarn/client.py", line 35, in __init__
    self._session = new_session(endpoint).as_default()
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 1411, in new_session
    backend=backend, new=True, **kwargs)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 1133, in init
    isolated_session = fut.result()
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 635, in init
    return await _IsolatedWebSession._init(address, session_id, new=new)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 913, in _init
    await session_api.create_session(session_id)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/services/session/api/web.py", line 93, in create_session
    res = await self._request_url(path=addr, method='PUT', data=b'')
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/services/web/core.py", line 186, in _request_url
    res = await self._client.request(method, path, **kwargs)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/aiohttp/client.py", line 521, in _request
    req, traces=traces, timeout=real_timeout
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/aiohttp/connector.py", line 535, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/aiohttp/connector.py", line 892, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/aiohttp/connector.py", line 1051, in _create_direct_connection
    raise last_exc
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/aiohttp/connector.py", line 1032, in _create_direct_connection
    client_error=client_error,
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 0.0.0.0:54712 ssl:default [Connect call failed ('0.0.0.0', 54712)]
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f01a13bdbd0>
```

Expected behavior: The cluster is created successfully.
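One detail in the traceback worth checking: the client tries to reach `0.0.0.0:54712`. `0.0.0.0` is the unspecified (wildcard) address, and a connection to it normally only succeeds when the service listens on the same machine as the client. Whether this is the cause here is not confirmed in this thread. The sketch below flags such endpoints using only the standard library; `is_wildcard_endpoint` is a hypothetical helper, and the assumption that the endpoint is either `host:port` or a URL is mine.

```python
import ipaddress
from urllib.parse import urlsplit


def is_wildcard_endpoint(endpoint):
    """Return True if the endpoint's host is an unspecified address such as 0.0.0.0."""
    # Accept both "host:port" and "http://host:port" forms.
    if "//" not in endpoint:
        endpoint = "//" + endpoint
    host = urlsplit(endpoint).hostname
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host).is_unspecified
    except ValueError:
        # Not an IP literal (for example a DNS name), so not a wildcard address.
        return False
```

If `is_wildcard_endpoint(cluster.endpoint)` returns `True`, the printed endpoint may not be reachable from a client running on a different node than the web service.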

smartguo commented 3 years ago

With version 0.8.0a2, the code above raises the following error:

Traceback (most recent call last):
  File "test_yarn.py", line 37, in <module>
    worker_cache_mem='3g')
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/yarn/client.py", line 143, in new_cluster
    is_client_managed=is_client_managed)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/yarn/client.py", line 35, in __init__
    self._session = new_session(endpoint)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 1475, in new_session
    backend=backend, new=True, **kwargs)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 1175, in init
    isolated_session = fut.result()
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 657, in init
    return await _IsolatedWebSession._init(address, session_id, new=new, timeout=timeout)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 945, in _init
    await session_api.create_session(session_id)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/services/session/api/web.py", line 135, in create_session
    res = await self._request_url(path=addr, method='PUT', data=b'')
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/services/web/core.py", line 176, in _request_url
    res = await self._client.fetch(path, method=method, raise_error=False, **kwargs)
ConnectionRefusedError: [Errno 111] Connection refused

Oddly, this error first occurred after several successful cluster creations, and since then creation has failed every time.
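To tell a web service that is merely slow to start apart from one that is never reachable, the port can be polled before opening a session. This is a minimal standard-library sketch, not something Mars provides; `wait_for_port` and its parameters are hypothetical names chosen for illustration.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Poll host:port until a TCP connection succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers ConnectionRefusedError and connect timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

If the port never opens within the timeout, the YARN container logs for the web service are the next place to look.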

qinxuye commented 3 years ago

What's your Hadoop version? We will try to reproduce the failure within this week.

smartguo commented 3 years ago

> What's your Hadoop version? We will try to reproduce the failure within this week.

Hadoop 2.6.0-cdh5.13.0

mars-project / mars

[BUG] creation of the cluster occasionally fails when deploying on yarn #2209