Open smartguo opened 3 years ago
With version 0.8.0a2, the code mentioned above raises an error like this:
Traceback (most recent call last):
  File "test_yarn.py", line 37, in <module>
    worker_cache_mem='3g')
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/yarn/client.py", line 143, in new_cluster
    is_client_managed=is_client_managed)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/yarn/client.py", line 35, in __init__
    self._session = new_session(endpoint)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 1475, in new_session
    backend=backend, new=True, **kwargs)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 1175, in init
    isolated_session = fut.result()
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 657, in init
    return await _IsolatedWebSession._init(address, session_id, new=new, timeout=timeout)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/deploy/oscar/session.py", line 945, in _init
    await session_api.create_session(session_id)
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/services/session/api/web.py", line 135, in create_session
    res = await self._request_url(path=addr, method='PUT', data=b'')
  File "/opt/anaconda3/envs/pymodel/lib/python3.7/site-packages/mars/services/web/core.py", line 176, in _request_url
    res = await self._client.fetch(path, method=method, raise_error=False, **kwargs)
ConnectionRefusedError: [Errno 111] Connection refused
Oddly, this error first occurred after several successful cluster creations, and since then it fails every time.
What's your Hadoop version? We will try to reproduce the failure within this week.
Hadoop 2.6.0-cdh5.13.0
Describe the bug
When deploying on YARN, cluster creation occasionally fails. Even when the cluster is created successfully, subsequent tasks, such as operations on a DataFrame, also fail with a similar error, and the cluster is destroyed.
To Reproduce
To help us reproduce this bug, please provide the information below:
import os

import mars.tensor as mt
from mars.deploy.yarn import new_cluster

# Point the client at the local Java and Hadoop installations.
os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_191-amd64'
os.environ['HADOOP_HOME'] = "/usr/bin/hadoop"
os.environ['ARROW_LIBHDFS_DIR'] = "/opt/cloudera/parcels/CDH/lib64/"

# Start a Mars cluster on YARN: 1 supervisor, 1 web service, 4 workers.
cluster = new_cluster(
    environment='python:///opt/anaconda3/envs/pymodel/bin/python',
    supervisor_num=1,
    supervisor_cpu=1,
    supervisor_mem='4g',
    web_num=1,
    app_name="mars-yarn-test",
    worker_num=4,
    worker_cpu=8,
    worker_mem='16g',
    min_worker_num=2,
    timeout=6000,
    supervisor_extra_args='--log-level DEBUG',
    worker_extra_env={
        "ARROW_LIBHDFS_DIR": "/opt/cloudera/parcels/CDH/lib64/",
        "MARS_USE_PROCESS_STAT": "1",
        'HADOOP_HOME': "/opt/cloudera/parcels/CDH/lib/hadoop/"
    },
    supervisor_extra_env={
        "MARS_USE_PROCESS_STAT": "1",
        "ARROW_LIBHDFS_DIR": "/opt/cloudera/parcels/CDH/lib64/"
    },
    worker_cache_mem='3g')
print(cluster.endpoint)

# Run a small tensor computation on the new cluster.
a = mt.random.rand(10, 10)
print(a.dot(a.T).execute())
Expected behavior
The cluster is created successfully.
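Since the traceback ends in a ConnectionRefusedError while the session is being created against the web endpoint, one possible way to narrow this down is to check whether anything is listening on that address at all. The sketch below is only a diagnostic idea using the Python standard library; `wait_for_endpoint` and its parameters are hypothetical names, not part of Mars, and it assumes the web endpoint is a plain `host:port` or `http://host:port` address taken from the YARN application logs.

```python
import socket
import time
from urllib.parse import urlparse


def wait_for_endpoint(endpoint, attempts=5, delay=2.0, timeout=3.0):
    """Return True once a TCP connection to the endpoint succeeds, else False.

    Hypothetical helper, not a Mars API: it only probes the TCP port.
    """
    # Accept both "host:port" and "http://host:port".
    parsed = urlparse(endpoint if "://" in endpoint else "http://" + endpoint)
    host = parsed.hostname
    port = parsed.port or 80

    for attempt in range(1, attempts + 1):
        try:
            # Opening and immediately closing a connection is enough to
            # tell whether some process is listening on the port.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:  # ConnectionRefusedError is a subclass of OSError
            print(f"attempt {attempt}/{attempts}: {host}:{port} not reachable ({exc})")
            if attempt < attempts:
                time.sleep(delay)
    return False
```

A result of False after several attempts would suggest the web service never started or has already exited, in which case the YARN container logs are the next place to look. A result of True only shows that the port is open; it does not show that the Mars web service behind it is healthy.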