I tried reproducing this on the latest wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
This time I ran:
ray start --head
on the head node, and on the worker node I ran the ray start command that appeared in the head node's output.
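For reference, the join command printed by the head with this wheel looks roughly like the following (the address and password here are placeholders, not the actual values from my setup):
ray start --address='<HEAD_NODE_IP>:6379' --redis-password='<PASSWORD>'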
The dashboard shows both nodes as connected, along with their CPU resources. Nevertheless, when I try to schedule the f() method from the reproduction, it only runs on the head, and the worker node's workers stay idle. I also tried a heavier load with some numpy calculations; while the head is busy for a few minutes, the worker node sits idle. The Python interpreter on the head prints the following messages:
E0220 14:37:39.706605 4297 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
E0220 14:37:39.706764 4297 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
E0220 14:37:39.706926 4297 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
Some more information: it seems that because the worker node runs inside a Docker container, it reports its Docker-internal IP to the head node rather than the machine's IP (external to Docker). My current assumption is that this is the cause.
Following https://github.com/ray-project/ray/issues/5442#issuecomment-523229964 I tried using the --node-ip-address flag on the worker node, without any further success. When connecting to the head node and running ray.nodes() (and likewise in the web UI), it still shows the worker node with its Docker-internal IP.
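For reference, the worker-side command I tried was roughly the following (a sketch; the IPs are placeholders for my actual addresses):
# on the worker, inside the container (on newer wheels --address replaces --redis-address)
ray start --redis-address=<HEAD_IP>:6379 --node-ip-address=<WORKER_EXTERNAL_IP> --node-manager-port=12345 --object-manager-port=12346 --resources='{"Node": 1.0}'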
OK, I now managed to supply the worker node with its own external IP and see it inside the head node's resources list. This didn't solve the problem though. I'm still getting:
...
E0220 18:04:58.267797 1310 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
E0220 18:04:58.269181 1310 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
E0220 18:04:58.269912 1310 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
I had the same problem, but I figured it out by opening the firewall on the slave nodes.
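For anyone else hitting this: on Ubuntu with ufw, opening the ports used in the ray start commands from this thread would look roughly like the following (a sketch; your port numbers may differ, and Ray workers can also use additional dynamically assigned ports):
# on the worker nodes: node manager and object manager ports
sudo ufw allow 12345/tcp
sudo ufw allow 12346/tcp
# on the head node: redis ports
sudo ufw allow 6379/tcp
sudo ufw allow 6380/tcp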
I also have the same problem ...
Has anyone resolved this yet? I'm facing the same issue.
@deepankar27 @wangzelong0663 It is probably related to networking. Try disabling your firewall, work around your proxy, and test for open connections using common tools such as telnet.
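For example, from each machine you can check that the other side's ports are reachable with something like the following (a sketch using the ports from this thread; substitute your actual IPs and ports):
# from the worker: can we reach the head's redis port?
telnet <HEAD_IP> 6379
# from the head: can we reach the worker's node manager / object manager ports?
telnet <WORKER_IP> 12345
telnet <WORKER_IP> 12346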
-> In a private cluster setup <-
It started with Tune not using nodes other than the head; we then took a few steps back and tried running lower-level Ray snippets to identify the issue, and ended up with the errors shown below. This seems related to #5223, but that issue was unfortunately closed without resolution, so I'm opening this one instead.
Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.7.2
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I'm running on two local machines with Ubuntu 16.04, inside Docker containers.
Steps followed:
ray start --head --redis-port=6379 --redis-shard-ports=6380 --node-manager-port=12345 --object-manager-port=12346 --resources='{"Driver": 1.0}' --num-cpus=0
on the head node,
ray start --redis-address=<IP_OF_HEAD>:6379 --node-manager-port=12345 --object-manager-port=12346 --resources='{"Node": 1.0}'
on the worker node, and then, from a Python interpreter on the head:
ray.init(redis_address="10.67.0.201:6379")
ray.cluster_resources()
[ {'ClientID': 'e98dd25ed961a708adf90566c20abd2e77b4deb5', 'EntryType': 0, 'NodeManagerAddress': '10.67.0.201', 'NodeManagerPort': 12345, 'ObjectManagerPort': 12346, 'ObjectStoreSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/raylet', 'Resources': {'GPU': 1.0, 'Driver': 1.0}},
{'ClientID': 'd38de4071fd540f93c2dd531915c327ef877ed8b', 'EntryType': 1, 'NodeManagerAddress': '10.67.0.163', 'NodeManagerPort': 12345, 'ObjectManagerPort': 12346, 'ObjectStoreSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/raylet', 'Resources': {'CPU': 12.0, 'GPU': 1.0, 'Node': 1.0}}]
import time
import ray

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(100)]))
...
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 62a3c929e6460dfe47fe94ce0acb682b is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 7f4d78d43b607006628488e069753533 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 82e8b3db13d57242e0b7112954d04c83 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID c2668d440194bc623a1b95ab848731a1 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
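Since the nodes are started with custom "Driver" and "Node" resources, one way to check whether scheduling itself targets the worker is to request the custom resource explicitly. A minimal sketch (this only pins placement to the worker node; it does not address the connection errors above):
import time
import ray

ray.init(redis_address="10.67.0.201:6379")

# Requesting a fraction of the custom "Node" resource means the task can only
# be placed on the worker node, which was started with --resources='{"Node": 1.0}'.
@ray.remote(resources={"Node": 0.01})
def g():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

print(set(ray.get([g.remote() for _ in range(10)])))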