ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Failed to forward task to node manager [ray] #7242

Closed: roireshef closed this issue 4 years ago

roireshef commented 4 years ago

**In a private cluster setup**

It started with Tune not using any nodes other than the head. We then took a few steps back and tried running lower-level Ray snippets to identify the issue. We ended up getting:

I0220 14:08:16.620895  3799 node_manager.cc:2149] Failed to forward task ff237343da81f0ef334299ed90993428 to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.620929  3799 node_manager.cc:2149] Failed to forward task 60945591c66e8a468153e975ec34745b to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.620961  3799 node_manager.cc:2149] Failed to forward task df3e50eb1c1503bb6f7446700473889e to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.620995  3799 node_manager.cc:2149] Failed to forward task 21e58caf15d4be68b02b94621b96e6b5 to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.621026  3799 node_manager.cc:2149] Failed to forward task 28cfb580f25d19ca87f3c56aa7bb1ed5 to node manager d38de4071fd540f93c2dd531915c327ef877ed8b

This seems related to #5223, but that issue was unfortunately closed without a resolution, so I'm opening this one instead.

Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.7.2

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

I'm running on two local machines with Ubuntu 16.04 and inside a Docker container.

Steps followed:

  1. Disabled any firewall
  2. Opened all ports for the Docker container
  3. Ran `ray start --head --redis-port=6379 --redis-shard-ports=6380 --node-manager-port=12345 --object-manager-port=12346 --resources='{"Driver": 1.0}' --num-cpus=0` on the head, and `ray start --redis-address=<IP_OF_HEAD>:6379 --node-manager-port=12345 --object-manager-port=12346 --resources='{"Node": 1.0}'` on the node
  4. Verified (with telnet) all ports above are accessible from master->node AND node->master
  5. Inside the head's Python interpreter, I ran:

    import ray
    import time

    ray.init(redis_address="10.67.0.201:6379")

    ray.cluster_resources()


gives:
`{'GPU': 2.0, 'Driver': 1.0, 'CPU': 12.0, 'Node': 1.0}`

`ray.nodes()`
gives:

[ {'ClientID': 'e98dd25ed961a708adf90566c20abd2e77b4deb5', 'EntryType': 0, 'NodeManagerAddress': '10.67.0.201', 'NodeManagerPort': 12345, 'ObjectManagerPort': 12346, 'ObjectStoreSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/raylet', 'Resources': {'GPU': 1.0, 'Driver': 1.0}},

{'ClientID': 'd38de4071fd540f93c2dd531915c327ef877ed8b', 'EntryType': 1, 'NodeManagerAddress': '10.67.0.163', 'NodeManagerPort': 12345, 'ObjectManagerPort': 12346, 'ObjectStoreSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/raylet', 'Resources': {'CPU': 12.0, 'GPU': 1.0, 'Node': 1.0}}]
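A sanity check worth running at this point is to poll `ray.cluster_resources()` until both raylets have registered, so tasks aren't submitted too early. The sketch below assumes `ray.init()` has already been called as above; `wait_for_cluster` is just an illustrative helper name, not part of the original report.

```python
import time

import ray

# Sketch: block until the worker's resources (12 CPUs and the custom
# "Node" resource from step 3) show up in the cluster view, so tasks
# aren't submitted before both raylets have registered.
def wait_for_cluster(expected_cpus=12.0, timeout_s=60):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resources = ray.cluster_resources()
        if resources.get("CPU", 0.0) >= expected_cpus and "Node" in resources:
            return resources
        time.sleep(1)
    raise TimeoutError("worker node never registered its resources")

print(wait_for_cluster())
```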


Then, running:

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray.services.get_node_ip_address()

    # Get a list of the IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(100)]))


This just hangs for a while, until I get:

...
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 62a3c929e6460dfe47fe94ce0acb682b is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 7f4d78d43b607006628488e069753533 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 82e8b3db13d57242e0b7112954d04c83 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID c2668d440194bc623a1b95ab848731a1 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
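Since the head was started with `--num-cpus=0`, any task that needs a CPU can only be placed on the worker, which is what makes the infeasibility message above so strange. One way to probe whether the worker raylet can execute anything at all would be to pin a task to the custom `Node` resource from step 3; this is only a diagnostic sketch (`where_am_i` is an illustrative name):

```python
import ray

# Diagnostic sketch: request the custom "Node" resource passed to
# `ray start` in step 3, which only the worker node provides.
@ray.remote(num_cpus=1, resources={"Node": 1.0})
def where_am_i():
    return ray.services.get_node_ip_address()

# On a healthy cluster this should print the worker's IP (10.67.0.163);
# otherwise it hangs or is reported as infeasible, just like above.
print(ray.get(where_am_i.remote()))
```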



If we cannot run your script, we cannot fix your issue.

- [x] I have verified my script runs in a clean environment and reproduces the issue.
- [x] I have verified the issue also occurs with the [latest wheels](https://ray.readthedocs.io/en/latest/installation.html).
roireshef commented 4 years ago

I tried reproducing this with the latest wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl

This time I ran `ray start --head` on the head node, and on the worker node I ran the `ray start` command that appeared in the head's output.

The dashboard shows both nodes as connected, along with their CPU resources. Nevertheless, when trying to schedule the `f()` method above, it only runs on the head, and the worker node's workers stay idle. I also tried a heavier load with some numpy calculations: while the head is loaded for a few minutes, the node sits idle. The Python interpreter on the head prints the following messages:

E0220 14:37:39.706605  4297 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
E0220 14:37:39.706764  4297 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
E0220 14:37:39.706926  4297 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
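The "heavier load with some numpy calculations" mentioned above could look something like the sketch below (this is a reconstruction, not the original script); returning the node IP from each task makes it obvious that everything lands on the head:

```python
import numpy as np
import ray

# Reconstruction of the "heavier numpy load" test: each task multiplies
# two random matrices and reports which node executed it.
@ray.remote
def heavy(seed):
    rng = np.random.RandomState(seed)
    a = rng.rand(1000, 1000)
    b = rng.rand(1000, 1000)
    (a @ b).sum()  # keep a CPU busy for a moment
    return ray.services.get_node_ip_address()

# With both machines participating, this set should contain two IPs; in
# the failure described above it only ever contains the head's IP.
print(set(ray.get([heavy.remote(i) for i in range(50)])))
```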
roireshef commented 4 years ago

Some more information: it seems that because the worker node runs inside a Docker container, it reports its Docker-internal IP to the head node rather than the machine's (external to Docker) IP. My current assumption is that this is the cause.

Following https://github.com/ray-project/ray/issues/5442#issuecomment-523229964, I tried using the `--node-ip-address` flag on the worker node, without any further success. When connecting to the head node and running `ray.nodes()` (as well as in the web UI), the worker node still shows up with its Docker-internal IP.
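One quick way to see exactly which address each raylet is advertising (and to spot a Docker-internal IP) is to print it straight from the `ray.nodes()` entries shown earlier; this is only an illustrative snippet:

```python
import ray

# Sketch: list the address, port and resources each raylet advertises.
# A worker running inside Docker without host networking will typically
# show a container-internal IP (e.g. 172.17.x.x) here instead of the
# machine's external 10.67.0.x address.
for node in ray.nodes():
    print(node["NodeManagerAddress"],
          node["NodeManagerPort"],
          node.get("Resources", {}))
```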

roireshef commented 4 years ago

OK, I now managed to supply the worker node with its own external IP, and I can see it in the head node's resource list. This didn't solve the problem, though. I'm still getting:

...
E0220 18:04:58.267797  1310 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
E0220 18:04:58.269181  1310 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
E0220 18:04:58.269912  1310 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
swicaksono commented 4 years ago

> OK, I now managed to supply the worker node with its own external IP, and I can see it in the head node's resource list. This didn't solve the problem, though. I'm still getting:
>
> ...
> E0220 18:04:58.267797  1310 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
> E0220 18:04:58.269181  1310 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses
> E0220 18:04:58.269912  1310 direct_task_transport.cc:146] Retrying attempt to schedule task at remote node. Error: IOError: 14: failed to connect to all addresses

I had the same problem, but I figured it out by opening the firewall on the slave (worker) nodes.

wangzelong0663 commented 4 years ago

I also have the same problem ...

deepankar27 commented 3 years ago

Has anyone resolved this yet? I'm facing the same issue.

roireshef commented 3 years ago

@deepankar27 @wangzelong0663 It is probably related to networking. Try disabling your firewall, working around your proxy, and testing for open connections with common networking tools like telnet, etc.
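If telnet isn't available inside the containers, the same reachability check can be scripted with a plain TCP connect; the hosts and ports below are placeholders taken from the setup earlier in this issue and should be replaced with your own values:

```python
import socket

# Placeholder targets: substitute your own head/worker IPs and the ports
# passed to `ray start` (Redis, node manager, object manager, ...).
targets = [
    ("10.67.0.201", 6379),   # head: Redis
    ("10.67.0.163", 12345),  # worker: node manager
    ("10.67.0.163", 12346),  # worker: object manager
]

for host, port in targets:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"{host}:{port} is reachable")
    except OSError as exc:
        print(f"{host}:{port} is NOT reachable ({exc})")
```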