ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.05k stars 5.59k forks source link

[Task distribution] My tasks doesn't seem to distribute to all my workers #33802

Open maurhub opened 1 year ago

maurhub commented 1 year ago

What happened + What you expected to happen

Hey guys!

I'm starting to use Ray over AWS, did a super quick test, configured my YML to use 6 nodes, but somehow my tasks are not being distributed to all nodes, but just 2.

Based on this output, it does make sense my workload is not distributed, but I properly see my 6 EC2 instances in AWS. MicrosoftTeams-image

I'm attaching my .py and .yml files:

# An unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2

min_workers: 3
max_workers: 6

available_node_types:
  ray.head.default:
      node_config:
        InstanceType: m5.large
        BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                  VolumeSize: 140
      resources: {"CPU": 2}
  ray.worker.default:
      node_config:

        InstanceType: m5.large
        InstanceMarketOptions:
            MarketType: spot
      resources: {"CPU": 2}
      min_workers: 0

# setup_commands:
#     - pip install backtrader==1.9.76.123
#     - pip install hyperopt==0.2.7
#     - pip install numpy
#     - pip install pandas
#     - pip install qgrid==1.3.1
#     - pip install scipy
#     - pip install vectorbt==0.24.5
#     - pip install yfinance

Versions / Dependencies

2.3.0

Reproduction script

from collections import Counter
import socket
import time

import ray

ray.init()

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))

Issue Severity

Medium: It is a significant difficulty but I can work around it.

zhe-thoughts commented 1 year ago

Thanks for reporting this @maurhub ! We will try to reproduce and resolve

maurhub commented 1 year ago

One thing I've noticed is if I test with Ray 3.0 in a Linux environment it creates the nodes as expected, but in a WSL environment it says it will create the nodes, but they never end appearing in the dashboard. (AWS EC2 instances are created though)

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.