ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Raylet continually exiting on worker in docker #26576

Open mdagost opened 2 years ago

mdagost commented 2 years ago

What happened + What you expected to happen

I'm running a relatively simple ray tune script using ray's docker support. The head node and the worker nodes are using a custom container that I've built off of the rayproject/ray:latest image. I am seeing a very, very large number of errors only on the worker node of the form:

(raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.

There isn't anything in the logs that helps diagnose the root cause.
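For reference, this is roughly how the agent and raylet logs can be pulled off the worker container, assuming Ray's default log directory and the ray_docker container name from the cluster yaml below:

# On the worker instance; ray_docker is the container_name from the cluster yaml.
docker exec ray_docker ls /tmp/ray/session_latest/logs/
docker exec ray_docker cat /tmp/ray/session_latest/logs/dashboard_agent.log
docker exec ray_docker tail -n 100 /tmp/ray/session_latest/logs/raylet.err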

I've tried increasing the memory of the worker instances, but that hasn't helped. When I monitor in the dashboard, nothing appears close to any memory limit, and watching the running container with docker stats shows it hovering around 15% of its memory limit before it just dies. Again, this happens over and over and seems to happen only on the worker.

Many of the configurations succeed when they run on the head node, and the failing ones succeed when they are re-run on the head node. So it seems to be something specific to the worker.

Versions / Dependencies

Containers based on rayproject/ray:latest docker image.
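The exact Python and Ray versions baked into the image can be checked on a running node with something like the following (ray_docker is the container_name from the cluster yaml below):

docker exec ray_docker python --version
docker exec ray_docker python -c "import ray; print(ray.__version__)"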

Reproduction script

Here is the cluster configuration:

cluster_name: ray-cluster
min_workers: 1
max_workers: 1

provider:
   type: aws
   region: us-east-1
   availability_zone: us-east-1b
   use_internal_ips: true
   cache_stopped_nodes: false

auth:
   ssh_user: ubuntu
   ssh_private_key: ***

docker:
   head_image: ***.dkr.ecr.us-east-1.amazonaws.com/train/train:latest-ray-cpu
   worker_image: ***.dkr.ecr.us-east-1.amazonaws.com/train/train:latest-ray-cpu
   container_name: ray_docker

available_node_types:
   ray_head:
      node_config:
         InstanceType: r5.xlarge
         ImageId: ami-077fb40eebcc23898
         KeyName: ***
         SecurityGroupIds: [***]
         SubnetIds: [subnet-***]
         BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                 VolumeSize: 200
      resources: {"CPU": 4}

   ray_cpu_worker:
      node_config:
         InstanceType: r5.xlarge 
         ImageId: ami-077fb40eebcc23898
         KeyName: ***
         SecurityGroupIds: [sg-***]
         SubnetIds: [subnet-***]
         BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                 VolumeSize: 200
         IamInstanceProfile:
            Arn: arn:aws:iam::115625852156:instance-profile/Ray_ECR_ReadWriteAccess
      resources: {"CPU": 4}
      min_workers: 1
      max_workers: 1

initialization_commands:
   - aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ***.dkr.ecr.us-east-1.amazonaws.com

head_node_type: ray_head

file_mounts: {
   "/data/": "/local/ray/data/"
}

file_mounts_sync_continuously: True
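The cluster is brought up with the standard launcher commands; the yaml file name here is just illustrative:

ray up -y cluster_docker.yaml
# Tail the autoscaler output while the worker node comes up:
ray monitor cluster_docker.yaml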

The script looks like the following:

import data
import ray
import time

from ray import tune
from ray.tune.suggest.optuna import OptunaSearch

...

def train_vectors(config, dataset=None):
   import data
   import model
   import entity_vectors
   import random
   import time

   print("training vectors...")
   print(f"we have {len(dataset)} rows of data...") 
   score, _ = entity_vectors.train_vectors(dataset, model_config['vector_model'])
   tune.report(ndgc=score)

ray.init(address="auto")

print("loading vector data...")
dataset = data.load_vector_data(
   start_date=model_config['data']['start_date'],
   query_limit=model_config['data']['query_limit'],
   use_cache=args.use_cache,
   data_path=args.pageview_data_path
)

params = {
   "min_count": tune.randint(1, 50),
   "window": tune.randint(5, 50),
   "epochs": tune.randint(5, 50),
   "negative": tune.randint(5, 50),
   "ns_exponent": tune.uniform(-0.99, 0.99),
   "sample": tune.uniform(1e-05, 0.1),
   "vector_size": tune.randint(50, 350),
   "workers": 1
}

print("starting to tune...")
start = time.time()
analysis = tune.run(
    tune.with_parameters(train_vectors, dataset=dataset),
    config=params,
    metric="ndgc",
    mode="max",
    search_alg=OptunaSearch(),
    num_samples=20,
    local_dir="/data/ray_results",
    max_failures=10,
    resources_per_trial={"cpu": 4, "gpu": 0})
taken = time.time() - start
print(f"Time taken: {taken:.2f} seconds.")
print(f"Best config: {analysis.best_config}")

Issue Severity

High: It blocks me from completing my task.

mdagost commented 2 years ago

For posterity, I was able to confirm that this is indeed docker-related. I launched a new cluster with the same code and dependencies, but not running inside docker; the cluster yaml is below. I see zero raylet failures in that mode.

cluster_name: ray-cluster
min_workers: 1
max_workers: 1

provider:
   type: aws
   region: us-east-1
   availability_zone: us-east-1b
   use_internal_ips: true
   cache_stopped_nodes: false

auth:
   ssh_user: ubuntu
   ssh_private_key: /path/to/.ssh/****.pem

available_node_types:
   ray_head:
      node_config:
         InstanceType: m5.2xlarge
         ImageId: ami-077fb40eebcc23898
         KeyName: ****
         SecurityGroupIds: [sg-****]
         SubnetIds: [subnet-****]
         BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                 VolumeSize: 200
         IamInstanceProfile:
            Arn: arn:aws:iam::****:instance-profile/ray-head-node
      resources: {"CPU": 8}

   ray_cpu_worker:
      node_config:
         InstanceType: m5.2xlarge
         ImageId: ami-077fb40eebcc23898
         KeyName: ****
         SecurityGroupIds: [sg-****]
         SubnetIds: [subnet-****]
         BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                 VolumeSize: 200
         IamInstanceProfile:
            Arn: arn:aws:iam::****:instance-profile/ray-worker-node
      resources: {"CPU": 8}
      min_workers: 1
      max_workers: 1

setup_commands:
   - mkdir -p /home/ubuntu/app/data
   - "if [ ! -f /home/ubuntu/app/data/dataset.csv ]; then aws s3 cp s3://****/dataset.csv /home/ubuntu/app/data/; fi"
   - "if [ ! -d /home/ubuntu/.pyenv ]; then curl https://pyenv.run | bash; fi"
   - echo export PYENV_ROOT="$HOME/.pyenv" >> /home/ubuntu/.bashrc
   - echo 'command -v pyenv >/dev/null || export PATH=$PYENV_ROOT/bin:$PATH' >> /home/ubuntu/.bashrc
   - echo 'eval "$(pyenv init -)"' >> /home/ubuntu/.bashrc
   - echo 'eval "$(pyenv virtualenv-init -)"' >> /home/ubuntu/.bashrc
   - sudo flock /var/lib/apt/daily_lock apt-get install -y libreadline-dev libbz2-dev libssl-dev sqlite3 libsqlite3-dev libffi-dev
   - "if [ ! -d /home/ubuntu/.pyenv/versions/3.7.7 ]; then pyenv install 3.7.7; fi"
   - echo export PATH=/home/ubuntu/.pyenv/versions/3.7.7/bin:\$PATH >> /home/ubuntu/.bashrc
   - cp /home/ubuntu/app/data.py /home/ubuntu
   - cp /home/ubuntu/app/model.py /home/ubuntu
   - cp /home/ubuntu/app/entity_vectors.py /home/ubuntu
   - pip install --upgrade pip
   - pip install -r /home/ubuntu/app/requirements.ray.cpu.txt
   - pip install ray ray[tune]

head_node_type: ray_head

file_mounts: {
   "/home/ubuntu/app/": "/path/to/code"
}

file_mounts_sync_continuously: True
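For completeness, the "zero raylet failures" claim can be double-checked with a quick grep of the Ray session logs on each node (head via ray attach, worker over ssh), assuming the default log directory:

# Prints 0 on both nodes when the agent has never died.
grep -ri "the ray agent failed" /tmp/ray/session_latest/logs/ | wc -l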