mdagost opened 2 years ago
For posterity, I was able to track down that this must indeed be Docker-related. I launched a new cluster with the same code and dependencies, but not running inside Docker; the cluster YAML is below. I get zero raylet failures running in that mode.
cluster_name: ray-cluster

min_workers: 1
max_workers: 1

provider:
  type: aws
  region: us-east-1
  availability_zone: us-east-1b
  use_internal_ips: true
  cache_stopped_nodes: false

auth:
  ssh_user: ubuntu
  ssh_private_key: /path/to/.ssh/****.pem

available_node_types:
  ray_head:
    node_config:
      InstanceType: m5.2xlarge
      ImageId: ami-077fb40eebcc23898
      KeyName: ****
      SecurityGroupIds: [sg-****]
      SubnetIds: [subnet-****]
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 200
      IamInstanceProfile:
        Arn: arn:aws:iam::****:instance-profile/ray-head-node
    resources: {"CPU": 8}
  ray_cpu_worker:
    node_config:
      InstanceType: m5.2xlarge
      ImageId: ami-077fb40eebcc23898
      KeyName: ****
      SecurityGroupIds: [sg-****]
      SubnetIds: [subnet-****]
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 200
      IamInstanceProfile:
        Arn: arn:aws:iam::****:instance-profile/ray-worker-node
    resources: {"CPU": 8}
    min_workers: 1
    max_workers: 1

setup_commands:
  - mkdir -p /home/ubuntu/app/data
  - "if [ ! -f /home/ubuntu/app/data/dataset.csv ]; then aws s3 cp s3://****/dataset.csv /home/ubuntu/app/data/; fi"
  - "if [ ! -d /home/ubuntu/.pyenv ]; then curl https://pyenv.run | bash; fi"
  - echo export PYENV_ROOT="$HOME/.pyenv" >> /home/ubuntu/.bashrc
  - echo 'command -v pyenv >/dev/null || export PATH=$PYENV_ROOT/bin:$PATH' >> /home/ubuntu/.bashrc
  - echo 'eval "$(pyenv init -)"' >> /home/ubuntu/.bashrc
  - echo 'eval "$(pyenv virtualenv-init -)"' >> /home/ubuntu/.bashrc
  - sudo flock /var/lib/apt/daily_lock apt-get install -y libreadline-dev libbz2-dev libssl-dev sqlite3 libsqlite3-dev libffi-dev
  - "if [ ! -d /home/ubuntu/.pyenv/versions/3.7.7 ]; then pyenv install 3.7.7; fi"
  - echo export PATH=/home/ubuntu/.pyenv/versions/3.7.7/bin:\$PATH >> /home/ubuntu/.bashrc
  - cp /home/ubuntu/app/data.py /home/ubuntu
  - cp /home/ubuntu/app/model.py /home/ubuntu
  - cp /home/ubuntu/app/entity_vectors.py /home/ubuntu
  - pip install --upgrade pip
  - pip install -r /home/ubuntu/app/requirements.ray.cpu.txt
  - pip install ray ray[tune]

head_node_type: ray_head

file_mounts: {
  "/home/ubuntu/app/": "/path/to/code"
}

file_mounts_sync_continuously: True
What happened + What you expected to happen
I'm running a relatively simple Ray Tune script using Ray's Docker support. The head node and the worker nodes use a custom container that I've built off of the rayproject/ray:latest image. I am seeing a very, very large number of raylet failure errors, but only on the worker node, and there isn't anything in the logs that is helpful in diagnosing them.

I've tried increasing the memory size of the workers, but that isn't helping. When I monitor in the dashboard, nothing appears close to any memory limits, and I've watched the running containers with docker stats: the container stays around 15% of the memory limit before it just dies. Again, this happens over and over, and it seems to happen only on the worker. Many of the configurations succeed when they run on the head node, and failing ones succeed when they are re-run on the head node, so it seems to be something about the worker specifically.
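For reference, node liveness and advertised resources can also be queried from the driver with something like the sketch below (illustrative only, not what I ran; the exact fields returned by ray.nodes() vary a bit across Ray versions):

import ray

# Connect to the running cluster from the head node.
ray.init(address="auto")

# List every node the head knows about, whether Ray considers it alive,
# and the resources it advertised when it registered.
for node in ray.nodes():
    resources = node.get("Resources", {})
    print(
        node.get("NodeManagerAddress"),
        "alive" if node.get("Alive") else "dead",
        "CPU:", resources.get("CPU"),
        "memory:", resources.get("memory"),
    )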
Versions / Dependencies
Containers based on the rayproject/ray:latest Docker image.

Reproduction script
Here is the cluster configuration:
The script is like the following:
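(Minimal stand-in sketch for a Ray Tune run of this kind; the trainable and search space here are placeholders, not the original code.)

import ray
from ray import tune


def trainable(config):
    # Placeholder objective; the real script presumably trains a model
    # on the dataset mounted under /home/ubuntu/app/data.
    score = config["lr"] * config["batch_size"]
    tune.report(score=score)


ray.init(address="auto")

tune.run(
    trainable,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=20,
    resources_per_trial={"cpu": 1},
)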
Issue Severity
High: It blocks me from completing my task.