ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Core: Worker that was launched by an autoscaling_config loops and crashes #31192

Closed. Ericxgao closed this issue 1 year ago.

Ericxgao commented 1 year ago

What happened + What you expected to happen

I have the following deploy yaml:

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# Running Ray in Docker images is optional (this docker section can be commented out).
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "rayproject/ray:latest-py39-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_latest"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --gpus 1
        - --ulimit nofile=65536:65536

provider:
    type: local
    head_ip: "129.146.99.48"
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: ["129.146.162.248"]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    #   (1) The machine on which `ray up` is executed.
    #   (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    ssh_private_key: ./kaiber_private
# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
min_workers: 1

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
max_workers: 1
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if a task requires adding more nodes, the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed * currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH. E.g. you could save your conda env to an environment.yaml file, mount
# that directory to all nodes, and call `conda env create -n my_env -f /path1/on/remote/machine/environment.yaml`. In this
# example the paths must be the same on all nodes (so that conda can always be called with the same argument).
file_mounts: {
   "~/deforum-stable-diffusion": "~/deforum-stable-diffusion",
   "~/stable-diffusion-webui": "~/stable-diffusion-webui"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands:
    - sudo usermod -aG docker $USER; sleep 10
    # - wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | I_AGREE_TO_THE_CUDNN_LICENSE=1 sh -

# List of shell commands to run to set up each node.
setup_commands: 
    - ls
    - conda install -y -c conda-forge ffmpeg
    - cd deforum-stable-diffusion; pip install -r requirements.txt
    - sudo apt-get update
    - sudo apt-get install -y libgl1 libglib2.0-0 libsm6 libxrender1 libxext6
    - cd stable-diffusion-webui/models/Stable-diffusion; wget -nc https://oksami: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@huggingface.co/prompthero/openjourney/resolve/main/mdjrny-v4.ckpt
    - cd stable-diffusion-webui/models/Stable-diffusion; wget -N https://oksami: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@huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt -O mdjrny-v4.vae.pt
    - cd stable-diffusion-webui; mkdir -p embeddings;
    - cd deforum-stable-diffusion; mkdir -p init_images;
    - cd deforum-stable-diffusion; mkdir -p output;
    - cd deforum-stable-diffusion; pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116 --force-reinstall
    - cd deforum-stable-diffusion; pip install triton==2.0.0.dev20220701; wget -nc https://github.com/TheLastBen/fast-stable-diffusion/raw/main/precompiled/A100/xformers-0.0.13.dev0-py3-none-any.whl
    - cd deforum-stable-diffusion; pip install ./xformers-0.0.13.dev0-py3-none-any.whl
    # If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
    # work environment on each worker by:
    #   1. making sure each worker has access to this file i.e. see the `file_mounts` section
    #   2. adding a command here that creates a new conda environment on each node, or updates the environment
    #     if it already exists:
    #      conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
    #
    # Ray developers:
    # you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # with a "nightly" tag (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: [
    'cd stable-diffusion-webui; nohup python launch.py --api &'
]

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --include-dashboard=true
    - cd deforum-stable-diffusion; pkill gunicorn; nohup gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000 & 
    - cd stable-diffusion-webui; nohup python launch.py --api & 

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
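
For context, I bring the cluster up with the standard cluster launcher (the filename below is just illustrative). ray up sets up the head node over SSH, and the head node in turn sets up the workers listed under worker_ips:

    ray up cluster.yaml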

When I run this, or when I manually start the head with the autoscaling config, the worker node's raylet crash-loops with the following error:

*** SIGSEGV received at time=1671481698 on cpu 0 ***
PC: @     0x559d299b6bcf  (unknown)  boost::asio::detail::epoll_reactor::start_op()
    @     0x7fa2ef391420       3184  (unknown)
    @     0x559d299b85b3         64  boost::asio::detail::reactive_socket_service_base::start_accept_op()
    @     0x559d2930ed60        112  plasma::PlasmaStore::DoAccept()
    @     0x559d2930eaca        368  boost::asio::detail::reactive_socket_accept_op<>::do_complete()
    @     0x559d299babcb        128  boost::asio::detail::scheduler::do_run_one()
    @     0x559d299bc391        192  boost::asio::detail::scheduler::run()
    @     0x559d299bc5c0         64  boost::asio::io_context::run()
    @     0x559d2930b218        384  plasma::PlasmaStoreRunner::Start()
    @     0x559d292a6375        208  std::thread::_State_impl<>::_M_run()
    @     0x559d29a0b2d0  (unknown)  execute_native_thread_routine
    @ ... and at least 3 more frames
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361: *** SIGSEGV received at time=1671481698 on cpu 0 ***
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361: PC: @     0x559d299b6bcf  (unknown)  boost::asio::detail::epoll_reactor::start_op()
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x7fa2ef391420       3184  (unknown)
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x559d299b85b3         64  boost::asio::detail::reactive_socket_service_base::start_accept_op()
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x559d2930ed60        112  plasma::PlasmaStore::DoAccept()
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x559d2930eaca        368  boost::asio::detail::reactive_socket_accept_op<>::do_complete()
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x559d299babcb        128  boost::asio::detail::scheduler::do_run_one()
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x559d299bc391        192  boost::asio::detail::scheduler::run()
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x559d299bc5c0         64  boost::asio::io_context::run()
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x559d2930b218        384  plasma::PlasmaStoreRunner::Start()
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x559d292a6375        208  std::thread::_State_impl<>::_M_run()
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @     0x559d29a0b2d0  (unknown)  execute_native_thread_routine
[2022-12-19 12:28:18,761 E 5877 5901] (raylet) logging.cc:361:     @ ... and at least 3 more frames

Everything works if I start the servers manually and run ray start without specifying an autoscaling config; however, this is quite tedious when I want to scale up to more servers.
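
For reference, this is roughly the manual sequence that does work (the same commands as in the config above, just without --autoscaling-config; the head IP is the one from the provider section):

    # on the head node
    ray stop
    ray start --head --port=6379

    # on each worker node
    ray stop
    ray start --address=129.146.99.48:6379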

Versions / Dependencies

Ray 2.1.0
Python 3.9.12
Ubuntu 20.04
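
These were collected with the usual version checks, e.g.:

    ray --version     # 2.1.0
    python --version  # Python 3.9.12
    lsb_release -d    # Ubuntu 20.04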

Reproduction script

ray bootstrap config:

{"cluster_name": "default", "auth": {"ssh_user": "ubuntu", "ssh_private_key": "~/ray_bootstrap_key.pem"}, "upscaling_speed": 1.0, "idle_timeout_minutes": 5, "docker": {"image": "rayproject/ray:latest-py39-gpu", "container_name": "ray_latest", "pull_before_run": true, "run_options": ["--gpus 1", "--ulimit nofile=65536:65536"]}, "initialization_commands": ["sudo usermod -aG docker $USER; sleep 10"], "setup_commands": ["ls", "conda install -y -c conda-forge ffmpeg", "cd deforum-stable-diffusion; pip install -r requirements.txt", "sudo apt-get update", "sudo apt-get install -y libgl1 libglib2.0-0 libsm6 libxrender1 libxext6", "cd stable-diffusion-webui/models/Stable-diffusion; wget -nc https://oksami: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@huggingface.co/prompthero/openjourney/resolve/main/mdjrny-v4.ckpt", "cd stable-diffusion-webui/models/Stable-diffusion; wget -N https://oksami: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@huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt -O mdjrny-v4.vae.pt", "cd stable-diffusion-webui; mkdir -p embeddings;", "cd deforum-stable-diffusion; mkdir -p init_images;", "cd deforum-stable-diffusion; mkdir -p output;", "cd deforum-stable-diffusion; pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116 --force-reinstall", "cd deforum-stable-diffusion; pip install triton==2.0.0.dev20220701; wget -nc https://github.com/TheLastBen/fast-stable-diffusion/raw/main/precompiled/A100/xformers-0.0.13.dev0-py3-none-any.whl", "cd deforum-stable-diffusion; pip install ./xformers-0.0.13.dev0-py3-none-any.whl"], "head_setup_commands": ["ls", "conda install -y -c conda-forge ffmpeg", "cd deforum-stable-diffusion; pip install -r requirements.txt", "sudo apt-get update", "sudo apt-get install -y libgl1 libglib2.0-0 libsm6 libxrender1 libxext6", "cd stable-diffusion-webui/models/Stable-diffusion; wget -nc https://oksami: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@huggingface.co/prompthero/openjourney/resolve/main/mdjrny-v4.ckpt", "cd stable-diffusion-webui/models/Stable-diffusion; wget -N https://oksami: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@huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt -O mdjrny-v4.vae.pt", "cd stable-diffusion-webui; mkdir -p embeddings;", "cd deforum-stable-diffusion; mkdir -p init_images;", "cd deforum-stable-diffusion; mkdir -p output;", "cd deforum-stable-diffusion; pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116 --force-reinstall", "cd deforum-stable-diffusion; pip install triton==2.0.0.dev20220701; wget -nc https://github.com/TheLastBen/fast-stable-diffusion/raw/main/precompiled/A100/xformers-0.0.13.dev0-py3-none-any.whl", "cd deforum-stable-diffusion; pip install ./xformers-0.0.13.dev0-py3-none-any.whl"], "worker_setup_commands": ["ls", "conda install -y -c conda-forge ffmpeg", "cd deforum-stable-diffusion; pip install -r requirements.txt", "sudo apt-get update", "sudo apt-get install -y libgl1 libglib2.0-0 libsm6 libxrender1 libxext6", "cd stable-diffusion-webui/models/Stable-diffusion; wget -nc https://oksami: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@huggingface.co/prompthero/openjourney/resolve/main/mdjrny-v4.ckpt", "cd stable-diffusion-webui/models/Stable-diffusion; wget -N https://oksami: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@huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt -O mdjrny-v4.vae.pt", "cd 
stable-diffusion-webui; mkdir -p embeddings;", "cd deforum-stable-diffusion; mkdir -p init_images;", "cd deforum-stable-diffusion; mkdir -p output;", "cd deforum-stable-diffusion; pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116 --force-reinstall", "cd deforum-stable-diffusion; pip install triton==2.0.0.dev20220701; wget -nc https://github.com/TheLastBen/fast-stable-diffusion/raw/main/precompiled/A100/xformers-0.0.13.dev0-py3-none-any.whl", "cd deforum-stable-diffusion; pip install ./xformers-0.0.13.dev0-py3-none-any.whl", "cd stable-diffusion-webui; nohup python launch.py --api &"], "head_start_ray_commands": ["ray stop", "ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --include-dashboard=true", "cd deforum-stable-diffusion; pkill gunicorn; nohup gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000 &", "cd stable-diffusion-webui; nohup python launch.py --api &"], "worker_start_ray_commands": ["ray stop", "ray start --address=$RAY_HEAD_IP:6379"], "file_mounts": {"~/deforum-stable-diffusion": "~/deforum-stable-diffusion", "~/stable-diffusion-webui": "~/stable-diffusion-webui"}, "cluster_synced_files": [], "file_mounts_sync_continuously": false, "rsync_exclude": ["**/.git", "**/.git/**"], "rsync_filter": [".gitignore"], "provider": {"type": "local", "head_ip": "129.146.99.48", "worker_ips": ["129.146.162.248"]}, "max_workers": 1, "available_node_types": {"local.cluster.node": {"node_config": {}, "resources": {}, "min_workers": 1, "max_workers": 1}}, "head_node_type": "local.cluster.node", "no_restart": false}
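
The blob above is the bootstrap config that the launcher writes to ~/ray_bootstrap_config.yaml on the head node. Assuming it contains the JSON shown here, it can be pretty-printed for easier reading with:

    python -m json.tool ~/ray_bootstrap_config.yaml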

Issue Severity

High: It blocks me from completing my task.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

stale[bot] commented 1 year ago

Hi again! This issue is being closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public Slack channel.

Thanks again for opening the issue!