ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0

Ray dashboard doesn't connect with image: "anyscale/aviary:latest" #23

Closed: mahaddad closed this issue 1 year ago

mahaddad commented 1 year ago

Thanks for the great project!

With the exact same setup and steps, the Ray dashboard connects when using the image anyscale/aviary:0.1.0-a98a94c5005525545b9ea0a5b0b7b22f25f322d7-tgi

Happy to provide logs if you let me know which ones would be helpful and how to generate them.
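
For reference, this is roughly how I bring the cluster up and open the dashboard; the cluster config filename below is a placeholder for my local copy of the example config:

# Launch the cluster from the EC2 cluster config (filename is a placeholder).
ray up aviary-cluster.yaml
# Port-forward the Ray dashboard from the head node to localhost.
ray dashboard aviary-cluster.yaml
# Attach a shell on the head node, e.g. to collect logs.
ray attach aviary-cluster.yaml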

Appreciate you guys taking a look at this.

Yard1 commented 1 year ago

Thanks, we'll look into that! Is this a problem with both images or just the tgi image?

mahaddad commented 1 year ago

I have only tried anyscale/aviary:latest.

I also tried anyscale/aviary:0.0.2-ac62571102ddd7d588da27c2aaff6e0454af8c61 and that worked.

Yard1 commented 1 year ago

Thanks, we should have a fix shortly.

mahaddad commented 1 year ago

Thanks, @Yard1! Assuming the fix is a new Docker image, how can I automatically find out when aviary:latest has been updated? Also happy to test and let you know if it's working on my end.

Yard1 commented 1 year ago

@mahaddad we'll make a new release in the GitHub repo!
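
In the meantime, one rough way to notice when the :latest tag has been re-pushed is to compare image digests locally (just a sketch, not an official release channel):

# Pull the current :latest and print its digest; re-run later and compare.
docker pull anyscale/aviary:latest
docker inspect --format '{{index .RepoDigests 0}}' anyscale/aviary:latest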

Yard1 commented 1 year ago

@mahaddad I've uploaded a test image, anyscale/aviary:test. Can you see if that works for you?

mahaddad commented 1 year ago

Ray dashboard is now working as expected. Thank you!

However, when I run aviary run --model ./models/static_batching/mosaicml--mpt-7b-instruct.yaml, the model never successfully deploys. I was able to get this model to deploy on the other image I mentioned in my original post.
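
In case it helps with debugging, this is roughly what I'm checking on the head node while the deployment hangs (commands may vary slightly with the Ray version in the image):

# Cluster-level view: are the requested GPU workers coming up?
ray status
# Serve-level view: state of the deployments.
serve status
# Raw Serve controller/replica logs on the head node.
ls /tmp/ray/session_latest/logs/serve/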

mahaddad commented 1 year ago

[screenshot attached: RAY-SERVE-debug-screencap]

Yard1 commented 1 year ago

Will take a look

Yard1 commented 1 year ago

@mahaddad I have updated the test docker image. Can you try again with the following EC2 config?

# A unique identifier for the head node and workers of this cluster.
cluster_name: aviary-deploy

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    cache_stopped_nodes: False
docker:
    image: "anyscale/aviary:test"
    # Use this image instead for continuous batching:
    # image: "anyscale/aviary:latest-tgi"
    container_name: "aviary"
    run_options:
      - --entrypoint ""

setup_commands:
  - which ray || pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"

worker_start_ray_commands:
    - ray stop
    # We need to make sure RAY_HEAD_IP env var is accessible.
    - export RAY_HEAD_IP && echo "export RAY_HEAD_IP=$RAY_HEAD_IP" >> ~/.bashrc && ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

available_node_types:
  head_node_type:
    node_config:
      InstanceType: m5.xlarge
      BlockDeviceMappings: &mount
      - DeviceName: /dev/sda1
        Ebs:
            VolumeSize: 256
            VolumeType: gp3
    resources:
      head_node: 1
      instance_type_m5: 1
  gpu_worker_g5:
    node_config:
      InstanceType: g5.12xlarge
      BlockDeviceMappings: *mount
    resources:
      worker_node: 1
      instance_type_g5: 1
      accelerator_type_a10: 1
    min_workers: 0
    max_workers: 8
  gpu_worker_p3:
    node_config:
      InstanceType: p3.8xlarge
      BlockDeviceMappings: *mount
    resources:
      worker_node: 1
      instance_type_p3: 1
      accelerator_type_v100: 1
    min_workers: 0
    max_workers: 4
  cpu_worker:
    node_config:
      InstanceType: m5.xlarge
      BlockDeviceMappings: *mount
    resources:
      worker_node: 1
      instance_type_m5: 1
      accelerator_type_cpu: 1
    min_workers: 0
    max_workers: 16
head_node_type: head_node_type
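
A sketch of how I'd expect this config to be used end to end (the config filename is a placeholder; the model path is the one from your earlier command):

# Save the config above, e.g. as aviary-cluster.yaml, then:
ray up aviary-cluster.yaml
# Attach to the head node and start the model inside the aviary container.
ray attach aviary-cluster.yaml
aviary run --model ./models/static_batching/mosaicml--mpt-7b-instruct.yaml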

mahaddad commented 1 year ago

This worked. Thank you for the quick turn around on this! Out of curiosity, what was the root cause and fix?

Yard1 commented 1 year ago

One of the dependencies required for the dashboard somehow became missing when the Dockerfile was updated (most likely due to changes to the conda setup). The model was then failing because it was unable to download files from our S3 mirror, due to an update to boto3/awscli that requires an additional argument.

All of those changes will be reflected on master today, and the :latest Docker image will be updated.
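
If you want to sanity check from your side, a rough way to confirm the S3 mirror is reachable from inside the aviary container (bucket and prefix below are placeholders, not the actual mirror path, and --no-sign-request is only an example of the kind of extra argument involved):

# Check which boto3/awscli versions ship in the image.
pip show boto3 awscli
# Try an anonymous listing against a public bucket.
aws s3 ls s3://<mirror-bucket>/<model-prefix>/ --no-sign-request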

Yard1 commented 1 year ago

Fixed in 0.1.1