ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com

Multiple models: second model always requests GPU: 1 #94

Open lynkz-matt-psaltis opened 9 months ago

lynkz-matt-psaltis commented 9 months ago

Following the instructions here (https://github.com/ray-project/ray-llm#how-do-i-deploy-multiple-models-at-once), I'm trying to host two models on a single A100 80GB.

Two bundles are generated for the placement group:

{0: {'accelerator_type:A100': 0.1, 'CPU': 1.0},
 1: {'accelerator_type:A100': 0.1, 'GPU': 1.0, 'CPU': 1.0}}

Bundle 0 is generated correctly with my configured CPU and accelerator type, but bundle 1 adds an extra GPU: 1 requirement.

If I swap the order of the models in the multi-model config, whichever model comes first always boots and the second one doesn't, presumably because of that superfluous GPU: 1 entry.

This always leads to log entries like the following for the second model:

deployment_state.py:1974 - Deployment 'VLLMDeployment:TheBloke--Mistral-7b-OpenOrca-AWQ' in application 'ray-llm' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"accelerator_type:A100": 0.1, "CPU": 1.0}, {"accelerator_type:A100": 0.1, "CPU": 1.0, "GPU": 1.0}], total resources available: {}. Use `ray status` for more details.

I'm trying to work out whether this is a bug or a misunderstanding on my part. Happy to provide further details as needed :)

So far I've tried the provided containers plus building from source.
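
In case it's useful for reproducing, the placement groups and their bundles can be inspected with plain Ray APIs (nothing RayLLM-specific), e.g.:

```python
import ray
from ray.util import placement_group_table

ray.init(address="auto")  # attach to the running cluster

# Each entry carries the same 'bundles' / 'strategy' / 'state' fields that
# show up in the RayLLM boot logs further down in this thread.
for pg_id, info in placement_group_table().items():
    print(pg_id, info["name"], info["strategy"], info["bundles"])
```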

akshay-anyscale commented 9 months ago

Can you share the model YAMLs that you are using? You'll need to set num_gpus_per_worker to 0.5 for both models.

lynkz-matt-psaltis commented 9 months ago

Thanks @akshay-anyscale. Model YAMLs attached below.

I've tried 0.25 and 0.5 for the num_gpus_per_worker value.

The setting definitely seems to be picked up in the early boot logs:

[INFO 2023-11-24 20:57:14,237] vllm_models.py: 218  Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7fc36badb910> PlacementGroupID(bd37662d255d801b76ce786c415a03000000). {'placement_group_id': 'bd37662d255d801b76ce786c415a03000000', 'name': 'SERVE_REPLICA::ray-llm#VLLMDeployment:TheBloke--Phind-CodeLlama-34B-v2-AWQ#rYXDuK', 'bundles': {0: {'CPU': 1.0, 'accelerator_type:A100': 0.01}, 1: {'GPU': 0.5, 'CPU': 1.0, 'accelerator_type:A100': 0.01}}, 'bundles_to_node_id': {0: '63b4d12933792fa0973bf31b1f02b5879bb14383b55724d0d26d275e', 1: '63b4d12933792fa0973bf31b1f02b5879bb14383b55724d0d26d275e'}, 'strategy': 'STRICT_PACK', 'state': 'CREATED', 'stats': {'end_to_end_creation_latency_ms': 2.023, 'scheduling_latency_ms': 1.922, 'scheduling_attempt': 1, 'highest_retry_delay_ms': 0.0, 'scheduling_state': 'FINISHED'}}
[INFO 2023-11-24 20:57:14,200] vllm_models.py: 218  Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7fcadfb89d10> PlacementGroupID(72db94d05c8cd40659b5da076ba203000000). {'placement_group_id': '72db94d05c8cd40659b5da076ba203000000', 'name': 'SERVE_REPLICA::ray-llm#VLLMDeployment:TheBloke--Mistral-7b-OpenOrca-AWQ#ynFfcK', 'bundles': {0: {'CPU': 1.0, 'accelerator_type:A100': 0.1}, 1: {'GPU': 0.5, 'CPU': 1.0, 'accelerator_type:A100': 0.1}}, 'bundles_to_node_id': {0: '63b4d12933792fa0973bf31b1f02b5879bb14383b55724d0d26d275e', 1: '63b4d12933792fa0973bf31b1f02b5879bb14383b55724d0d26d275e'}, 'strategy': 'STRICT_PACK', 'state': 'CREATED', 'stats': {'end_to_end_creation_latency_ms': 5.815, 'scheduling_latency_ms': 5.719, 'scheduling_attempt': 1, 'highest_retry_delay_ms': 0.0, 'scheduling_state': 'FINISHED'}}

ValueError: Cannot schedule RayWorker with the placement group because the resource request {'CPU': 0, 'GPU': 1} cannot fit into any bundles for the placement group, [{'CPU': 1.0, 'accelerator_type:A100': 0.1}, {'GPU': 0.5, 'CPU': 1.0, 'accelerator_type:A100': 0.1}]
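
For what it's worth, the same ValueError can be reproduced with a bare placement group shaped like the one above, assuming the worker actor asks for a whole GPU (num_gpus=1, as the error message suggests) while no bundle reserves more than 0.5. This is just a sketch, not RayLLM code:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# Logical resources only, so this runs on any machine.
ray.init(num_cpus=2, num_gpus=1)

# Bundles shaped like the logged placement group: no bundle holds a full GPU.
pg = placement_group([{"CPU": 1.0}, {"CPU": 1.0, "GPU": 0.5}], strategy="STRICT_PACK")
ray.get(pg.ready())

@ray.remote(num_cpus=0, num_gpus=1)  # a worker that insists on a whole GPU
class Worker:
    def ping(self):
        return "ok"

# Raises ValueError: the request {'CPU': 0, 'GPU': 1} cannot fit into any
# bundle of the placement group, matching the error above.
Worker.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
```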

deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 1
    target_num_ongoing_requests_per_replica: 24
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 0.5
    downscale_delay_s: 300.0
    upscale_delay_s: 15.0
  max_concurrent_queries: 64
  ray_actor_options:
    resources:
      accelerator_type:A100: 0.1
engine_config:
  model_id: TheBloke/Mistral-7b-OpenOrca-AWQ
  hf_model_id: /mnt/models/TheBloke/Mistral-7b-OpenOrca-AWQ
  type: VLLMEngine
  engine_kwargs:
    quantization: awq
    trust_remote_code: true
    max_num_batched_tokens: 4096
    max_num_seqs: 64
    gpu_memory_utilization: 0.95
  max_total_tokens: 4096
  generation:
    prompt_format:
      system: "<|im_start|>system\n{instruction}<|im_end|>"
      assistant: "<|im_start|>assistant {instruction} </s>"
      trailing_assistant: ""
      user: "<|im_start|>user\n{instruction}<|im_end|>"
      system_in_user: false
      default_system_message: ""
    stopping_sequences: ["<unk>"]
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 0.5
  placement_strategy: "STRICT_PACK"
  resources_per_worker:
    accelerator_type:A100: 0.1

deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 1
    target_num_ongoing_requests_per_replica: 32
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 0.5
    downscale_delay_s: 300.0
    upscale_delay_s: 10.0
  max_concurrent_queries: 128
  ray_actor_options:
    resources:
      accelerator_type:A100: 0.01
engine_config:
  model_id: TheBloke/Phind-CodeLlama-34B-v2-AWQ
  hf_model_id: /dev/shm
  type: VLLMEngine
  engine_kwargs:
    tokenizer: hf-internal-testing/llama-tokenizer
    quantization: awq
    trust_remote_code: false
    max_num_batched_tokens: 16384
    max_num_seqs: 128
    gpu_memory_utilization: 0.45
  max_total_tokens: 16384
  generation:
    prompt_format:
      system: "### System Prompt\n{instruction}\n"
      assistant: "### Assistant{instruction}</s>"
      trailing_assistant: ""
      user: "### User Message\n{instruction}\n"
      system_in_user: false
      default_system_message: ""
    stopping_sequences: ["<unk>"]
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 0.5
  placement_strategy: "STRICT_PACK"
  resources_per_worker:
    accelerator_type:A100: 0.01
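
For completeness, here's my rough reading (an assumption, not checked against the RayLLM source) of how these fields seem to map onto the two bundles in the logs above; the helper below is purely illustrative and not a RayLLM API:

```python
# bundle 0 <- the Serve replica actor (ray_actor_options.resources),
# bundle 1 <- the vLLM worker (scaling_config). Hypothetical helper, for
# illustration only.
def bundles_from_config(ray_actor_resources, num_gpus_per_worker,
                        resources_per_worker, num_workers=1):
    replica_bundle = {"CPU": 1.0, **ray_actor_resources}
    worker_bundle = {"CPU": 1.0, "GPU": num_gpus_per_worker, **resources_per_worker}
    return [replica_bundle] + [worker_bundle] * num_workers

# Mistral config above:
print(bundles_from_config({"accelerator_type:A100": 0.1}, 0.5,
                          {"accelerator_type:A100": 0.1}))
# [{'CPU': 1.0, 'accelerator_type:A100': 0.1},
#  {'CPU': 1.0, 'GPU': 0.5, 'accelerator_type:A100': 0.1}]
```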