ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com

Deploying RayLLM locally failed with exit code 0 even if deployment is ready #73

Open · lamhoangtung opened this issue 10 months ago

lamhoangtung commented 10 months ago

Hi, I'm trying to deploy meta-llama--Llama-2-7b-chat-hf.yaml using the instructions provided in the README. The deployment seems to work, but just when everything is about to be ready, the process exits without any error:

(base) ray@35cf69569a48:~/models/continuous_batching$ aviary run --model ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
[WARNING 2023-10-16 09:04:22,790] api.py: 382  DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead.
[INFO 2023-10-16 09:04:24,848] accelerator.py: 171  Failed to detect number of TPUs: [Errno 2] No such file or directory: '/dev/vfio'
2023-10-16 09:04:24,987 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[INFO 2023-10-16 09:04:26,208] api.py: 148  Nothing to shut down. There's no Serve application running on this Ray cluster.
[INFO 2023-10-16 09:04:26,269] deployment_base_client.py: 28  Initialized with base handles {'meta-llama/Llama-2-7b-chat-hf': <ray.serve.deployment.Application object at 0x7f1a8e5a94c0>}
/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/api.py:519: UserWarning: Specifying host and port in `serve.run` is deprecated and will be removed in a future version. To specify custom HTTP options, use `serve.start`.
  warnings.warn(
(HTTPProxyActor pid=22159) INFO 2023-10-16 09:04:28,523 http_proxy 172.17.0.2 http_proxy.py:1428 - Proxy actor 69fb321f9360031e80d6562c01000000 starting on node 82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25.
[INFO 2023-10-16 09:04:28,555] api.py: 328  Started detached Serve instance in namespace "serve".
(HTTPProxyActor pid=22159) INFO 2023-10-16 09:04:28,530 http_proxy 172.17.0.2 http_proxy.py:1612 - Starting HTTP server on node: 82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25 listening on port 8000
(HTTPProxyActor pid=22159) INFO:     Started server process [22159]
(ServeController pid=22117) INFO 2023-10-16 09:04:28,689 controller 22117 deployment_state.py:1390 - Deploying new version of deployment VLLMDeployment:meta-llama--Llama-2-7b-chat-hf in application 'router'.
(ServeController pid=22117) INFO 2023-10-16 09:04:28,690 controller 22117 deployment_state.py:1390 - Deploying new version of deployment Router in application 'router'.
(ServeController pid=22117) INFO 2023-10-16 09:04:28,793 controller 22117 deployment_state.py:1679 - Adding 1 replica to deployment VLLMDeployment:meta-llama--Llama-2-7b-chat-hf in application 'router'.
(ServeController pid=22117) INFO 2023-10-16 09:04:28,796 controller 22117 deployment_state.py:1679 - Adding 2 replicas to deployment Router in application 'router'.
(ServeReplica:router:Router pid=22202) [WARNING 2023-10-16 09:04:32,739] api.py: 382  DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead.
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:32,808] vllm_models.py: 201  Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f10b18d9040> PlacementGroupID(371dfe1112ca6705f22ac50c828201000000). {'placement_group_id': '371dfe1112ca6705f22ac50c828201000000', 'name': 'SERVE_REPLICA::router#VLLMDeployment:meta-llama--Llama-2-7b-chat-hf#mZlJZj', 'bundles': {0: {'CPU': 1.0}, 1: {'CPU': 4.0, 'GPU': 1.0}}, 'bundles_to_node_id': {0: '82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25', 1: '82bb99213668fbebb2628af8b81ae804a43ca3d4e585ba5d93259e25'}, 'strategy': 'STRICT_PACK', 'state': 'CREATED', 'stats': {'end_to_end_creation_latency_ms': 1.814, 'scheduling_latency_ms': 1.728, 'scheduling_attempt': 1, 'highest_retry_delay_ms': 0.0, 'scheduling_state': 'FINISHED'}}
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:32,809] vllm_models.py: 204  Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f10b18d9040>
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:32,809] vllm_node_initializer.py: 38  Starting initialize_node tasks on the workers and local node...
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) [INFO 2023-10-16 09:04:37,474] vllm_node_initializer.py: 53  Finished initialize_node tasks.
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) INFO 10-16 09:04:37 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', tokenizer_mode=auto, revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) INFO 10-16 09:04:37 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(ServeReplica:router:VLLMDeployment:meta-llama--Llama-2-7b-chat-hf pid=22201) INFO 10-16 09:04:51 llm_engine.py:205] # GPU blocks: 1014, # CPU blocks: 512
[INFO 2023-10-16 09:04:53,741] client.py: 581  Deployment 'VLLMDeployment:meta-llama--Llama-2-7b-chat-hf:biUfsX' is ready. component=serve deployment=VLLMDeployment:meta-llama--Llama-2-7b-chat-hf
[INFO 2023-10-16 09:04:53,741] client.py: 581  Deployment 'Router:QHkGZE' is ready at `http://0.0.0.0:8000/`. component=serve deployment=Router
(pid=22359) [WARNING 2023-10-16 09:04:37,030] api.py: 382  DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(base) ray@35cf69569a48:~/models/continuous_batching$ echo $?
0
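
For what it's worth, the log does show "Started detached Serve instance in namespace "serve"", so after the CLI returns I can at least check whether anything is still serving. This is the kind of check I mean (serve status is the standard Ray Serve CLI; the /v1/models route is only assumed from RayLLM's OpenAI-compatible API and may differ between versions):

# Check whether the detached Serve app and its deployments are still alive
serve status

# If the Router is up, the OpenAI-compatible endpoint should answer on port 8000
# (the exact route is an assumption; adjust to whatever your RayLLM version exposes)
curl http://127.0.0.1:8000/v1/models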

Note that I modified the config so it can run on my custom machine with 8 CPU cores, 32 GB of RAM, and an NVIDIA L4 GPU:

(base) ray@35cf69569a48:~/models/continuous_batching$ cat ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 1
    target_num_ongoing_requests_per_replica: 24
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 0.5
    downscale_delay_s: 300.0
    upscale_delay_s: 15.0
  max_concurrent_queries: 64
  ray_actor_options:
    resources:
      accelerator_type_a10: 0
engine_config:
  model_id: meta-llama/Llama-2-7b-chat-hf
  hf_model_id: meta-llama/Llama-2-7b-chat-hf
  type: VLLMEngine
  engine_kwargs:
    trust_remote_code: true
    max_num_batched_tokens: 4096
    max_num_seqs: 64
    gpu_memory_utilization: 0.95
  max_total_tokens: 4096
  generation:
    prompt_format:
      system: "<<SYS>>\n{instruction}\n<</SYS>>\n\n"
      assistant: " {instruction} </s><s> "
      trailing_assistant: " "
      user: "[INST] {system}{instruction} [/INST]"
      system_in_user: true
      default_system_message: ""
    stopping_sequences: ["<unk>"]
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  num_cpus_per_worker: 4
  placement_strategy: "STRICT_PACK"
  resources_per_worker:
    accelerator_type_a10: 0
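
Since my machine has an L4 rather than an A10, I set the accelerator_type_a10 resource to 0 in both places so scheduling doesn't wait on a custom resource my node never advertises. As a sanity check of which resources the local node actually reports, the plain Ray CLI is enough (nothing RayLLM-specific here):

# List cluster resources, including GPU count and any accelerator_type_* custom resources
ray status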

I also confirmed that my machine can run meta-llama/Llama-2-7b-chat-hf using pure vLLM, and the RayLLM logs seem to confirm that the model can be loaded, so why does it keep exiting? Am I doing anything wrong here?

Thank you for looking into this.

akshay-anyscale commented 10 months ago

Hi @lamhoangtung, can you try using the serve run command instead? You can refer to the README here for example usage: https://github.com/ray-project/ray-llm
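
For example, the README deploys via a Serve config file that wraps the model YAML, roughly like this (the path and file name below are illustrative; check the serve_configs directory in the repo for the actual files):

# serve run blocks and streams logs instead of returning once the app is deployed
serve run serve_configs/meta-llama--Llama-2-7b-chat-hf.yaml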