sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[Bug] Can't access one click llms on runpod #1610

Open kovern opened 1 month ago

kovern commented 1 month ago


Describe the bug

I tried to use https://github.com/TrelisResearch/one-click-llms for Llama 3.1 70B inference on RunPod.

I've tried several models (Llama 3.1 8B, 70B, ...) on several hardware configurations (2xH100, 4xA100, ...) in several regions.

After the server started, I could see from the GPU memory saturation that the service was running. Yet I could not access the API.

I first tried from outside with the given RunPod server address, using both curl and the Python OpenAI API, with zero success. The OpenAI API raised APIConnectionError, and curl reported Connection refused.

I then changed strategy and tried locally, with a similar lack of luck: curl still failed to connect to 0.0.0.0, and the OpenAI API still got APIConnectionError.

Server logs of one of the runs:

2024-10-08T14:04:42.084138969-07:00
2024-10-08T14:04:42.084177391-07:00 ==========
2024-10-08T14:04:42.084182932-07:00 == CUDA ==
2024-10-08T14:04:42.084230141-07:00 ==========
2024-10-08T14:04:42.089361837-07:00
2024-10-08T14:04:42.089426700-07:00 CUDA Version 12.1.1
2024-10-08T14:04:42.090731427-07:00
2024-10-08T14:04:42.090738982-07:00 Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2024-10-08T14:04:42.092021788-07:00
2024-10-08T14:04:42.092023251-07:00 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2024-10-08T14:04:42.092025495-07:00 By pulling and using the container, you accept the terms and conditions of this license:
2024-10-08T14:04:42.092027369-07:00 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-10-08T14:04:42.092028901-07:00
2024-10-08T14:04:42.092029813-07:00 A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2024-10-08T14:04:42.102473897-07:00
2024-10-08T14:04:48.672137338-07:00 [14:04:48] server_args=ServerArgs(model_path='NousResearch/Meta-Llama-3.1-70B-Instruct', tokenizer_path='NousResearch/Meta-Llama-3.1-70B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=8192, quantization='fp8', served_model_name='NousResearch/Meta-Llama-3.1-70B-Instruct', chat_template=None, is_embedding=False, host='0.0.0.0', port=8000, mem_fraction_static=0.87, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=948742540, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', attention_backend='flashinfer', sampling_backend='flashinfer', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, enable_mixed_chunk=False, enable_torch_compile=False, max_torch_compile_bs=32, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, lora_paths=None, max_loras_per_batch=8)
2024-10-08T14:04:55.513847948-07:00 [14:04:55 TP1] Init nccl begin.
2024-10-08T14:04:55.574631614-07:00 [14:04:55 TP0] Init nccl begin.
2024-10-08T14:04:57.108305808-07:00 [14:04:57 TP0] Load weight begin. avail mem=78.38 GB
2024-10-08T14:04:57.108327449-07:00 [14:04:57 TP1] Load weight begin. avail mem=78.38 GB
2024-10-08T14:04:58.483092543-07:00 [14:04:58 TP1] lm_eval is not installed, GPTQ may not be usable
2024-10-08T14:04:58.671272847-07:00 [14:04:58 TP0] lm_eval is not installed, GPTQ may not be usable
2024-10-08T14:04:59.586841489-07:00 INFO 10-08 14:04:59 weight_utils.py:236] Using model weights format ['.safetensors']
2024-10-08T14:04:59.668187687-07:00 INFO 10-08 14:04:59 weight_utils.py:236] Using model weights format ['.safetensors']

The log looks fine. But when I tried the exact curl command provided in the template's help:

curl https://xxxxxxxxxxxxx-8000.proxy.runpod.net/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "NousResearch/Meta-Llama-3.1-70B-Instruct","prompt": "San Francisco is a","max_tokens": 7,"temperature": 0}'

I got error code 502, which indicates a server-side problem.
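A 502 from the RunPod proxy means the proxy itself is reachable but nothing on the pod is answering on port 8000 yet, which is consistent with the log above stopping at weight loading with no "ready" message. A minimal sketch for polling until the server actually comes up, assuming SGLang's /health endpoint and the host/port from the server_args above:

```python
# Poll until the server accepts connections. Assumes SGLang exposes
# /health and listens on 0.0.0.0:8000 as in the server_args above;
# until weight loading finishes, connections are simply refused.
import time
import requests

def wait_for_server(base_url: str = "http://127.0.0.1:8000", timeout: float = 1800.0) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # nothing listening yet (weights still loading)
        time.sleep(10)
    return False

if __name__ == "__main__":
    print("server is up" if wait_for_server() else "timed out waiting for server")
```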

I also tried the one-click vLLM versions. Only the Llama 3.1 8B model seems to work.

Reproduction

I even tried the OpenAI-compatible endpoints:

curl https://xxxxxxxxxxxx-8000.proxy.runpod.net/openai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "NousResearch/Meta-Llama-3.1-70B-Instruct", "prompt": "Say this is a test", "temperature": 0}'

curl localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "NousResearch/Meta-Llama-3.1-70B-Instruct", "prompt": "Say this is a test", "temperature": 0}'

curl http://0.0.0.0:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "NousResearch/Meta-Llama-3.1-70B-Instruct", "prompt": "Say this is a test", "temperature": 0}'

client=OpenAI(base_url='https://xxxxxxxxxxxx-8000.proxy.runpod.net/openai/v1')

client=OpenAI(base_url='localhost:8000/openai/v1')

client=OpenAI(base_url='http://0.0.0.0:8000/openai/v1')
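Note that the second base_url above has no http:// scheme, which by itself makes the OpenAI client raise APIConnectionError. For reference, a minimal working sketch against a local SGLang server, assuming the weights have finished loading and the OpenAI-compatible route is /v1 as in the local curl commands (per api_key=None in the server_args, the key is only checked if the server was started with --api-key):

```python
# Minimal sketch of an OpenAI-compatible completion against a local
# SGLang server; the base_url must include the scheme and the /v1 path.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="EMPTY",  # placeholder; only checked if the server sets --api-key
)

resp = client.completions.create(
    model="NousResearch/Meta-Llama-3.1-70B-Instruct",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(resp.choices[0].text)
```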

Environment

The environment was set up with the one-click RunPod template.

merrymercy commented 1 month ago

cc @RonanKMcGovern

RonanKMcGovern commented 1 month ago

Thanks Lianmin (@merrymercy).

@kovern this is a question about the template rather than SGLang necessarily. I'll answer here, and feel free to follow up with a new issue in one-click-llms if the issues persist.

  1. What GPUs did you select? Did you ensure that the tensor-parallel setting in the template matches the number of GPUs you selected? (See the pre-flight sketch after this list.)

  2. You should see a message in the logs like "the server is ready to roll". I don't see that in your logs; it looks like the download is not complete.

  3. I just ran on 2xA100. I had to wait a bit for weight downloads and ensure there was 200 GB of disk space (I've now set the template default to 250 GB).

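For points 1 and 3, a hedged pre-flight sketch that checks the GPU count against the tensor-parallel degree and the free disk space before launching; the volume path and thresholds are assumptions, not template defaults:

```python
# Pre-flight checks: tp_size must not exceed the visible GPUs
# (tp_size=2 in the server_args above), and the volume needs room
# for the weight download (~200 GB for the 70B model per the comment
# above). Adjust TP_SIZE and VOLUME to your pod.
import shutil
import torch

TP_SIZE = 2            # must match the template's tensor-parallel setting
MIN_FREE_GB = 200      # rough requirement for the 70B weights
VOLUME = "/workspace"  # typical RunPod volume mount (assumption)

gpus = torch.cuda.device_count()
assert gpus >= TP_SIZE, f"tp_size={TP_SIZE} but only {gpus} GPU(s) visible"

free_gb = shutil.disk_usage(VOLUME).free / 1e9
assert free_gb >= MIN_FREE_GB, f"only {free_gb:.0f} GB free on {VOLUME}"
print(f"OK: {gpus} GPUs visible, {free_gb:.0f} GB free")
```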

Thanks

RonanKMcGovern commented 1 month ago

Looks like the model name is wrong. There is no 3.2 for Llama 8B.

Swap to 3.1

If that’s an error in the template I’ll fix it later tonight. Thanks
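To confirm the exact name the server registered (which is what the "model" field in requests must match), you can list the models from the OpenAI-compatible endpoint once the server is up; a minimal sketch, assuming a local server on port 8000:

```python
# List the served model IDs so the "model" field can be copied exactly;
# a mismatched name (e.g. 3.2 instead of 3.1) is rejected by the server.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
for m in client.models.list():
    print(m.id)
```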

On Sat 26 Oct 2024 at 02:00, Robert Czikkel wrote:

Same issue here; the pod has been deploying for 20 minutes now and the same message is being logged to the console:

[Screenshot: console log] https://github.com/user-attachments/assets/69d087d4-603a-40ae-ba88-2a9c0824e304

RunPod template:

[Screenshot: RunPod template settings] https://github.com/user-attachments/assets/54f93a35-31a5-4009-9ea2-9f4f0cd58d37

1 x H100 SXM, 26 vCPU, 251 GB RAM
