kovern opened this issue 1 month ago
cc @RonanKMcGovern
Thanks Lianmin (@merrymercy).
@kovern this is a question about the template rather than SGLang necessarily. I'll answer here, and feel free to follow up with a new issue in one-click-llms if the issues persist.
What GPUs did you select? Did you ensure that the tensor parallel setting in the template matches the number of GPUs you selected?
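For context, the template boils down to an SGLang launch command along these lines. This is a sketch reconstructed from the server_args dump in the logs further down, not the template's literal script; the key point is that --tp-size must equal the number of GPUs attached to the pod:

```bash
# Sketch of the launch command behind the template, reconstructed from the
# server_args printed in the logs below. --tp-size must match the number of
# GPUs on the pod (here 2, e.g. for 2xA100 or 2xH100).
python -m sglang.launch_server \
  --model-path NousResearch/Meta-Llama-3.1-70B-Instruct \
  --tp-size 2 \
  --host 0.0.0.0 \
  --port 8000 \
  --context-length 8192 \
  --quantization fp8
```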
You should see a message in the logs like "the server is ready to roll" or similar. I don't see that in your logs; it looks like the download is not complete.
I just ran on 2xA100. I had to wait a bit for weight downloads and ensure there was 200 GB of disk space (I've now set the template default to 250 GB).
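If you want to confirm free disk space from inside the pod while the weights download, a quick check (assuming the usual RunPod volume mount at /workspace; adjust the path if your template differs):

```bash
# Show free space on the volume where the weights land;
# /workspace is the usual RunPod volume mount (an assumption - adjust as needed).
df -h /workspace
```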
Thanks
Looks like the model name is wrong. There is no 3.2 for Llama 8B.
Swap to 3.1
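Concretely, that means pointing the template at a model repo that actually exists; Llama 3.2 has no 8B size. A sketch, treating MODEL as an illustrative name for whatever env var or field the template exposes:

```bash
# "MODEL" is illustrative - use whichever field the template actually exposes.
# Llama 3.2 has no 8B size; the 8B checkpoints are in the 3.1 family.
MODEL="NousResearch/Meta-Llama-3.1-8B-Instruct"
```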
If that’s an error in the template I’ll fix it later tonight. Thanks
On Sat 26 Oct 2024 at 02:00, Robert Czikkel wrote:
Same issue here, the pod has been deploying for 20 minutes now and the same message is being logged to the console:
Screenshot (2024-10-26 00:49:15): https://github.com/user-attachments/assets/69d087d4-603a-40ae-ba88-2a9c0824e304
RunPod template:
Screenshot (2024-10-26 00:50:26): https://github.com/user-attachments/assets/54f93a35-31a5-4009-9ea2-9f4f0cd58d37
1 x H100 SXM, 26 vCPU, 251 GB RAM
Describe the bug
Tried to use https://github.com/TrelisResearch/one-click-llms for Llama 3.1 70B inference on RunPod.
I've tried several models (Llama 3.1 8B, 70B, ...) with several hardware configurations (2xH100, 4xA100, ...) in several regions.
After the server started, I could see from the GPU memory saturation that the service was running, yet I could not access the API.
I tried from outside with the given RunPod server address, using both curl and the Python OpenAI client, with zero success: the OpenAI client raised APIConnectionError, and curl reported Connection refused.
Then I changed strategy and tried locally from inside the pod, with a similar lack of luck: curl still failed to connect to 0.0.0.0, and the OpenAI client still raised APIConnectionError.
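One way to narrow this down from inside the pod is to check whether anything is listening on port 8000 at all, before involving the RunPod proxy. A sketch, assuming a standard Linux image with iproute2 installed and that the installed SGLang version exposes its usual /health and /get_model_info endpoints:

```bash
# Is anything listening on port 8000 inside the pod?
ss -ltnp | grep 8000

# If the server is up, these should respond (assuming this SGLang version
# exposes them; both are informational endpoints, not completions).
curl http://localhost:8000/health
curl http://localhost:8000/get_model_info
```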
Server logs from one of the runs:

```
2024-10-08T14:04:42.084177391-07:00 ==========
2024-10-08T14:04:42.084182932-07:00 == CUDA ==
2024-10-08T14:04:42.084230141-07:00 ==========
2024-10-08T14:04:42.089426700-07:00 CUDA Version 12.1.1
2024-10-08T14:04:42.090738982-07:00 Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2024-10-08T14:04:42.092023251-07:00 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2024-10-08T14:04:42.092025495-07:00 By pulling and using the container, you accept the terms and conditions of this license:
2024-10-08T14:04:42.092027369-07:00 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-10-08T14:04:42.092029813-07:00 A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2024-10-08T14:04:48.672137338-07:00 [14:04:48] server_args=ServerArgs(model_path='NousResearch/Meta-Llama-3.1-70B-Instruct', tokenizer_path='NousResearch/Meta-Llama-3.1-70B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=8192, quantization='fp8', served_model_name='NousResearch/Meta-Llama-3.1-70B-Instruct', chat_template=None, is_embedding=False, host='0.0.0.0', port=8000, mem_fraction_static=0.87, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=948742540, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', attention_backend='flashinfer', sampling_backend='flashinfer', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, enable_mixed_chunk=False, enable_torch_compile=False, max_torch_compile_bs=32, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, lora_paths=None, max_loras_per_batch=8)
2024-10-08T14:04:55.513847948-07:00 [14:04:55 TP1] Init nccl begin.
2024-10-08T14:04:55.574631614-07:00 [14:04:55 TP0] Init nccl begin.
2024-10-08T14:04:57.108305808-07:00 [14:04:57 TP0] Load weight begin. avail mem=78.38 GB
2024-10-08T14:04:57.108327449-07:00 [14:04:57 TP1] Load weight begin. avail mem=78.38 GB
2024-10-08T14:04:58.483092543-07:00 [14:04:58 TP1] lm_eval is not installed, GPTQ may not be usable
2024-10-08T14:04:58.671272847-07:00 [14:04:58 TP0] lm_eval is not installed, GPTQ may not be usable
2024-10-08T14:04:59.586841489-07:00 INFO 10-08 14:04:59 weight_utils.py:236] Using model weights format ['.safetensors']
2024-10-08T14:04:59.668187687-07:00 INFO 10-08 14:04:59 weight_utils.py:236] Using model weights format ['.safetensors']
```
The log looks fine. But when I tried the exact curl command provided in the help of the template:

```bash
curl https://xxxxxxxxxxxxx-8000.proxy.runpod.net/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "NousResearch/Meta-Llama-3.1-70B-Instruct","prompt": "San Francisco is a","max_tokens": 7,"temperature": 0}'
```

I got error code 502, which indicates a server-side problem.
I also tried the one-click vllm versions. Only the llama 3.1 8b model seems to work.
Reproduction
I even tried the OpenAI endpoints:
```bash
curl https://xxxxxxxxxxxx-8000.proxy.runpod.net/openai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "NousResearch/Meta-Llama-3.1-70B-Instruct", "prompt": "Say this is a test", "temperature": 0}'

curl localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "NousResearch/Meta-Llama-3.1-70B-Instruct", "prompt": "Say this is a test", "temperature": 0}'

curl http://0.0.0.0:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "NousResearch/Meta-Llama-3.1-70B-Instruct", "prompt": "Say this is a test", "temperature": 0}'
```
And the Python client variants:

```python
from openai import OpenAI

# api_key is a placeholder; the server was launched without --api-key (api_key=None in server_args)
client = OpenAI(base_url='https://xxxxxxxxxxxx-8000.proxy.runpod.net/openai/v1', api_key='EMPTY')
client = OpenAI(base_url='http://localhost:8000/openai/v1', api_key='EMPTY')  # original lacked the http:// scheme
client = OpenAI(base_url='http://0.0.0.0:8000/openai/v1', api_key='EMPTY')
```
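A lighter-weight sanity check than a full completion is to list the served models and confirm the returned id matches the "model" string used above. A sketch, assuming the OpenAI-compatible /v1/models route is available in this SGLang build:

```bash
# The "id" returned here must match the "model" field in the requests above.
curl http://localhost:8000/v1/models
```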
Environment
The environment was set up with the one-click RunPod template.