sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/
Apache License 2.0

[Bug] T4 Crash #1325

Open Abdulhanan535 opened 1 week ago

Abdulhanan535 commented 1 week ago

Describe the bug

WARNING 09-04 10:16:13 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.
[10:16:15] When using sliding window in gemma-2, turn on flashinfer.
[10:16:15] server_args=ServerArgs(model_path='lemon07r/Gemma-2-Ataraxy-9B', tokenizer_path='lemon07r/Gemma-2-Ataraxy-9B', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=8192, quantization=None, served_model_name='lemon07r/Gemma-2-Ataraxy-9B', chat_template=None, is_embedding=False, host='127.0.0.1', port=2222, additional_ports=[2223, 2224, 2225, 2226], mem_fraction_static=0.95, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=375273475, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=True, enable_mla=False, triton_attention_reduce_in_fp32=False, nccl_init_addr=None, nnodes=1, node_rank=None)
/opt/conda/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
[10:16:16 TP0] Init nccl begin.
[10:16:16 TP1] Init nccl begin.
INFO 09-04 10:16:17 custom_all_reduce_utils.py:203] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/sglang/launch_server.py", line 19, in <module>
    raise e
  File "/opt/conda/lib/python3.10/site-packages/sglang/launch_server.py", line 17, in <module>
    launch_server(server_args)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/server.py", line 365, in launch_server
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/controller_single.py", line 149, in start_controller_process
    controller = ControllerSingle(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/controller_single.py", line 83, in __init__
    self.tp_server = ModelTpServer(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 99, in __init__
    self.model_runner = ModelRunner(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 109, in __init__
    min_per_gpu_memory = self.init_torch_distributed()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 140, in init_torch_distributed
    initialize_model_parallel(tensor_model_parallel_size=self.tp_size)
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
    _TP = init_model_parallel_group(group_ranks,
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
    return GroupCoordinator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 164, in __init__
    self.ca_comm = CustomAllreduce(
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 130, in __init__
    if not _can_p2p(rank, world_size):
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 31, in _can_p2p
    if not gpu_p2p_access_check(rank, i):
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 227, in gpu_p2p_access_check
    result = pickle.loads(returned.stdout)
_pickle.UnpicklingError: invalid load key, 'W'.
, detoken_init_state: init ok
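A possible reading of the final error, offered as an assumption rather than a confirmed diagnosis: the traceback ends in pickle.loads(returned.stdout), and the reported load key 'W' matches the first character of the "WARNING ... deprecated pynvml" line at the top of the log. If that warning is also printed to the stdout of the child process that performs the P2P check, the captured bytes would begin with text rather than pickle data. The minimal sketch below reproduces only that failure mode with the standard library; the payload value and the try_load helper are hypothetical and are not taken from vllm.

```python
import pickle

# Hypothetical stand-in for the child's result; the real payload is not shown in the issue.
payload = pickle.dumps([True, True])

# Assumed scenario: a log line lands on stdout ahead of the pickled bytes.
polluted = b"WARNING 09-04 10:16:13 cuda.py:22] You are using a deprecated pynvml package.\n" + payload


def try_load(data):
    """Return the unpickled object, or a string describing the unpickling error."""
    try:
        return pickle.loads(data)
    except pickle.UnpicklingError as exc:
        return f"UnpicklingError: {exc}"


clean_result = try_load(payload)
polluted_result = try_load(polluted)
print(clean_result)
print(polluted_result)
```

If this reading is right, following the warning's own advice (installing nvidia-ml-py in place of pynvml) would be a reasonable first thing to try, though the thread does not confirm that it resolves the crash.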

Reproduction

command = [
    "python -m sglang.launch_server",
    "--model-path", model,
    "--port 2222",
    "--context-length", str(context_length),
    "--tp 2",
    "--disable-cuda-graph",
    "--mem-fraction-static 0.95",
    "--enable-p2p-check"
]
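The list above mixes single tokens with space-joined pairs such as "--port 2222", so it presumably gets joined into one shell string before it is run. For anyone reproducing this with subprocess and no shell, a sketch with one token per element might look like the following. The build_command helper is hypothetical; the flags are the ones from the report, and the model name and context length are taken from the server_args line in the log.

```python
import shlex


def build_command(model, context_length, port=2222):
    """Build the launch command as an argv list, one token per element."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model,
        "--port", str(port),
        "--context-length", str(context_length),
        "--tp", "2",
        "--disable-cuda-graph",
        "--mem-fraction-static", "0.95",
        "--enable-p2p-check",
    ]


command = build_command("lemon07r/Gemma-2-Ataraxy-9B", 8192)

# shlex.join gives the equivalent single-line shell command for copy and paste.
print(shlex.join(command))
```

The list form can be passed directly to subprocess.run(command) without shell=True.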

Environment

.

zhyncs commented 1 week ago

Please provide detailed env info with python3 -m sglang.check_env

Abdulhanan535 commented 1 week ago

WARNING 09-04 10:31:49 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
CUDA available: True
GPU 0,1: Tesla T4
GPU 0,1 Compute Capability: 7.5
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.107
CUDA Driver Version: 550.90.07
PyTorch: 2.4.0+cu121
sglang: 0.3.0
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 9.5.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1
NVIDIA Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-3             0               N/A
GPU1    PHB      X      0-3             0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576

Abdulhanan535 commented 1 week ago

eh?

zhyncs commented 1 week ago

T4 is sm75, and we support it as of https://github.com/sgl-project/sglang/pull/1233. It works on a GCP T4 with Qwen2 1.5B Instruct. You may have identified other issues on T4, but in my opinion they are not high priority. If I have the bandwidth, I might consider reviewing them.

Abdulhanan535 commented 1 week ago

you mean it's not supported?

zhyncs commented 1 week ago

Nope, what I mean is that we support T4, and we have previously verified on GCP T4 with Qwen 2 1.5B without any issues. Regarding your problem, I don't think it's caused by lack of support, but currently, I don't consider investigating the issue you mentioned a high priority for me. Thanks.

Abdulhanan535 commented 1 week ago

okay.