Open Abdulhanan535 opened 1 week ago
Please provide detailed env info with python3 -m sglang.check_env
WARNING 09-04 10:31:49 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
CUDA available: True
GPU 0,1: Tesla T4
GPU 0,1 Compute Capability: 7.5
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.107
CUDA Driver Version: 550.90.07
PyTorch: 2.4.0+cu121
sglang: 0.3.0
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 9.5.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1
NVIDIA Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-3             0               N/A
GPU1    PHB      X      0-3             0               N/A
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1048576
eh?
T4 is sm75, and we support it as of https://github.com/sgl-project/sglang/pull/1233. It works on GCP T4 with Qwen2 1.5B Instruct. Any other issues specific to T4 are, in my opinion, not high priority. If I have the bandwidth, I might consider reviewing them.
you mean it's not supported?
Nope, what I mean is that we support T4, and we have previously verified on GCP T4 with Qwen 2 1.5B without any issues. Regarding your problem, I don't think it's caused by lack of support, but currently, I don't consider investigating the issue you mentioned a high priority for me. Thanks.
okay.
Checklist
Describe the bug
WARNING 09-04 10:16:13 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.
[10:16:15] When using sliding window in gemma-2, turn on flashinfer.
[10:16:15] server_args=ServerArgs(model_path='lemon07r/Gemma-2-Ataraxy-9B', tokenizer_path='lemon07r/Gemma-2-Ataraxy-9B', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', kv_cache_dtype='auto', trust_remote_code=False, context_length=8192, quantization=None, served_model_name='lemon07r/Gemma-2-Ataraxy-9B', chat_template=None, is_embedding=False, host='127.0.0.1', port=2222, additional_ports=[2223, 2224, 2225, 2226], mem_fraction_static=0.95, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=375273475, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=True, enable_mla=False, triton_attention_reduce_in_fp32=False, nccl_init_addr=None, nnodes=1, node_rank=None)
/opt/conda/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
[10:16:16 TP0] Init nccl begin.
[10:16:16 TP1] Init nccl begin.
INFO 09-04 10:16:17 custom_all_reduce_utils.py:203] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/sglang/launch_server.py", line 19, in <module>
    raise e
  File "/opt/conda/lib/python3.10/site-packages/sglang/launch_server.py", line 17, in <module>
    launch_server(server_args)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/server.py", line 365, in launch_server
    raise RuntimeError(
RuntimeError: Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/controller_single.py", line 149, in start_controller_process
    controller = ControllerSingle(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/controller_single.py", line 83, in __init__
    self.tp_server = ModelTpServer(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 99, in __init__
    self.model_runner = ModelRunner(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 109, in __init__
    min_per_gpu_memory = self.init_torch_distributed()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 140, in init_torch_distributed
    initialize_model_parallel(tensor_model_parallel_size=self.tp_size)
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
    _TP = init_model_parallel_group(group_ranks,
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
    return GroupCoordinator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 164, in __init__
    self.ca_comm = CustomAllreduce(
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 130, in __init__
    if not _can_p2p(rank, world_size):
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 31, in _can_p2p
    if not gpu_p2p_access_check(rank, i):
  File "/opt/conda/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 227, in gpu_p2p_access_check
    result = pickle.loads(returned.stdout)
_pickle.UnpicklingError: invalid load key, 'W'.
, detoken_init_state: init ok
Reproduction
command = [
    "python -m sglang.launch_server",
    "--model-path", model,
    "--port 2222",
    "--context-length", str(context_length),
    "--tp 2",
    "--disable-cuda-graph",
    "--mem-fraction-static 0.95",
    "--enable-p2p-check",
]
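The list as posted mixes multi-word strings ("--port 2222") with single tokens, so it presumably only works if it is joined into one shell string; the report does not show how `command` is executed. For anyone trying to reproduce this without a shell, here is a minimal sketch of the same flags with one token per element. The helper name `build_command` is hypothetical, and the model name and context length are taken from the `server_args` line in the log.

```python
import shlex


def build_command(model: str, context_length: int) -> list[str]:
    """Return the launch command with one argv token per list element."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model,
        "--port", "2222",
        "--context-length", str(context_length),
        "--tp", "2",
        "--disable-cuda-graph",
        "--mem-fraction-static", "0.95",
        "--enable-p2p-check",
    ]


# Values copied from the server_args line in the reported log.
argv = build_command("lemon07r/Gemma-2-Ataraxy-9B", 8192)

# Shell-style rendering, handy for pasting into a terminal.
print(shlex.join(argv))
```

A list in this form can be passed directly to `subprocess.run(argv)`, which avoids shell quoting questions.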
Environment
.
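A note on the `_pickle.UnpicklingError: invalid load key, 'W'.` at the end of the traceback: the failing call is `pickle.loads(returned.stdout)`, and the first thing printed in this run is a line starting with `WARNING`. One plausible reading, not confirmed by anyone in this thread, is that log text reached the child process's stdout ahead of the pickled result. The sketch below only demonstrates that mechanism with made-up bytes; `try_load` and the payload are hypothetical and not taken from vllm.

```python
import pickle

# Hypothetical stand-in for the child's stdout: a log line followed by the pickled result.
warning = b"WARNING 09-04 10:16:13 cuda.py:22] You are using a deprecated pynvml package.\n"
payload = pickle.dumps([True, True])
polluted_stdout = warning + payload


def try_load(data: bytes) -> str:
    """Return 'ok' if the bytes unpickle, otherwise the error message."""
    try:
        pickle.loads(data)
        return "ok"
    except pickle.UnpicklingError as exc:
        return str(exc)


clean_result = try_load(payload)              # pickle bytes only
polluted_result = try_load(polluted_stdout)   # log text in front of the pickle bytes

print(clean_result)
print(polluted_result)
```

If this is what happened here, the error would be unrelated to T4 support as such, which would be consistent with the maintainer's comment.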