vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: 500 Internal Server Error when calling v1/completions and v1/chat/completions with vllm/vllm-openai:v0.6.2 on K8s #9193

Closed · apexx77 closed this issue 1 week ago

apexx77 commented 2 weeks ago

Your current environment

The output of `python collect_env.py`:

```text
$ python3 collect_env.py
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.12.6 (main, Sep 10 2024, 00:05:17) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-427.28.1.el9_4.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40S
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Gold 6448Y
Stepping: 8
CPU MHz: 4100.000
CPU max MHz: 4100.0000
CPU min MHz: 800.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 3 MiB
L1i cache: 2 MiB
L2 cache: 128 MiB
L3 cache: 120 MiB
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hfi vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.dev238+ge2c6e0a82
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    0,2,4,6,8,10  0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps

No response

🐛 Describe the bug

Getting "500 Internal Server Error" while calling v1/completions and v1/chat/completions endpoints when deployed on Kubernetes. Remaining endpoints sych as tokenize and v1/models are working as expected. Followed the deployment guide provided here.

INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
DEBUG 10-09 05:19:37 client.py:148] Heartbeat successful.
INFO:     10.121.X.X:33112 - "GET /health HTTP/1.1" 200 OK
INFO:     10.121.X.X:33122 - "GET /v1/models HTTP/1.1" 200 OK
INFO 10-09 05:19:37 logger.py:36] Received request tokn-788eaaca463f4228a9458b94a97bb373: prompt: 'Sample sentence for API testing', params: None, prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.121.X.X:33124 - "POST /tokenize HTTP/1.1" 200 OK
INFO:     10.121.X.X:33134 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO:     10.121.X.X:33142 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
DEBUG 10-09 05:19:39 client.py:148] Heartbeat successful.
DEBUG 10-09 05:19:41 client.py:148] Heartbeat successful.
DEBUG 10-09 05:19:43 client.py:148] Heartbeat successful.
DEBUG 10-09 05:19:44 client.py:164] Waiting for output from MQLLMEngine.
DEBUG 10-09 05:19:45 client.py:148] Heartbeat successful.
DEBUG 10-09 05:19:45 engine.py:212] Waiting for new requests in engine loop.
DEBUG 10-09 05:19:47 client.py:148] Heartbeat successful.
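
For reference, the endpoints were exercised with requests along these lines (an illustrative sketch, not the exact payloads; `<service-host>` stands for the cluster service address, and the model name is the one passed to `vllm serve` below):

```bash
# These work as expected:
curl http://<service-host>:8000/v1/models
curl http://<service-host>:8000/tokenize \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8", "prompt": "Sample sentence for API testing"}'

# Both of these return 500 Internal Server Error:
curl http://<service-host>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8", "prompt": "Sample sentence for API testing", "max_tokens": 16}'
curl http://<service-host>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8", "messages": [{"role": "user", "content": "hello"}]}'
```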

Here is the YAML used to create the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-app
  name: vllm-deployment
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-app
  template:
    metadata:
      labels:
        app: vllm-app
    spec:
      containers:
      - command: ["/bin/sh", "-c"]
        args: ["vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 --trust-remote-code --enable-chunked-prefill --max-model-len 16384 --disable-log-stats"]
        image: vllm/vllm-openai:v0.6.2
        env:
        - name: HF_HOME
          value: /home/.cache/huggingface
        - name: VLLM_CONFIG_ROOT
          value: /home/.cache/vllm
        - name: VLLM_CACHE_ROOT
          value: /home/.cache/vllm
        - name: VLLM_LOGGING_LEVEL
          value: DEBUG
        - name: VLLM_TRACE_FUNCTION
          value: "1"
        imagePullPolicy: Always
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 2
            memory: 20Gi
          requests:
            nvidia.com/gpu: 1
            cpu: 2
            memory: 20Gi
        volumeMounts:
        - mountPath: /home/.cache/
          name: cache-volume
        - mountPath: /dev/shm
          name: shm
      volumes:
      - name: cache-volume 
        persistentVolumeClaim:
          claimName: llama3-1-8b
      - name: shm
        emptyDir: 
          medium: Memory
          sizeLimit: "4Gi"

There is no detailed trace for the error even after setting VLLM_LOGGING_LEVEL and VLLM_TRACE_FUNCTION env variables.

Do we need to change any configuration to get it working as expected?


s-sajid-ali commented 1 week ago

@apexx77: Could you elaborate on how this issue was fixed? I'm seeing the same issue with vllm@0.6.3:

Relevant trace:

TRACE:    127.0.0.1:59256 - HTTP connection made
TRACE:    127.0.0.1:59256 - ASGI [406] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.4'}, 'http_version': '1.1', 'server': ('127.0.0.1', 8000), 'client': ('127.0.0.1', 59256), 'scheme': 'http', 'root_path': '', 'headers': '<...>', 'state': {}, 'method': 'POST', 'path': '/v1/completions', 'raw_path': b'/v1/completions', 'query_string': b''}
TRACE:    127.0.0.1:59256 - ASGI [406] Receive {'type': 'http.request', 'body': '<148 bytes>', 'more_body': False}
INFO 10-16 09:13:55 logger.py:37] Received request cmpl-6e6e9369338a45a784d11cdb1f825a6d-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [1, 3087, 8970, 338, 263], lora_request: None, prompt_adapter_request: None.
DEBUG 10-16 09:13:55 async_llm_engine.py:525] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
TRACE:    127.0.0.1:59256 - ASGI [406] Send {'type': 'http.response.start', 'status': <HTTPStatus.INTERNAL_SERVER_ERROR: 500>, 'headers': '<...>'}
INFO:     127.0.0.1:59256 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
TRACE:    127.0.0.1:59256 - ASGI [406] Send {'type': 'http.response.body', 'body': '<0 bytes>'}
TRACE:    127.0.0.1:59256 - ASGI [406] Completed
apexx77 commented 1 week ago

My issue turned out to be a couple of permission errors, which were not being logged in v0.6.2 but do get logged in v0.6.3. Try increasing verbosity by setting VLLM_LOGGING_LEVEL to DEBUG and enabling VLLM_TRACE_FUNCTION. Maybe this will help!
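
For example, something like this when launching the server (a sketch only; the model and serve flags here are the ones from my deployment above, so substitute your own):

```bash
# Turn up verbosity and enable per-process function tracing before starting vLLM.
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1   # trace logs land under /tmp/vllm/vllm-instance-<id>/

vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --trust-remote-code \
  --enable-chunked-prefill \
  --max-model-len 16384 \
  --disable-log-stats
```

On Kubernetes the same two variables go into the container's `env` section, as in the manifest above.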

s-sajid-ali commented 1 week ago

Thanks @apexx77!

Cognitus-Stuti commented 4 hours ago

@s-sajid-ali @apexx77 I'm facing the same issue with v0.6.3; VLLM_LOGGING_LEVEL has already been set to DEBUG.

apexx77 commented 4 hours ago

@Cognitus-Stuti Can you share the server-side logs for the different endpoints?

Cognitus-Stuti commented 2 hours ago

@apexx77

2024-10-23T17:56:46.209003136Z DEBUG 10-23 10:56:46 engine.py:213] Waiting for new requests in engine loop.
2024-10-23T17:56:46.849028715Z INFO:     127.0.0.6:58625 - "GET /health HTTP/1.1" 200 OK
2024-10-23T17:56:46.849345223Z INFO:     127.0.0.6:41309 - "GET /health HTTP/1.1" 200 OK
2024-10-23T17:56:48.001119234Z DEBUG 10-23 10:56:48 client.py:154] Heartbeat successful.
2024-10-23T17:56:50.001894818Z DEBUG 10-23 10:56:50 client.py:154] Heartbeat successful.
2024-10-23T17:56:52.002048929Z DEBUG 10-23 10:56:52 client.py:154] Heartbeat successful.
2024-10-23T17:56:52.552799029Z INFO 10-23 10:56:52 logger.py:37] Received request cmpl-8b95632afe9d47d4b2dbc83fa3b84ef1-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 24661, 13175, 374, 264], lora_request: None, prompt_adapter_request: None.
2024-10-23T17:56:52.553199492Z DEBUG 10-23 10:56:52 async_llm_engine.py:523] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
2024-10-23T17:56:52.720697071Z INFO:     127.0.0.1:35284 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
2024-10-23T17:56:53.920775856Z INFO 10-23 10:56:53 logger.py:37] Received request cmpl-26c1ca56027e4438a652d88e3f956e28-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 24661, 13175, 374, 264], lora_request: None, prompt_adapter_request: None.
2024-10-23T17:56:53.921120439Z DEBUG 10-23 10:56:53 async_llm_engine.py:523] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
2024-10-23T17:56:53.923515159Z INFO:     127.0.0.1:35284 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
2024-10-23T17:56:54.002195917Z DEBUG 10-23 10:56:54 client.py:154] Heartbeat successful.
2024-10-23T17:56:55.143641042Z DEBUG 10-23 10:56:55 client.py:170] Waiting for output from MQLLMEngine.
2024-10-23T17:56:55.871292495Z INFO 10-23 10:56:55 logger.py:37] Received request cmpl-15d0e6644c3d4179adf03bcb66413e74-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 24661, 13175, 374, 264], lora_request: None, prompt_adapter_request: None.
2024-10-23T17:56:55.871382909Z DEBUG 10-23 10:56:55 async_llm_engine.py:523] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
2024-10-23T17:56:55.874380202Z INFO:     127.0.0.1:35284 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
2024-10-23T17:56:56.002494655Z DEBUG 10-23 10:56:56 client.py:154] Heartbeat successful.
2024-10-23T17:56:56.222440545Z INFO 10-23 10:56:56 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
2024-10-23T17:56:56.230686930Z DEBUG 10-23 10:56:56 engine.py:213] Waiting for new requests in engine loop.
2024-10-23T17:56:56.849636520Z INFO:     127.0.0.6:47585 - "GET /health HTTP/1.1" 200 OK
2024-10-23T17:56:56.849884187Z INFO:     127.0.0.6:49583 - "GET /health HTTP/1.1" 200 OK
2024-10-23T17:56:58.002600344Z DEBUG 10-23 10:56:58 client.py:154] Heartbeat successful.
2024-10-23T17:57:00.003167664Z DEBUG 10-23 10:57:00 client.py:154] Heartbeat successful.
2024-10-23T17:57:02.003177236Z DEBUG 10-23 10:57:02 client.py:154] Heartbeat successful.
2024-10-23T17:57:04.003650725Z DEBUG 10-23 10:57:04 client.py:154] Heartbeat successful.
2024-10-23T17:57:05.144441188Z DEBUG 10-23 10:57:05 client.py:170] Waiting for output from MQLLMEngine.
2024-10-23T17:57:06.003778954Z DEBUG 10-23 10:57:06 client.py:154] Heartbeat successful.
2024-10-23T17:57:06.242990180Z INFO 10-23 10:57:06 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
2024-10-23T17:57:06.251294485Z DEBUG 10-23 10:57:06 engine.py:213] Waiting for new requests in engine loop.
2024-10-23T17:57:06.849293249Z INFO:     127.0.0.6:60701 - "GET /health HTTP/1.1" 200 OK
2024-10-23T17:57:06.849553784Z INFO:     127.0.0.6:60065 - "GET /health HTTP/1.1" 200 OK
2024-10-23T17:57:08.003912244Z DEBUG 10-23 10:57:08 client.py:154] Heartbeat successful.
2024-10-23T17:57:08.093336071Z INFO 10-23 10:57:08 logger.py:37] Received request chat-c63fb2801fdd47bf95478436ef4fbd06: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 23 Oct 2024\n\n[{'type': 'text', 'text': 'hi'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5950, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1419, 5020, 220, 2366, 19, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 6151, 8439, 60, 128009, 128006, 882, 128007, 271, 15339, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
2024-10-23T17:57:08.093456461Z DEBUG 10-23 10:57:08 async_llm_engine.py:523] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
2024-10-23T17:57:08.095711350Z INFO:     127.0.0.1:51956 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
2024-10-23T17:57:10.004081141Z DEBUG 10-23 10:57:10 client.py:154] Heartbeat successful.
2024-10-23T17:57:12.004698727Z DEBUG 10-23 10:57:12 client.py:154] Heartbeat successful.
2024-10-23T17:57:14.004534482Z DEBUG 10-23 10:57:14 client.py:154] Heartbeat successful.
2024-10-23T17:57:15.144883312Z DEBUG 10-23 10:57:15 client.py:170] Waiting for output from MQLLMEngine.
apexx77 commented 2 hours ago

Well, not much can be inferred from the logs above. Try setting VLLM_TRACE_FUNCTION, and also check whether the model is loaded correctly.
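
For a quick sanity check that the engine came up and the frontend is reachable, something like this (a sketch; `<served-model-name>` is a placeholder for whatever model you are serving, and it assumes the server is listening on the default port 8000):

```bash
# Should list the served model if the engine initialized correctly.
curl http://127.0.0.1:8000/v1/models

# /tokenize worked in the reports above, so it is a useful comparison against the failing endpoints.
curl http://127.0.0.1:8000/tokenize \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model-name>", "prompt": "Sample sentence for API testing"}'
```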

Cognitus-Stuti commented 2 hours ago

@apexx77 I did set VLLM_TRACE_FUNCTION as well. The model has been loaded, and all endpoints other than /v1/chat/completions and /v1/completions work as expected.

Cognitus-Stuti commented 1 hour ago

@apexx77 Startup logs:

DEBUG 10-23 12:29:43 scripts.py:138] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
 INFO 10-23 12:29:43 api_server.py:528] vLLM API server version 0.6.3.post1
 INFO 10-23 12:29:43 api_server.py:529] args: Namespace(subparser='serve', model_tag='meta-llama/Llama-3.2-11B-Vision-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=6000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.97, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=5, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7f740aefe660>)
2024-10-23T19:29:43.296638509Z INFO 10-23 12:29:43 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/71829397-230a-4347-aa1b-2bb738ce67ce for IPC Path.
2024-10-23T19:29:43.297995107Z INFO 10-23 12:29:43 api_server.py:179] Started engine process with PID 61
2024-10-23T19:29:43.708812655Z WARNING 10-23 12:29:43 config.py:1668] Casting torch.bfloat16 to torch.float16.
2024-10-23T19:29:48.029282532Z WARNING 10-23 12:29:48 config.py:1668] Casting torch.bfloat16 to torch.float16.
2024-10-23T19:29:49.486792952Z INFO 10-23 12:29:49 config.py:905] Defaulting to use mp for distributed inference
2024-10-23T19:29:49.486845283Z WARNING 10-23 12:29:49 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
2024-10-23T19:29:49.486947053Z WARNING 10-23 12:29:49 config.py:395] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-23 12:29:53 config.py:905] Defaulting to use mp for distributed inference
2024-10-23T19:29:53.491404830Z WARNING 10-23 12:29:53 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
2024-10-23T19:29:53.491515341Z WARNING 10-23 12:29:53 config.py:395] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-23 12:29:53 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='meta-llama/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=6000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.2-11B-Vision-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-23 12:29:54 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-23 12:29:54 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 10-23 12:29:54 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 10-23 12:29:54 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b2a0f4e7cfb1470ba3fa5810f3b885a0/VLLM_TRACE_FUNCTION_for_process_61_thread_140173577913472_at_2024-10-23_12:29:54.213561.log
INFO 10-23 12:29:54 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-23 12:29:54 selector.py:115] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
DEBUG 10-23 12:29:56 parallel_state.py:929] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:33119 backend=nccl
(VllmWorkerProcess pid=273) WARNING 10-23 12:29:58 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
2024-10-23T19:29:58.542768069Z (VllmWorkerProcess pid=273) INFO 10-23 12:29:58 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b2a0f4e7cfb1470ba3fa5810f3b885a0/VLLM_TRACE_FUNCTION_for_process_273_thread_139667229246592_at_2024-10-23_12:29:58.542421.log
(VllmWorkerProcess pid=272) WARNING 10-23 12:29:58 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=272) INFO 10-23 12:29:58 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b2a0f4e7cfb1470ba3fa5810f3b885a0/VLLM_TRACE_FUNCTION_for_process_272_thread_140361468826752_at_2024-10-23_12:29:58.543437.log
(VllmWorkerProcess pid=271) WARNING 10-23 12:29:58 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=271) INFO 10-23 12:29:58 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-b2a0f4e7cfb1470ba3fa5810f3b885a0/VLLM_TRACE_FUNCTION_for_process_271_thread_139787932144768_at_2024-10-23_12:29:58.544124.log
(VllmWorkerProcess pid=273) INFO 10-23 12:29:58 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=272) INFO 10-23 12:29:58 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=271) INFO 10-23 12:29:58 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=273) INFO 10-23 12:29:58 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=272) INFO 10-23 12:29:58 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=271) INFO 10-23 12:29:58 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=271) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-23T19:29:59.013901828Z (VllmWorkerProcess pid=271)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=272) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-23T19:29:59.017964113Z (VllmWorkerProcess pid=272)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=273) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-23T19:29:59.019057254Z (VllmWorkerProcess pid=273)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=271) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-23T19:29:59.333367097Z (VllmWorkerProcess pid=271)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=272) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-23T19:29:59.340469214Z (VllmWorkerProcess pid=272)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=273) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-23T19:29:59.342558776Z (VllmWorkerProcess pid=273)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=272) INFO 10-23 12:29:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=273) INFO 10-23 12:29:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=271) INFO 10-23 12:29:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 10-23 12:30:00 client.py:170] Waiting for output from MQLLMEngine.
(VllmWorkerProcess pid=273) DEBUG 10-23 12:30:00 parallel_state.py:929] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:33119 backend=nccl
(VllmWorkerProcess pid=271) DEBUG 10-23 12:30:00 parallel_state.py:929] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:33119 backend=nccl
(VllmWorkerProcess pid=272) DEBUG 10-23 12:30:00 parallel_state.py:929] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:33119 backend=nccl
(VllmWorkerProcess pid=271) INFO 10-23 12:30:00 utils.py:1008] Found nccl from library libnccl.so.2
INFO 10-23 12:30:00 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=271) INFO 10-23 12:30:00 pynccl.py:63] vLLM is using nccl==2.20.5
2024-10-23T19:30:00.580578759Z (VllmWorkerProcess pid=272) INFO 10-23 12:30:00 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=273) INFO 10-23 12:30:00 utils.py:1008] Found nccl from library libnccl.so.2
2024-10-23T19:30:00.580821885Z INFO 10-23 12:30:00 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=272) INFO 10-23 12:30:00 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=273) INFO 10-23 12:30:00 pynccl.py:63] vLLM is using nccl==2.20.5
lax-ai-vllm-model-78d9c658c8-qlw8z:61:61 [0] NCCL INFO Bootstrap : Using eth0:10.130.6.121<0>
2024-10-23T19:30:00.592184340Z lax-ai-vllm-model-78d9c658c8-qlw8z:61:61 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
lax-ai-vllm-model-78d9c658c8-qlw8z:61:61 [0] NCCL INFO cudaDriverVersion 12040
2024-10-23T19:30:00.592184340Z NCCL version 2.20.5+cuda12.4
(VllmWorkerProcess pid=272) WARNING 10-23 12:30:00 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-10-23T19:30:00.948667703Z WARNING 10-23 12:30:00 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-10-23T19:30:00.948667703Z (VllmWorkerProcess pid=271) WARNING 10-23 12:30:00 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-10-23T19:30:00.948712413Z (VllmWorkerProcess pid=273) WARNING 10-23 12:30:00 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
DEBUG 10-23 12:30:00 shm_broadcast.py:201] Binding to tcp://127.0.0.1:54129
INFO 10-23 12:30:00 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f7b5c619100>, local_subscribe_port=54129, remote_subscribe_port=None)
(VllmWorkerProcess pid=272) DEBUG 10-23 12:30:00 shm_broadcast.py:265] Connecting to tcp://127.0.0.1:54129
2024-10-23T19:30:00.974109209Z (VllmWorkerProcess pid=273) DEBUG 10-23 12:30:00 shm_broadcast.py:265] Connecting to tcp://127.0.0.1:54129
2024-10-23T19:30:00.974109209Z (VllmWorkerProcess pid=271) DEBUG 10-23 12:30:00 shm_broadcast.py:265] Connecting to tcp://127.0.0.1:54129
(VllmWorkerProcess pid=271) INFO 10-23 12:30:01 model_runner.py:1056] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
(VllmWorkerProcess pid=272) INFO 10-23 12:30:01 model_runner.py:1056] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
INFO 10-23 12:30:01 model_runner.py:1056] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
2024-10-23T19:30:01.025416914Z (VllmWorkerProcess pid=273) INFO 10-23 12:30:01 model_runner.py:1056] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
(VllmWorkerProcess pid=272) INFO 10-23 12:30:02 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=271) INFO 10-23 12:30:02 selector.py:115] Using XFormers backend.
INFO 10-23 12:30:02 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=273) INFO 10-23 12:30:02 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=272) INFO 10-23 12:30:04 weight_utils.py:243] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=271) INFO 10-23 12:30:04 weight_utils.py:243] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=273) INFO 10-23 12:30:04 weight_utils.py:243] Using model weights format ['*.safetensors']
INFO 10-23 12:30:04 weight_utils.py:243] Using model weights format ['*.safetensors']

Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
DEBUG 10-23 12:30:10 client.py:170] Waiting for output from MQLLMEngine.

Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:09<00:37,  9.29s/it]

Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:12<00:17,  5.71s/it]
DEBUG 10-23 12:30:20 client.py:170] Waiting for output from MQLLMEngine.
DEBUG 10-23 12:30:30 client.py:170] Waiting for output from MQLLMEngine.

Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:24<00:16,  8.48s/it]
DEBUG 10-23 12:30:40 client.py:170] Waiting for output from MQLLMEngine.

Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:36<00:09,  9.80s/it]
DEBUG 10-23 12:30:50 client.py:170] Waiting for output from MQLLMEngine.

Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:47<00:00, 10.50s/it]

Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:47<00:00,  9.57s/it]

(VllmWorkerProcess pid=273) INFO 10-23 12:30:55 model_runner.py:1067] Loading model weights took 5.1560 GB
(VllmWorkerProcess pid=272) INFO 10-23 12:30:55 model_runner.py:1067] Loading model weights took 5.1560 GB
(VllmWorkerProcess pid=271) INFO 10-23 12:30:55 model_runner.py:1067] Loading model weights took 5.1560 GB
INFO 10-23 12:30:55 model_runner.py:1067] Loading model weights took 5.1560 GB
(VllmWorkerProcess pid=271) INFO 10-23 12:30:55 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=272) INFO 10-23 12:30:55 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
INFO 10-23 12:30:55 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=273) INFO 10-23 12:30:55 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO cudaDriverVersion 12040
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Bootstrap : Using eth0:10.130.6.121<0>
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO NET/IB : No device found.
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO NET/Socket : Using [0]eth0:10.130.6.121<0>
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Using non-device net plugin version 0
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Using network Socket
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO comm 0xc90f120 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1c0 commId 0xca78aff2cc1a9f9c - Init START
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO NVLS multicast support is not available on dev 1
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO comm 0xc90f120 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO P2P Chunksize set to 131072
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Connected all rings
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO Connected all trees
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:271 [1] NCCL INFO comm 0xc90f120 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1c0 commId 0xca78aff2cc1a9f9c - Init COMPLETE
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO Using non-device net plugin version 0
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO Using network Socket
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO comm 0x2619e050 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1c0 commId 0xd6df6a219581d5e6 - Init START
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO NVLS multicast support is not available on dev 1
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO comm 0x2619e050 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO P2P Chunksize set to 131072
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO Connected all rings
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO Connected all trees
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:271:402 [1] NCCL INFO comm 0x2619e050 rank 1 nranks 4 cudaDev 1 nvm
lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO cudaDriverVersion 12040
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Bootstrap : Using eth0:10.130.6.121<0>
lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO NET/IB : No device found.
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO NET/Socket : Using [0]eth0:10.130.6.121<0>
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Using non-device net plugin version 0
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Using network Socket
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO comm 0xaf2ed40 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 1e0 commId 0xca78aff2cc1a9f9c - Init START
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO NVLS multicast support is not available on dev 3
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO comm 0xaf2ed40 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO P2P Chunksize set to 131072
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Channel 00 : 3[3] -> 0[0] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Connected all rings
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Channel 00 : 3[3] -> 2[2] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Channel 01 : 3[3] -> 2[2] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO Connected all trees
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:273 [3] NCCL INFO comm 0xaf2ed40 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 1e0 commId 0xca78aff2cc1a9f9c - Init COMPLETE
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO Using non-device net plugin version 0
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO Using network Socket
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO comm 0x244decb0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 1e0 commId 0xd6df6a219581d5e6 - Init START
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO NVLS multicast support is not available on dev 3
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO comm 0x244decb0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO P2P Chunksize set to 131072
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO Channel 00 : 3[3] -> 0[0] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO Connected all rings
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO Channel 00 : 3[3] -> 2[2] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO Channel 01 : 3[3] -> 2[2] via SHM/direct/direct
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO Connected all trees
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024-10-23T19:30:56.633819710Z lax-ai-vllm-model-78d9c658c8-qlw8z:273:403 [3] NCCL INFO comm 0x244decb0 rank 3 nranks 4 cudaDev 3
lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO cudaDriverVersion 12040
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Bootstrap : Using eth0:10.130.6.121<0>
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO NET/IB : No device found.
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO NET/Socket : Using [0]eth0:10.130.6.121<0>
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Using non-device net plugin version 0
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Using network Socket
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO comm 0xc60d380 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 1d0 commId 0xca78aff2cc1a9f9c - Init START
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO NVLS multicast support is not available on dev 2
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO comm 0xc60d380 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO P2P Chunksize set to 131072
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Channel 00 : 2[2] -> 3[3] via SHM/direct/direct
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Channel 01 : 2[2] -> 3[3] via SHM/direct/direct
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Connected all rings
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Channel 00 : 2[2] -> 1[1] via SHM/direct/direct
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO Connected all trees
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:272 [2] NCCL INFO comm 0xc60d380 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 1d0 commId 0xca78aff2cc1a9f9c - Init COMPLETE
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO Using non-device net plugin version 0
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO Using network Socket
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO comm 0x25a17aa0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 1d0 commId 0xd6df6a219581d5e6 - Init START
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO NVLS multicast support is not available on dev 2
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO comm 0x25a17aa0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO P2P Chunksize set to 131072
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO Channel 00 : 2[2] -> 3[3] via SHM/direct/direct
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO Channel 01 : 2[2] -> 3[3] via SHM/direct/direct
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO Connected all rings
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO Channel 00 : 2[2] -> 1[1] via SHM/direct/direct
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO Connected all trees
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024-10-23T19:30:56.633885910Z lax-ai-vllm-model-78d9c658c8-qlw8z:272:401 [2] NCCL INFO comm 0x25a17aa0 rank 2 nranks 4 cudaDev 2 nvm
DEBUG 10-23 12:31:00 client.py:170] Waiting for output from MQLLMEngine.
INFO 10-23 12:31:04 distributed_gpu_executor.py:57] # GPU blocks: 9510, # CPU blocks: 6553
2024-10-23T19:31:04.365962961Z INFO 10-23 12:31:04 distributed_gpu_executor.py:61] Maximum concurrency for 6000 tokens per request: 25.36x
DEBUG 10-23 12:31:10 client.py:170] Waiting for output from MQLLMEngine.
DEBUG 10-23 12:31:14 engine.py:151] Starting Startup Loop.
DEBUG 10-23 12:31:14 engine.py:153] Starting heartbeat thread
DEBUG 10-23 12:31:14 engine.py:155] Starting Engine Loop.
INFO 10-23 12:31:14 api_server.py:232] vLLM to use /tmp/tmptyjvvvnc as PROMETHEUS_MULTIPROC_DIR
WARNING 10-23 12:31:14 serving_embedding.py:199] embedding_mode is False. Embedding API will not work.
2024-10-23T19:31:14.410901826Z INFO 10-23 12:31:14 launcher.py:19] Available routes are:
2024-10-23T19:31:14.410930098Z INFO 10-23 12:31:14 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 10-23 12:31:14 launcher.py:27] Route: /docs, Methods: GET, HEAD
2024-10-23T19:31:14.410971271Z INFO 10-23 12:31:14 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
2024-10-23T19:31:14.410989498Z INFO 10-23 12:31:14 launcher.py:27] Route: /redoc, Methods: GET, HEAD
2024-10-23T19:31:14.411007666Z INFO 10-23 12:31:14 launcher.py:27] Route: /health, Methods: GET
2024-10-23T19:31:14.411025063Z INFO 10-23 12:31:14 launcher.py:27] Route: /tokenize, Methods: POST
2024-10-23T19:31:14.411050915Z INFO 10-23 12:31:14 launcher.py:27] Route: /detokenize, Methods: POST
2024-10-23T19:31:14.411078352Z INFO 10-23 12:31:14 launcher.py:27] Route: /v1/models, Methods: GET
2024-10-23T19:31:14.411105313Z INFO 10-23 12:31:14 launcher.py:27] Route: /version, Methods: GET
2024-10-23T19:31:14.411152258Z INFO 10-23 12:31:14 launcher.py:27] Route: /v1/chat/completions, Methods: POST
2024-10-23T19:31:14.411152258Z INFO 10-23 12:31:14 launcher.py:27] Route: /v1/completions, Methods: POST
2024-10-23T19:31:14.411170893Z INFO 10-23 12:31:14 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [7]
2024-10-23T19:31:14.427917191Z INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on socket ('0.0.0.0', 8000) (Press CTRL+C to quit)
DEBUG 10-23 12:31:16 client.py:154] Heartbeat successful.
INFO:     127.0.0.6:60099 - "GET /health HTTP/1.1" 200 OK