Open YuWang916 opened 3 months ago
This is a good issue to work on!
@YuWang916 I had actually implemented support for this in the engine at some point, but wasn't sure whether there was any interest in this feature beyond the support in the OpenAI entrypoint. I can try to dig out those commits - the code on main has probably changed a bit in the meantime, so it might need some work to rebase.
@tdoublep Thank you! I currently have a workaround: I tokenize and truncate beforehand, then pass the result to the prompt_token_ids parameter. But it would be nice to have this functionality in the vLLM engine!
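The workaround described above can be sketched as follows. This is a hypothetical illustration, not code from vLLM: the helper mirrors the left-truncation semantics of the OpenAI entrypoint's truncate_prompt_tokens (keep the last N tokens), and the engine call shown in the trailing comment assumes the vLLM 0.4.x `LLM.generate(prompt_token_ids=...)` API.

```python
# Minimal sketch of the manual-truncation workaround (hypothetical helper).
# In a real run, token_ids would come from the model's tokenizer, e.g.
# tokenizer.encode(prompt); here we use stand-in integer IDs.

def truncate_prompt_ids(token_ids, max_len):
    """Keep at most max_len tokens, dropping from the left so the most
    recent context survives (matching the OpenAI entrypoint's
    truncate_prompt_tokens behavior). Pass max_len=None to disable."""
    if max_len is None or len(token_ids) <= max_len:
        return list(token_ids)
    return list(token_ids[-max_len:])

# Example with stand-in token IDs:
ids = list(range(10))
print(truncate_prompt_ids(ids, 4))  # the last 4 token IDs survive

# The truncated IDs can then be fed straight to the engine, e.g.:
#   outputs = llm.generate(
#       prompt_token_ids=[truncate_prompt_ids(ids, max_len)],
#       sampling_params=SamplingParams(max_tokens=64),
#   )
```

Dropping from the left rather than the right is a design choice: for chat-style prompts the most recent context usually matters most, which is why the OpenAI server truncates that way; use `token_ids[:max_len]` instead if the beginning of the prompt should survive.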
I think #3512 should make this easier by using the same tokenizer for LLMEngine and OpenAIServing.
Your current environment
Collecting environment information...
PyTorch version: 2.2.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: CBL-Mariner/Linux (x86_64)
GCC version: (GCC) 11.2.0
Clang version: Could not collect
CMake version: version 3.21.4
Libc version: glibc-2.35

Python version: 3.10.2 (main, Feb 22 2024, 00:00:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.138.1-4.cm2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 525.85.12
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.9.5
/usr/lib/libcudnn_adv_infer.so.8.9.5
/usr/lib/libcudnn_adv_train.so.8.9.5
/usr/lib/libcudnn_cnn_infer.so.8.9.5
/usr/lib/libcudnn_cnn_train.so.8.9.5
/usr/lib/libcudnn_ops_infer.so.8.9.5
/usr/lib/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7763 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3529.0520
CPU min MHz: 1500.0000
BogoMIPS: 4899.80
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 64 MiB (128 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flake8==4.0.1.1
[pip3] flake8-annotations-complexity==0.0.6.2
[pip3] flake8-bugbear==20.1.4
[pip3] flake8-builtins==1.4.2
[pip3] flake8-pie==0.5.0.1
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.24.3
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] pytorch-lightning==2.2.3
[pip3] torch==2.2.1+cu118
[pip3] torch-lib==0.1.25
[pip3] torchmetrics==1.3.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    48-63,176-191   3
GPU1    NV12     X      NV12    NV12    48-63,176-191   3
GPU2    NV12    NV12     X      NV12    16-31,144-159   1
GPU3    NV12    NV12    NV12     X      80-95,208-223   5
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
Below is the code, run on one node with 4 A100 GPUs:
Output:
I also searched the vLLM codebase; truncate_prompt_tokens does not appear to exist in the vLLM engine code. Is there a workaround for this? Do I have to truncate manually before running the vLLM engine?
Thanks in advance!