vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: Random Output Generation with mistralai/Mixtral-8x22B-v0.1 #6305

Open · rajagond opened this issue 4 months ago

rajagond commented 4 months ago

I am trying to run inference with the mistralai/Mixtral-8x22B-v0.1 model, but it generates random output in an 8-way tensor-parallel setup. The configuration details are below; I suspect there may be an issue with the tokenizer.

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="mistralai/Mixtral-8x22B-v0.1", tensor_parallel_size=8, enforce_eager=True, load_format="auto")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
mgoin commented 4 months ago

Please share the output from collect_env.py
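
(If you don't have it locally, the script lives at the vLLM repo root; e.g. download https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py and run it with python collect_env.py.)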

rajagond commented 4 months ago

PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.0
Libc version: glibc-2.31

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1045-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 535.86.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8480C
Stepping: 8
CPU MHz: 2000.000
BogoMIPS: 4000.00
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 4.5 MiB
L1i cache: 3 MiB
L2 cache: 192 MiB
L3 cache: 210 MiB
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 avx512vbmi umip waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid cldemote movdiri movdir64b fsrm serialize amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] onnx==1.15.0
[pip3] onnxruntime-training==1.17.1
[pip3] pytorch-lightning==1.9.5
[pip3] torch==2.2.1
[pip3] torch-nebula==0.16.10
[pip3] torch-ort==1.17.0
[pip3] torchaudio==2.2.1+cu121
[pip3] torchdata==0.7.1
[pip3] torchmetrics==1.2.0
[pip3] torchsnapshot==0.1.0
[pip3] torchvision==0.17.1+cu121
[pip3] triton==2.2.0
[conda] magma-cuda121 2.6.1 1 pytorch
[conda] mkl 2022.2.1 pypi_0 pypi
[conda] mkl-include 2022.2.1 pypi_0 pypi
[conda] numpy 1.24.4 pypi_0 pypi
[conda] pytorch-lightning 1.9.5 pypi_0 pypi
[conda] torch 2.2.1 pypi_0 pypi
[conda] torch-nebula 0.16.10 pypi_0 pypi
[conda] torch-ort 1.17.0 pypi_0 pypi
[conda] torchaudio 2.2.1+cu121 pypi_0 pypi
[conda] torchdata 0.7.1 pypi_0 pypi
[conda] torchmetrics 1.2.0 pypi_0 pypi
[conda] torchsnapshot 0.1.0 pypi_0 pypi
[conda] torchvision 0.17.1+cu121 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  NODE  SYS   NODE  SYS   NODE  NODE  SYS   SYS   NODE  0-47          0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  NODE  SYS   NODE  SYS   NODE  NODE  SYS   SYS   NODE  0-47          0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  SYS   NODE  SYS   NODE  NODE  SYS   SYS   NODE  0-47          0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  SYS   NODE  SYS   NODE  NODE  SYS   SYS   NODE  0-47          0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   NODE  SYS   NODE  SYS   SYS   NODE  NODE  SYS   48-95         1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   NODE  SYS   NODE  SYS   SYS   NODE  NODE  SYS   48-95         1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   NODE  SYS   NODE  SYS   SYS   NODE  NODE  SYS   48-95         1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   NODE  SYS   NODE  SYS   SYS   NODE  NODE  SYS   48-95         1              N/A
NIC0  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     SYS   NODE  SYS   NODE  NODE  SYS   SYS   NODE
NIC1  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   X     SYS   NODE  SYS   SYS   NODE  NODE  SYS
NIC2  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  SYS   X     SYS   NODE  NODE  SYS   SYS   NODE
NIC3  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   NODE  SYS   X     SYS   SYS   NODE  NODE  SYS
NIC4  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  SYS   NODE  SYS   X     NODE  SYS   SYS   NODE
NIC5  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  SYS   NODE  SYS   NODE  X     SYS   SYS   NODE
NIC6  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   NODE  SYS   NODE  SYS   SYS   X     NODE  SYS
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   NODE  SYS   NODE  SYS   SYS   NODE  X     SYS
NIC8  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  SYS   NODE  SYS   NODE  NODE  SYS   SYS   X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8

simon-mo commented 4 months ago

You might need the instruction-tuned model instead of the base model: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
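
If you do switch to the instruct checkpoint, note that it expects Mistral's [INST] ... [/INST] chat format rather than raw text continuations. A minimal variant of the repro above (the prompt wrapping here is illustrative, not taken from the original script):

from vllm import LLM, SamplingParams

# Same setup as the repro above, but with the instruction-tuned checkpoint
# and an [INST]-formatted prompt.
llm = LLM(model="mistralai/Mixtral-8x22B-Instruct-v0.1",
          tensor_parallel_size=8, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["[INST] What is the capital of France? [/INST]"],
                       sampling_params)
print(outputs[0].outputs[0].text)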

JD-P commented 4 months ago

I'm having a similar issue, and I know it's not that I need the instruction-tuned model, because I recently built a synthetic dataset with the base model. For a prompt like:

I will never ever forget that in 2017 when Petscop 6 was written if your computer displayed comparable capabilities to GPT-2 it was considered epistemically permissible to conclude that your computer is supernaturally possessed and nobody seriously objected to this.

I'm getting nonsense outputs like:

gegründabile gegründ[control_234][control_129][control_152] Orts[control_230][control_137][control_131][control_234][control_129][control_152] Orts[control_230][control_137][control_131] Census Ham[control_11] memorial voegen[control_11] warehouse[control_297]ott[control_11]пня[control_536][control_11][control_260]cc[control_636][control_11][control_301]іль[control_286][control_11] XI[control_761] person[control_11][control_301][control_337][control_11]lte[control_421][control_11] memorial agrun[control_284][control_11] memorialemplates memorial[control_11] memorial[control_321][control_11]lte[control_260][control_11][control_368]Own[control_11] XIastic[control_11] memorial[control_492][control_740][control_11][control_485][control_11][control_485]ans[control_11] XIastic[control_11][control_368][control_379]E[control_11][control_265]ні[control_11] Quinn[control_490][control_11][control_301][control_319]elve[control_11][control_368] shirt[control_520][control_11] controversy[control_11] controversy 

I think this is related to a tokenizer update that Mistral recently pushed for the model:

https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/commit/56c5f47124733df7d4a5e2fc083c8ce9a950ab99
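
One way to test this hypothesis would be to pin the weights and tokenizer to the snapshot just before that commit; vLLM's LLM constructor accepts revision and tokenizer_revision for this. A sketch (the hash below is a placeholder, not a real commit; substitute the parent of the tokenizer-update commit linked above):

from vllm import LLM

# Placeholder: replace with the parent commit of the tokenizer-update
# commit on the Hugging Face repo.
PRE_UPDATE_COMMIT = "<parent-of-56c5f47>"

llm = LLM(model="mistralai/Mixtral-8x22B-v0.1",
          tensor_parallel_size=8,
          enforce_eager=True,
          revision=PRE_UPDATE_COMMIT,
          tokenizer_revision=PRE_UPDATE_COMMIT)

If sane output comes back with the pinned revision, that would confirm the tokenizer update as the culprit.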

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!