vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: Random Output Generation with mistralai/Mixtral-8x22B-v0.1 #6305

Open rajagond opened 2 months ago

rajagond commented 2 months ago

I am trying to run inference with the mistralai/Mixtral-8x22B-v0.1 model, but it generates random output with an 8-way tensor-parallel setup. The configuration details are below; I suspect there may be an issue with the tokenizer.

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="mistralai/Mixtral-8x22B-v0.1", tensor_parallel_size=8, enforce_eager=True, load_format="auto")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

mgoin commented 2 months ago

Please share the output from collect_env.py
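
For reference, vLLM ships the script at the repository root, so something like this should be enough (assuming a checkout of the repo, or the file downloaded from it):

# Run from the vLLM repository root; prints PyTorch/CUDA/driver
# versions, installed libraries, and the GPU topology.
python collect_env.py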

rajagond commented 2 months ago

PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.0
Libc version: glibc-2.31

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1045-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 535.86.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8480C
Stepping: 8
CPU MHz: 2000.000
BogoMIPS: 4000.00
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 4.5 MiB
L1i cache: 3 MiB
L2 cache: 192 MiB
L3 cache: 210 MiB
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 avx512vbmi umip waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid cldemote movdiri movdir64b fsrm serialize amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] onnx==1.15.0
[pip3] onnxruntime-training==1.17.1
[pip3] pytorch-lightning==1.9.5
[pip3] torch==2.2.1
[pip3] torch-nebula==0.16.10
[pip3] torch-ort==1.17.0
[pip3] torchaudio==2.2.1+cu121
[pip3] torchdata==0.7.1
[pip3] torchmetrics==1.2.0
[pip3] torchsnapshot==0.1.0
[pip3] torchvision==0.17.1+cu121
[pip3] triton==2.2.0
[conda] magma-cuda121 2.6.1 1 pytorch
[conda] mkl 2022.2.1 pypi_0 pypi
[conda] mkl-include 2022.2.1 pypi_0 pypi
[conda] numpy 1.24.4 pypi_0 pypi
[conda] pytorch-lightning 1.9.5 pypi_0 pypi
[conda] torch 2.2.1 pypi_0 pypi
[conda] torch-nebula 0.16.10 pypi_0 pypi
[conda] torch-ort 1.17.0 pypi_0 pypi
[conda] torchaudio 2.2.1+cu121 pypi_0 pypi
[conda] torchdata 0.7.1 pypi_0 pypi
[conda] torchmetrics 1.2.0 pypi_0 pypi
[conda] torchsnapshot 0.1.0 pypi_0 pypi
[conda] torchvision 0.17.1+cu121 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

GPU Topology:
      GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0  X    NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE SYS  NODE SYS  NODE NODE SYS  SYS  NODE 0-47         0             N/A
GPU1  NV18 X    NV18 NV18 NV18 NV18 NV18 NV18 NODE SYS  NODE SYS  NODE NODE SYS  SYS  NODE 0-47         0             N/A
GPU2  NV18 NV18 X    NV18 NV18 NV18 NV18 NV18 NODE SYS  NODE SYS  NODE NODE SYS  SYS  NODE 0-47         0             N/A
GPU3  NV18 NV18 NV18 X    NV18 NV18 NV18 NV18 NODE SYS  NODE SYS  NODE NODE SYS  SYS  NODE 0-47         0             N/A
GPU4  NV18 NV18 NV18 NV18 X    NV18 NV18 NV18 SYS  NODE SYS  NODE SYS  SYS  NODE NODE SYS  48-95        1             N/A
GPU5  NV18 NV18 NV18 NV18 NV18 X    NV18 NV18 SYS  NODE SYS  NODE SYS  SYS  NODE NODE SYS  48-95        1             N/A
GPU6  NV18 NV18 NV18 NV18 NV18 NV18 X    NV18 SYS  NODE SYS  NODE SYS  SYS  NODE NODE SYS  48-95        1             N/A
GPU7  NV18 NV18 NV18 NV18 NV18 NV18 NV18 X    SYS  NODE SYS  NODE SYS  SYS  NODE NODE SYS  48-95        1             N/A
NIC0  NODE NODE NODE NODE SYS  SYS  SYS  SYS  X    SYS  NODE SYS  NODE NODE SYS  SYS  NODE
NIC1  SYS  SYS  SYS  SYS  NODE NODE NODE NODE SYS  X    SYS  NODE SYS  SYS  NODE NODE SYS
NIC2  NODE NODE NODE NODE SYS  SYS  SYS  SYS  NODE SYS  X    SYS  NODE NODE SYS  SYS  NODE
NIC3  SYS  SYS  SYS  SYS  NODE NODE NODE NODE SYS  NODE SYS  X    SYS  SYS  NODE NODE SYS
NIC4  NODE NODE NODE NODE SYS  SYS  SYS  SYS  NODE SYS  NODE SYS  X    NODE SYS  SYS  NODE
NIC5  NODE NODE NODE NODE SYS  SYS  SYS  SYS  NODE SYS  NODE SYS  NODE X    SYS  SYS  NODE
NIC6  SYS  SYS  SYS  SYS  NODE NODE NODE NODE SYS  NODE SYS  NODE SYS  SYS  X    NODE SYS
NIC7  SYS  SYS  SYS  SYS  NODE NODE NODE NODE SYS  NODE SYS  NODE SYS  SYS  NODE X    SYS
NIC8  NODE NODE NODE NODE SYS  SYS  SYS  SYS  NODE SYS  NODE SYS  NODE NODE SYS  SYS  X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8

simon-mo commented 2 months ago

You might need the instruction-tuned model instead of the base model: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
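
If you switch, a minimal sketch of the change (same engine settings as the script above; Mistral's instruct models expect the [INST] ... [/INST] chat wrapping, so double-check the exact format on the model card):

from vllm import LLM, SamplingParams

# Instruction-tuned checkpoint in place of the base model.
llm = LLM(model="mistralai/Mixtral-8x22B-Instruct-v0.1",
          tensor_parallel_size=8, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Instruct-format prompt; the base-model prompts above are plain
# continuations instead.
outputs = llm.generate(["[INST] What is the capital of France? [/INST]"],
                       sampling_params)
print(outputs[0].outputs[0].text)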

JD-P commented 2 months ago

I'm having a similar issue, and I know it's not that I need the instruction-tuned model, because I recently made a synthetic dataset with the base model. For a prompt like:

I will never ever forget that in 2017 when Petscop 6 was written if your computer displayed comparable capabilities to GPT-2 it was considered epistemically permissible to conclude that your computer is supernaturally possessed and nobody seriously objected to this.

I'm getting nonsense outputs like:

gegründabile gegründ[control_234][control_129][control_152] Orts[control_230][control_137][control_131][control_234][control_129][control_152] Orts[control_230][control_137][control_131] Census Ham[control_11] memorial voegen[control_11] warehouse[control_297]ott[control_11]пня[control_536][control_11][control_260]cc[control_636][control_11][control_301]іль[control_286][control_11] XI[control_761] person[control_11][control_301][control_337][control_11]lte[control_421][control_11] memorial agrun[control_284][control_11] memorialemplates memorial[control_11] memorial[control_321][control_11]lte[control_260][control_11][control_368]Own[control_11] XIastic[control_11] memorial[control_492][control_740][control_11][control_485][control_11][control_485]ans[control_11] XIastic[control_11][control_368][control_379]E[control_11][control_265]ні[control_11] Quinn[control_490][control_11][control_301][control_319]elve[control_11][control_368] shirt[control_520][control_11] controversy[control_11] controversy 

I think this is related to a tokenizer update that Mistral recently made to the model:

https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/commit/56c5f47124733df7d4a5e2fc083c8ce9a950ab99
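
One quick way to test that theory (a sketch only, assuming you have access to the gated repo; the pre-update revision hash below is a placeholder you would need to look up, since only the post-update commit is linked above): encode the same text with the current tokenizer and with one pinned to a revision before the update, and compare.

from transformers import AutoTokenizer

text = "Hello, my name is"

# Tokenizer as it is on main today (includes the linked commit).
tok_new = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-v0.1")

# Tokenizer pinned to a revision before the update -- placeholder hash,
# substitute the parent of 56c5f47124733df7d4a5e2fc083c8ce9a950ab99.
tok_old = AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x22B-v0.1",
    revision="<pre-update-commit>",
)

ids_new = tok_new(text).input_ids
ids_old = tok_old(text).input_ids
print(ids_new == ids_old)

# Also worth eyeballing the token strings for stray [control_*] entries:
print(tok_new.convert_ids_to_tokens(ids_new))

If the two encodings differ, pinning vLLM to the older tokenizer (recent versions expose a tokenizer_revision argument on LLM) would be one way to confirm this at generation time.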