[Bug]: Mistral 7B crashes on NVidia Tesla P100 with a CUDA Error

oe3gwu commented 3 months ago

Your current environment

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04 LTS (x86_64)
GCC version: (Ubuntu 13.2.0-23ubuntu4) 13.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-31-generic-x86_64-with-glibc2.39
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
CPU family: 6
Model: 62
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
Stepping: 4
CPU(s) scaling MHz: 89%
CPU max MHz: 3100.0000
CPU min MHz: 1200.0000
BogoMIPS: 5199.89
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization: VT-x
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 3 MiB (12 instances)
L3 cache: 30 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      6-11,18-23      1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I wrote a simple docker-compose.yml that installs vLLM and downloads Mistral 7B. That part worked. The --dtype=half flag is needed for the P100. However, after Mistral is downloaded, the container crashes with a CUDA error. As far as I understand, CUDA is deployed within the container, so there is nothing I can do about it on the host side.

name: vllm
services:
    vllm-app:
        container_name: vllm-app
        runtime: nvidia
        deploy:
            resources:
                reservations:
                    devices:
                        - driver: nvidia
                          count: all
                          capabilities:
                              - gpu
        volumes:
            - ./vllm/cache/huggingface:/root/.cache/huggingface
        environment:
            - HUGGING_FACE_HUB_TOKEN=<your-hf-token>
        ports:
            - 8000:8000
        restart: unless-stopped
        pull_policy: always
        ipc: host
        image: vllm/vllm-openai:latest
        command: --model mistralai/Mistral-7B-v0.3 --dtype=half
        #command: --model facebook/opt-125m --enforce-eager
        #command: --model google/gemma-2b --dtype=float16
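
A quick way to tell whether the container survived startup is to send one request to the OpenAI-compatible endpoint that the vllm/vllm-openai image serves. The following is a minimal sketch using only the Python standard library. It assumes the 8000:8000 port mapping and the model name from the compose file above; the helper names build_completion_request and smoke_test are made up for illustration and are not part of vLLM.

```python
import json
import urllib.request

# Assumptions taken from the compose file: port mapping 8000:8000 and the model name.
BASE_URL = "http://localhost:8000"
MODEL = "mistralai/Mistral-7B-v0.3"


def build_completion_request(prompt, max_tokens=16, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(prompt="Hello", timeout=30):
    """Return the generated text, or None if the server cannot be reached."""
    request = build_completion_request(prompt)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)["choices"][0]["text"]
    except OSError:
        # URLError, connection refused, and timeouts are all OSError subclasses.
        return None
```

If smoke_test() returns None while `docker ps` shows the container restarting, the server most likely crashed during startup, which matches the behavior described here.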

Error Log:

docker.log

Running Mistral 7B using Ollama works fine.
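
Some context on why --dtype=half is needed: the Tesla P100 is a Pascal GPU with CUDA compute capability 6.0, and bfloat16 needs compute capability 8.0 or higher. The sketch below shows one way to check what a given card reports. The 7.0 threshold for prebuilt kernels is an assumption based on the minimum the vLLM installation docs listed around this time, and describe_gpu and detect_compute_capability are hypothetical helper names, not vLLM APIs.

```python
# Assumption: prebuilt vLLM binaries of this period targeted compute capability >= 7.0.
MIN_PREBUILT_CC = (7, 0)
# bfloat16 requires compute capability 8.0 (Ampere) or newer.
MIN_BF16_CC = (8, 0)


def describe_gpu(cc):
    """Summarize what a (major, minor) compute capability implies."""
    return {
        "prebuilt_kernels_likely": cc >= MIN_PREBUILT_CC,
        "bfloat16": cc >= MIN_BF16_CC,
        "suggested_dtype": "bfloat16" if cc >= MIN_BF16_CC else "half",
    }


def detect_compute_capability(device=0):
    """Return (major, minor) for a CUDA device, or None without PyTorch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device)
```

Calling describe_gpu((6, 0)) for the P100 suggests half precision and flags that prebuilt kernels are unlikely to cover the card, which is consistent with the container starting the download and then failing with a CUDA error.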

robertgshaw2-neuralmagic commented 3 months ago

We do not support P100

dirkson commented 1 month ago

> We do not support P100

I am confused at this reply and the closure of this ticket, as I believe there was (and is!) an open PR that adds support for this. https://github.com/vllm-project/vllm/pull/4409

Is there some non-obvious issue with the PR?

robertgshaw2-neuralmagic commented 1 month ago

> > We do not support P100
>
> I am confused at this reply and the closure of this ticket, as I believe there was (and is!) an open PR that adds support for this. #4409
>
> Is there some non-obvious issue with the PR?

The decision to not ship a P100 distribution was driven by 3 factors:

* shipping P100 increases our binary size, and we are already close to the PyPI limits as is
* we do not have access to P100 resources to run our CI and have no way to test
* it is relatively easy for users to build vLLM with support for P100, so there is a relatively painless workaround for motivated users

dirkson commented 1 month ago

> The decision to not ship a P100 distribution was driven by 3 factors:

Where did you have this discussion? It doesn't seem to be on the PR I linked, and I haven't been able to find it with a casual search.

> * shipping P100 increases our binary size and we are already close to the PyPI limits as is

The PR in question mentions https://github.com/pypi/support/issues/3792 , which appears to raise the limit from 100 MB to 400 MB. I think this is a solved issue?

> * we do not have access to P100 resources to run our CI and have no way to test

I am surprised that you don't have access to P100 hardware. As I understand it, it's literally the cheapest inference hardware at the moment. I'm sure that the community would respond if you requested access to someone else's P100s for testing.

> * it is relatively easy for users to build vLLM with support for P100 and so there is a relatively painless workaround for motivated users

Perhaps add this conclusion to your documentation? I had to hunt through numerous bug reports and find a PR to figure out how to potentially run vLLM. If the only (un-? semi-?) supported way to run it on common hardware is via a third-party repo, it seems reasonable to mention that in some detailed install documentation.

My apologies if this reply comes off a little grumpy. I get that our goals aren't really aligned here, and that P100 support seems silly from a business perspective, despite the relative accessibility of the hardware.

oe3gwu commented 1 month ago

I must agree with @dirkson. Documentation on how to compile vLLM yourself with P100 support is virtually nonexistent. And if it is so easy, why don't you simply do it in the project yourselves and mark the build as untested? If you don't have the hardware, there are people out there who can test it.

I want it for my private use, but this bug report actually made me switch to Ollama, because the argument that you don't have P100s is just a pseudo-argument. A 100 USD investment on eBay will get you one.

Also, I had entire schools (3 schools with approx. 40 cards each) that we have now set up with Ollama and open-webui, because vLLM (which would be the superior tech) simply can't support it. And I am just one person who would have needed that support.

Edit: if this post shows up a second time, I am sorry. I tried posting via email, but it had not gone through until now.

vllm-project / vllm #5219