mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW #144

Open guidoveritone opened 7 months ago

guidoveritone commented 7 months ago

Hey guys, I am trying to run the Mistral 7B model using the guide on the page.

I am running:

docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2

and I am getting the following error:

└─$ docker run --gpus '"device=0"' -e HF_TOKEN=$HF_TOKEN -p 8000:8000 ghcr.io/mistralai/mistral-src/vllm:latest --host 0.0.0.0 --model mistralai/Mistral-7B-Instruct-v0.2
The HF_TOKEN environment variable set, logging to Hugging Face.
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
INFO 04-15 15:25:32 api_server.py:719] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
config.json: 100%|██████████| 596/596 [00:00<00:00, 6.74MB/s]
INFO 04-15 15:25:33 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
tokenizer_config.json: 100%|██████████| 1.46k/1.46k [00:00<00:00, 19.4MB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 9.14MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 3.16MB/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 953kB/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers(distributed_init_method)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 141, in _init_workers
    self._run_workers(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 59, in init_model
    torch.cuda.set_device(self.device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW

I tried several things to fix this, following suggestions I found elsewhere, and nothing worked. I also tried some default NVIDIA containers to check whether CUDA is working, and everything seems fine there.
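
For reference, here is a minimal sketch of the kind of in-container check that isolates PyTorch's view of CUDA; the --entrypoint override and the assumption that python3 and torch are available inside the image are mine, not something I verified against the image:

# Bypass the API server entrypoint and ask PyTorch what it sees
# (entrypoint name and image contents are assumptions)
docker run --rm --gpus all --entrypoint python3 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    -c "import torch; print(torch.version.cuda); print(torch.cuda.device_count())"

If torch.cuda.device_count() raises the same error 804, that would point at how the container sees the driver rather than at vLLM itself.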

my nvidia-smi output:

└─$ nvidia-smi
Mon Apr 15 12:29:36 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:09:00.0  On |                  N/A |
| 30%   45C    P0    58W / 170W |    490MiB / 12288MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1800      G   /usr/lib/xorg/Xorg                179MiB |
|    0   N/A  N/A      1999      G   /usr/bin/gnome-shell               47MiB |
|    0   N/A  N/A      2214      G   /usr/bin/nvidia-settings            0MiB |
|    0   N/A  N/A      2757      G   ...--variations-seed-version       45MiB |
|    0   N/A  N/A      2900      G   ...b/firefox-esr/firefox-esr      114MiB |
|    0   N/A  N/A      3904      G   ...on=20240414-180149.278000       98MiB |
+-----------------------------------------------------------------------------+
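
Since the error message mentions forward compatibility, and as far as I understand forward compatibility is only supported on data-center GPUs (not GeForce cards like this RTX 3060), one thing worth checking is whether any CUDA forward-compatibility libraries are being picked up. A rough sketch, with package and path names assumed for a Debian-based host:

# Look for CUDA forward-compatibility packages/libraries on the host
# (names and paths are assumptions; a compat libcuda could also be
# shipped inside the container image, which I have not inspected)
dpkg -l | grep -i cuda-compat
ls /usr/local/cuda*/compat 2>/dev/null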

my /etc/nvidia-container-runtime/config.toml file:

#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

Note: if I change the no-cgroups flag to true, I get a "No CUDA GPUs available" error instead.
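
A minimal sketch of re-registering the NVIDIA runtime with Docker via nvidia-ctk, in case that config was ever edited by hand; note this updates Docker's daemon.json rather than the config.toml above, and I am not sure it changes anything here:

# Re-register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker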

OS:

└─$ neofetch
guido@kali
----------
OS: Kali GNU/Linux Rolling x86_64
Kernel: 6.6.9-amd64
Uptime: 37 mins
Packages: 2981 (dpkg), 12 (snap)
Shell: bash 5.2.21
Resolution: 1920x1080, 1920x1080
DE: GNOME 45.3
WM: Mutter
WM Theme: Kali-Purple-Dark
Theme: Kali-Purple-Dark [GTK2/3]
Icons: Flat-Remix-Blue-Light [GTK2/3]
Terminal: terminator
CPU: AMD Ryzen 7 5800X (16) @ 4.200GHz
GPU: NVIDIA GeForce RTX 3060 Lite Hash Rate
Memory: 4669MiB / 32013MiB

guidoveritone commented 7 months ago

Also, if I run an example CUDA container to check that everything works... everything works:

└─$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Mon Apr 15 15:53:55 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:09:00.0  On |                  N/A |
| 30%   45C    P0    57W / 170W |    383MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
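
That said, nvidia-smi only talks to the driver, so the ubuntu test above never actually initializes CUDA. A sketch of a test that does (the pytorch/pytorch image tag is an assumption, not something tied to this repo):

# Initialize CUDA through PyTorch inside a container that ships its own
# CUDA runtime; is_available() returns False (with a warning) on failure
sudo docker run --rm --gpus all pytorch/pytorch:latest \
    python -c "import torch; print(torch.cuda.is_available())"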
kubeqer commented 3 months ago

I have a similar issue. Have you managed to solve it somehow?