nixified-ai / flake

A Nix flake for many AI projects
GNU Affero General Public License v3.0

textgen-nvidia on WSL no longer sees GPU #90

Open alexvorobiev opened 8 months ago

alexvorobiev commented 8 months ago

I have tried to run the flake on NixOS under WSL, and it looks like PyTorch no longer recognizes the GPU (RTX 3070). It used to work a few months ago; nvtop works as expected.

$ nix run --impure github:nixified-ai/flake#textgen-nvidia
Running via WSL (Windows Subsystem for Linux), setting LD_LIBRARY_PATH
+ export LD_LIBRARY_PATH=/usr/lib/wsl/lib
+ LD_LIBRARY_PATH=/usr/lib/wsl/lib
+ set +x
False
/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
2024-03-20 01:17:09 ERROR:Could not find the character "Alpaca" inside instruction-templates/. No character has been loaded.
Traceback (most recent call last):
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/gradio/routes.py", line 414, in run_predict
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/gradio/blocks.py", line 1051, in call_function
    prediction = await anyio.to_thread.run_sync(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2106, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 833, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/1zzqmn5cl5c8dcbv37xp8xvvii892015-textgen-patchedSrc/modules/chat.py", line 561, in load_character
    raise ValueError
ValueError
2024-03-20 01:17:43 INFO:Loading HuggingFaceH4_zephyr-7b-beta...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 8/8 [38:34<00:00, 289.36s/it]
2024-03-20 01:56:18 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "/nix/store/1zzqmn5cl5c8dcbv37xp8xvvii892015-textgen-patchedSrc/modules/ui_model_menu.py", line 201, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/1zzqmn5cl5c8dcbv37xp8xvvii892015-textgen-patchedSrc/modules/models.py", line 79, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/1zzqmn5cl5c8dcbv37xp8xvvii892015-textgen-patchedSrc/modules/models.py", line 141, in huggingface_loader
    model = model.cuda()
            ^^^^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2243, in cuda
    return super().cuda(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 905, in cuda
    return self._apply(lambda t: t.cuda(device))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 905, in <lambda>
    return self._apply(lambda t: t.cuda(device))
                                 ^^^^^^^^^^^^^^
  File "/nix/store/7hpffz24mjm12y5ymd2is43lxl7nf27b-python3-3.11.6-env/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
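The False near the top of the log is presumably a torch.cuda.is_available() check. Below is a minimal diagnostic sketch, not part of the flake, for narrowing down where the driver stops being visible; it assumes it is run with the same Python environment the flake provides and with LD_LIBRARY_PATH=/usr/lib/wsl/lib exported:

# Hypothetical diagnostic; run inside the flake's Python environment.
import ctypes
import torch

print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())

# If the line above prints False, check whether the WSL driver stub even
# loads; an OSError here means the loader cannot see /usr/lib/wsl/lib at all.
try:
    ctypes.CDLL("libcuda.so.1")
    print("libcuda.so.1 loaded")
except OSError as e:
    print("libcuda.so.1 not loadable:", e)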
MatthewCroughan commented 8 months ago

Did WSL stop exposing the NVIDIA drivers via /usr/lib/wsl/lib? Did you update the NVIDIA driver on the Windows installation? I need a bit more information.

alexvorobiev commented 8 months ago

WSL does expose the drivers (I mentioned that nvtop works):

$ ls -l /usr/lib/wsl/lib
total 224076
-r-xr-xr-x 1 root root  10524136 Feb 15 10:42 libcudadebugger.so.1
-r-xr-xr-x 1 root root    162552 Mar  1 17:04 libcuda.so
-r-xr-xr-x 1 root root    162552 Mar  1 17:04 libcuda.so.1
-r-xr-xr-x 1 root root    162552 Mar  1 17:04 libcuda.so.1.1
-r-xr-xr-x 1 root root   6880344 Oct 20 00:13 libd3d12core.so
-r-xr-xr-x 1 root root    801840 Oct 20 00:13 libd3d12.so
-r-xr-xr-x 1 root root    829248 Jun  1  2022 libdxcore.so
-r-xr-xr-x 1 root root  11742584 Mar  1 17:04 libnvcuvid.so
-r-xr-xr-x 1 root root  11742584 Mar  1 17:04 libnvcuvid.so.1
-r-xr-xr-x 1 root root 115888416 Feb 15 10:42 libnvdxdlkernels.so
-r-xr-xr-x 1 root root    572008 Mar  1 17:04 libnvidia-encode.so
-r-xr-xr-x 1 root root    572008 Mar  1 17:04 libnvidia-encode.so.1
-r-xr-xr-x 1 root root    244344 Feb 15 10:42 libnvidia-ml.so.1
-r-xr-xr-x 1 root root    362960 Mar  1 17:04 libnvidia-opticalflow.so
-r-xr-xr-x 1 root root    362960 Mar  1 17:04 libnvidia-opticalflow.so.1
-r-xr-xr-x 1 root root     72656 Feb 15 10:42 libnvoptix.so.1
-r-xr-xr-x 1 root root  67625384 Mar  1 17:04 libnvwgf2umx.so
-r-xr-xr-x 1 root root    715296 Mar  1 17:04 nvidia-smi
$ env NIXPKGS_ALLOW_UNFREE=1 LD_LIBRARY_PATH=/usr/lib/wsl/lib nix run --impure nixpkgs#nvtop
 Device 0 [NVIDIA GeForce RTX 3070] PCIe GEN 4@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
 GPU 1815MHz MEM 7000MHz TEMP  35°C FAN   0% POW  35 / 240 W
 GPU[                                 0%] MEM[||||||||            1.998Gi/8.000Gi]
(nvtop setup screen omitted; it lists the NVIDIA GeForce RTX 3070 as the monitored GPU)

The Windows driver is 551.76

MatthewCroughan commented 8 months ago

@alexvorobiev you mentioned that it "no longer" sees the GPU. Nix isn't defining your NVIDIA GPU; it is an impurity. So I need to know whether you upgraded the NVIDIA driver manually on the Windows host, and if so, what the version jump was.
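For reference, one way to read the host driver version from inside WSL is to call the nvidia-smi binary that WSL mounts; this is only a sketch, assuming the standard /usr/lib/wsl/lib mount:

# Sketch: query the Windows host driver version via the WSL-mounted nvidia-smi.
import subprocess

out = subprocess.run(
    ["/usr/lib/wsl/lib/nvidia-smi",
     "--query-gpu=driver_version,name",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. "551.76, NVIDIA GeForce RTX 3070"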

alexvorobiev commented 8 months ago

It is hard to say what the jump was. The drivers are updated by the NVIDIA GeForce Experience app, which offers new drivers fairly often. This PC is also used for gaming, so I update the drivers whenever updates become available in the app. Today it offers 551.86. Is this message the source of the problem: "The installed version of bitsandbytes was compiled without GPU support."?
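The bitsandbytes warning refers to how that library was built, while the RuntimeError in the log comes from torch._C._cuda_init() failing to find the driver, so the two may be separate problems. A minimal check that looks at torch on its own, as a hypothetical diagnostic assuming the flake's Python environment:

# Hypothetical check: confirm LD_LIBRARY_PATH reaches the launched process and
# whether torch can see the driver, independently of the bitsandbytes warning.
import os
import torch

print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH"))
print("torch sees driver:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))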