predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

Error while running the pre-built container using Podman #266

Open chaser06 opened 8 months ago

chaser06 commented 8 months ago

System Info:

Python: 3.11.5
CUDA: 12.2
GPU: 2x A100
Driver Version: 535.104.05

Command used

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data
sudo podman run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data:Z ghcr.io/predibase/lorax:latest --model-id $model

Error

2024-02-21T17:13:58.822983Z INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "151fb0325992", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-02-21T17:13:58.823128Z INFO download: lorax_launcher: Starting download process.
2024-02-21T17:14:02.413861Z INFO lorax_launcher: cli.py:109 Files are already present on the host. Skipping download.

2024-02-21T17:14:03.328414Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-02-21T17:14:03.329664Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-02-21T17:14:06.758870Z WARN lorax_launcher: __init__.py:61 Could not import Flash Attention enabled models: CUDA is not available

2024-02-21T17:14:06.761425Z WARN lorax_launcher: __init__.py:77 Could not import Mistral model: CUDA is not available

2024-02-21T17:14:06.805670Z WARN lorax_launcher: __init__.py:84 Could not import Mixtral model: CUDA is not available

2024-02-21T17:14:07.178398Z ERROR lorax_launcher: server.py:255 Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 299, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 251, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 321, in get_model
    raise NotImplementedError("Mistral model requires flash attention v2")
NotImplementedError: Mistral model requires flash attention v2

2024-02-21T17:14:08.236535Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):

  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 299, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 251, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 321, in get_model
    raise NotImplementedError("Mistral model requires flash attention v2")

NotImplementedError: Mistral model requires flash attention v2
 rank=0
2024-02-21T17:14:08.334432Z ERROR lorax_launcher: Shard 0 failed to start
2024-02-21T17:14:08.334467Z INFO lorax_launcher: Shutting down shards
Error: ShardCannotStart

Questions

Please let me know if any additional setup needs to be done before running this container. I have checked that nvcc is working properly on my system. If no other setup is needed, please help me fix this issue. I have tried the same command with different models and images, and I get the same error every time. If you need any more info, please let me know. Thank you!

tgaddair commented 8 months ago

Hey @chaser06, the most informative message is:

2024-02-21T17:14:06.758870Z WARN lorax_launcher: __init__.py:61 Could not import Flash Attention enabled models: CUDA is not available

This suggests that CUDA is not able to interface with the GPU device.

Can you share the output of running nvidia-smi? There may be an issue with the system configuration.

Also, please try running and sharing the output of:

python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.device_count())"
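
It can also help to run the same probe inside the container, to check whether the GPUs are visible there at all. A minimal sketch, assuming the image ships Python under /opt/conda and using the same GPU flag as the server command:

# Host-side: does torch see the GPUs outside the container?
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

# Container-side: override the launcher entrypoint and run the same probe
sudo podman run --rm --gpus all \
  --entrypoint /opt/conda/bin/python \
  ghcr.io/predibase/lorax:latest \
  -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"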
chaser06 commented 8 months ago

(screenshot of nvidia-smi output attached)

Outside the container everything looks good; I use this same system to load models without issues. Somehow the container is not able to access the GPUs.

leolivier commented 7 months ago

Hi, almost the same here with nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           Off |   00000000:13:00.0 Off |                    0 |
| N/A   55C    P0             40W /  250W |     771MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1272      C   /usr/local/bin/ollama                         768MiB |
+-----------------------------------------------------------------------------------------+

but the interesting message (as above) for me is "The installed version of bitsandbytes was compiled without GPU support. ":

2024-03-18T17:53:47.978444Z ERROR lorax_launcher: server.py:273 Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
    from lorax_server.models.flash_mistral import FlashMistral
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
    from lorax_server.models.custom_modeling.flash_mistral_modeling import (
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
    from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 10, in <module>
    raise ImportError("CUDA is not available")
ImportError: CUDA is not available

2024-03-18T17:53:49.127116Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
config.json: 100%|██████████| 571/571 [00:00<00:00, 1.42MB/s]
Traceback (most recent call last):

  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
    from lorax_server.models.flash_mistral import FlashMistral

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
    from lorax_server.models.custom_modeling.flash_mistral_modeling import (

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
    from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 10, in <module>
    raise ImportError("CUDA is not available")

ImportError: CUDA is not available
 rank=0
2024-03-18T17:53:49.222957Z ERROR lorax_launcher: Shard 0 failed to start
2024-03-18T17:53:49.223000Z  INFO lorax_launcher: Shutting down shards
chaser06 commented 7 months ago

> Hi, almost the same here with nvidia-smi: [...] but the interesting message (as above) for me is "The installed version of bitsandbytes was compiled without GPU support.": [...]

In my case, the issue was with the podman command I used; can you share the command you used to run the container?

leolivier commented 7 months ago

Same as above (I think we read the same article 😉), except that I didn't use sudo (rootless), I used podman's docker emulator, and I set TMPDIR and --tmpdir. What was your issue with the podman command?

chaser06 commented 7 months ago

> Same as above (I think we read the same article 😉), except that I didn't use sudo (rootless), I used podman's docker emulator, and I set TMPDIR and --tmpdir. What was your issue with the podman command?

Here is the command that worked for me:

sudo podman run -it --rm --security-opt=label=disable --device nvidia.com/gpu=all --shm-size 1g -p 8080:80 -v $volume:/data:Z ghcr.io/predibase/lorax:latest --model-id $model

Do let me know if it works for you. 😁
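
Once the container is up, a quick smoke test against the REST API (a sketch following the LoRAX docs; adjust the prompt and the mapped port if yours differ):

curl http://127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is multi-LoRA inference?", "parameters": {"max_new_tokens": 32}}'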

leolivier commented 7 months ago

@chaser06 With your adapted command (and w/o sudo), I get the following error:

$ podman run -it --rm --security-opt=label=disable --device nvidia.com/gpu=all --shm-size 1g -p 8080:80 -v /opt/lorax/data:/data:Z ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1
Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all
leolivier commented 7 months ago

I found on a French website (https://www.metal3d.org/blog/2023/podman-et-nvidia/) some settings so that podman can access the GPU:

# install nvidia container toolkit repo
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# then the toolkit
sudo dnf install nvidia-container-toolkit

# then create config file
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# fix SELinux issues
nvidia-container-cli -k list | sudo restorecon -v -f -
sudo restorecon -Rv /dev

and finally, if you want to run rootless, fix the content of /etc/nvidia-container-runtime/config.toml by changing the two lines below:

[nvidia-container-cli]
#no-cgroups = false
no-cgroups = true

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
debug = "~/.local/nvidia-container-runtime.log"
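
Before rerunning LoRAX it may be worth checking that the CDI spec is actually resolvable by podman. A small sketch, where the CUDA base image tag is just an example:

# List the CDI devices podman/nvidia-ctk can resolve
nvidia-ctk cdi list

# End-to-end check that a container can reach the GPU through CDI
podman run --rm --security-opt=label=disable --device nvidia.com/gpu=all \
  docker.io/nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi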

Will try that and tell you

chaser06 commented 7 months ago

> I found on a French website (https://www.metal3d.org/blog/2023/podman-et-nvidia/) some settings so that podman can access the GPU: [...] Will try that and tell you

SELinux is usually the culprit when you get permission errors, but here there is also a CUDA issue, so you need to install every required package (e.g. the CUDA toolkit) and check that your driver and CUDA versions are compatible. After fixing the environment for all the CUDA issues, I tried the above command and it worked. The only difference is that I ran it as the root user. Do let me know how you fix this.
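
For reference, a quick way to compare the driver and the installed toolkit on the host (a sketch; nvidia-smi reports the highest CUDA version the driver supports, while nvcc reports the locally installed toolkit):

# Driver version and the maximum CUDA version it supports
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | head -n 4

# Locally installed CUDA toolkit (what nvcc compiles against)
nvcc --version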

leolivier commented 7 months ago

I made the changes above and now I get the following errors. If I run podman directly with your parameters, I get ImportError: GPU with CUDA capability 7 0 is not supported for Flash Attention V2, although I have CUDA 12 (see nvidia-smi above):

>podman run -it --rm --security-opt=label=disable --device nvidia.com/gpu=all --shm-size 1g -p 8080:80 -v /opt/lorax/data:/data:Z ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1
2024-03-19T13:05:58.074158Z  INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "fa9dfeab772b", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-03-19T13:05:58.074370Z  INFO download: lorax_launcher: Starting download process.
2024-03-19T13:06:02.533122Z  INFO lorax_launcher: cli.py:110 Files are already present on the host. Skipping download.

2024-03-19T13:06:03.482529Z  INFO download: lorax_launcher: Successfully downloaded weights.
2024-03-19T13:06:03.482957Z  INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-03-19T13:06:08.462649Z ERROR lorax_launcher: server.py:273 Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 29, in <module>
    raise ImportError(
ImportError: GPU with CUDA capability 7 0 is not supported for Flash Attention V2

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
    from lorax_server.models.flash_mistral import FlashMistral
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
    from lorax_server.models.custom_modeling.flash_mistral_modeling import (
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
    from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 45, in <module>
    raise ImportError(
ImportError: GPU with CUDA capability 7 0 is not supported

2024-03-19T13:06:09.590738Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 29, in <module>
    raise ImportError(

ImportError: GPU with CUDA capability 7 0 is not supported for Flash Attention V2

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
    from lorax_server.models.flash_mistral import FlashMistral

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
    from lorax_server.models.custom_modeling.flash_mistral_modeling import (

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
    from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 45, in <module>
    raise ImportError(

ImportError: GPU with CUDA capability 7 0 is not supported
 rank=0
2024-03-19T13:06:09.690182Z ERROR lorax_launcher: Shard 0 failed to start
2024-03-19T13:06:09.690245Z  INFO lorax_launcher: Shutting down shards
Error: ShardCannotStart

and if I use the docker emulator, I always get the error "The installed version of bitsandbytes was compiled without GPU support.":

docker run --gpus all --shm-size 1g -p 8080:80 -v /opt/lorax/data:/data --tmpdir=/opt/lorax/tmp ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
2024-03-19T13:08:07.100243Z  INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "1835cc35f075", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-03-19T13:08:07.100396Z  INFO download: lorax_launcher: Starting download process.
2024-03-19T13:08:11.434525Z  INFO lorax_launcher: cli.py:110 Files are already present on the host. Skipping download.

2024-03-19T13:08:12.305634Z  INFO download: lorax_launcher: Successfully downloaded weights.
2024-03-19T13:08:12.306139Z  INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-03-19T13:08:17.173901Z ERROR lorax_launcher: server.py:273 Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
    from lorax_server.models.flash_mistral import FlashMistral
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
    from lorax_server.models.custom_modeling.flash_mistral_modeling import (
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
    from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 10, in <module>
    raise ImportError("CUDA is not available")
ImportError: CUDA is not available

2024-03-19T13:08:18.316835Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):

  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
    from lorax_server.models.flash_mistral import FlashMistral

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
    from lorax_server.models.custom_modeling.flash_mistral_modeling import (

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
    from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2

  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 10, in <module>
    raise ImportError("CUDA is not available")

ImportError: CUDA is not available
 rank=0
2024-03-19T13:08:18.410775Z ERROR lorax_launcher: Shard 0 failed to start
2024-03-19T13:08:18.410828Z  INFO lorax_launcher: Shutting down shards
Error: ShardCannotStart

but it appears when checking for Flash Attention V2: from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2

Any clue on the CUDA setup is welcome.

leolivier commented 7 months ago

BTW:

> python -c "import torch; print(torch.cuda.is_available())"
True
> python -c "import torch; print(torch.cuda.device_count())"
1

so I have CUDA working on the host.
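
A related check (a sketch): torch can print the GPU's compute capability, which is what the "CUDA capability 7 0" message refers to; a V100 is compute capability 7.0 regardless of the CUDA 12 toolkit installed on the host:

# Print the device name and its compute capability, e.g. ('Tesla V100-PCIE-32GB', (7, 0))
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"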