Open chaser06 opened 8 months ago
Hey @chaser06, the most informative message is:
2024-02-21T17:14:06.758870Z WARN lorax_launcher: init.py:61 Could not import Flash Attention enabled models: CUDA is not available
This suggests that CUDA is not able to interface with the GPU device.
Can you share the output of running nvidia-smi? There may be an issue with the system configuration.
Also, please try running and sharing the output of:
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.device_count())"
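The two checks above can also be folded into one script and run inside the container. This is only a minimal sketch, assuming PyTorch is importable in the environment being tested; the helper name `cuda_report` is made up for illustration, while `torch.cuda.is_available()`, `torch.cuda.device_count()` and `torch.version.cuda` are standard PyTorch attributes.

```python
"""Collect basic CUDA visibility facts in one place (illustrative sketch)."""


def cuda_report():
    """Return a dict describing what PyTorch can see, or why it cannot."""
    try:
        import torch  # assumed to be installed in the environment under test
    except ImportError as exc:
        return {"torch_importable": False, "error": str(exc)}

    return {
        "torch_importable": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds)
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is set but `cuda_available` is False, that would point at the container not seeing the GPU rather than at a CPU-only PyTorch build.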
Outside the container everything looks good, I am using the same system to load models smoothly. The container is not able to access the GPUs somehow.
Hi, almost the same here with nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-PCIE-32GB Off | 00000000:13:00.0 Off | 0 |
| N/A 55C P0 40W / 250W | 771MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1272 C /usr/local/bin/ollama 768MiB |
+-----------------------------------------------------------------------------------------+
However, the interesting message for me (as above) is "The installed version of bitsandbytes was compiled without GPU support.":
2024-03-18T17:53:47.978444Z ERROR lorax_launcher: server.py:273 Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
from lorax_server.models.flash_mistral import FlashMistral
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
from lorax_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 10, in <module>
raise ImportError("CUDA is not available")
ImportError: CUDA is not available
2024-03-18T17:53:49.127116Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
config.json: 100%|██████████| 571/571 [00:00<00:00, 1.42MB/s]
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
from lorax_server.models.flash_mistral import FlashMistral
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
from lorax_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 10, in <module>
raise ImportError("CUDA is not available")
ImportError: CUDA is not available
rank=0
2024-03-18T17:53:49.222957Z ERROR lorax_launcher: Shard 0 failed to start
2024-03-18T17:53:49.223000Z INFO lorax_launcher: Shutting down shards
In my case, the issue was with the podman command I used. Can you share the command you used to run the container?
Same as above (I think we read the same article 😉) except I didn't use sudo (rootless) and I used the docker emulator of podman + I used TMPDIR and --tmpdir. What was your issue with the podman command?
I am mentioning the command that worked for me: sudo podman run -it --rm --security-opt=label=disable --device nvidia.com/gpu=all --shm-size 1g -p 8080:80 -v $volume:/data:Z ghcr.io/predibase/lorax:latest --model-id $model
Do let me know if it works for you.. 😁
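Before launching LoRAX itself, it can help to confirm that podman passes the GPU through at all. The sketch below builds such a smoke-test command, reusing only the `--security-opt` and `--device` flags from the command above. The `ubuntu` image and the function names are assumptions for illustration; `nvidia-smi -L` simply lists the GPUs visible inside the container.

```python
"""Build a podman smoke-test command for GPU passthrough (illustrative sketch)."""
import subprocess


def gpu_smoke_test_argv(image="ubuntu", device="nvidia.com/gpu=all"):
    """Return an argv that runs `nvidia-smi -L` in a throwaway container."""
    return [
        "podman", "run", "--rm",
        "--security-opt=label=disable",  # same SELinux option as the command above
        "--device", device,              # CDI device name used in the command above
        image,                           # assumption: any small base image should do
        "nvidia-smi", "-L",              # lists the GPUs visible inside the container
    ]


def run_smoke_test(argv):
    """Run the command and return its exit code (non-zero means the command failed)."""
    return subprocess.run(argv, check=False).returncode


if __name__ == "__main__":
    # Only print the command here; call run_smoke_test() to actually execute it.
    print(" ".join(gpu_smoke_test_argv()))
```

If this command fails, the problem is probably in the podman or NVIDIA container setup rather than in LoRAX.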
@chaser06 With your adapted command (and w/o sudo), I get the following error:
$ podman run -it --rm --security-opt=label=disable --device nvidia.com/gpu=all --shm-size 1g -p 8080:80 -v /opt/lorax/data:/data:Z ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1
Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all
I found on a French website (https://www.metal3d.org/blog/2023/podman-et-nvidia/) some settings so that podman can access the GPU:
# install nvidia container toolkit repo
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# then the toolkit
sudo dnf install nvidia-container-toolkit
# then create config file
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# fix SELinux issues
nvidia-container-cli -k list | sudo restorecon -v -f -
sudo restorecon -Rv /dev
and finally, if you want to run rootless, edit /etc/nvidia-container-runtime/config.toml and change the two lines below:
[nvidia-container-cli]
#no-cgroups = false
no-cgroups = true
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
debug = "~/.local/nvidia-container-runtime.log"
Will try that and tell you
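The "unresolvable CDI devices" error earlier suggests that podman could not find a CDI spec describing the GPU. A minimal sketch for checking this after running the steps above follows. It assumes the default CDI spec directories `/etc/cdi` and `/var/run/cdi`; the function name is made up for illustration, and `nvidia-ctk cdi list` is only called if the toolkit is installed.

```python
"""Check whether a CDI spec for NVIDIA GPUs is present (illustrative sketch)."""
import shutil
import subprocess
from pathlib import Path

# Assumed default CDI spec directories; the steps above write /etc/cdi/nvidia.yaml.
DEFAULT_CDI_DIRS = ("/etc/cdi", "/var/run/cdi")


def find_cdi_specs(directories=DEFAULT_CDI_DIRS):
    """Return the CDI spec files (*.yaml, *.json) found in the given directories."""
    specs = []
    for directory in directories:
        path = Path(directory)
        if path.is_dir():
            specs.extend(
                sorted(p for p in path.iterdir() if p.suffix in {".yaml", ".json"})
            )
    return specs


if __name__ == "__main__":
    specs = find_cdi_specs()
    if not specs:
        print("No CDI spec found; nvidia.com/gpu=all will likely stay unresolvable.")
    for spec in specs:
        print(f"Found CDI spec: {spec}")
    # 'nvidia-ctk cdi list' prints the device names the toolkit can resolve.
    if shutil.which("nvidia-ctk"):
        subprocess.run(["nvidia-ctk", "cdi", "list"], check=False)
```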
SELinux is usually the main culprit when you get permission errors, but here the problem is CUDA, so you need to install every required package (such as cudatoolkit) and check that your driver and CUDA versions are compatible. After fixing the environment for all the CUDA issues, I tried the above command and it worked. The only difference is that I ran it as the root user. Do let me know how you fix this.
I made the changes above and now I get the following errors:
If I use podman directly with your parameters, I get "ImportError: GPU with CUDA capability 7 0 is not supported for Flash Attention V2",
although I have CUDA 12 (see nvidia-smi above):
>podman run -it --rm --security-opt=label=disable --device nvidia.com/gpu=all --shm-size 1g -p 8080:80 -v /opt/lorax/data:/data:Z ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1
2024-03-19T13:05:58.074158Z INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "fa9dfeab772b", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-03-19T13:05:58.074370Z INFO download: lorax_launcher: Starting download process.
2024-03-19T13:06:02.533122Z INFO lorax_launcher: cli.py:110 Files are already present on the host. Skipping download.
2024-03-19T13:06:03.482529Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-03-19T13:06:03.482957Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-03-19T13:06:08.462649Z ERROR lorax_launcher: server.py:273 Error when initializing model
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 29, in <module>
raise ImportError(
ImportError: GPU with CUDA capability 7 0 is not supported for Flash Attention V2
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
from lorax_server.models.flash_mistral import FlashMistral
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
from lorax_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 45, in <module>
raise ImportError(
ImportError: GPU with CUDA capability 7 0 is not supported
2024-03-19T13:06:09.590738Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 29, in <module>
raise ImportError(
ImportError: GPU with CUDA capability 7 0 is not supported for Flash Attention V2
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
from lorax_server.models.flash_mistral import FlashMistral
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
from lorax_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 45, in <module>
raise ImportError(
ImportError: GPU with CUDA capability 7 0 is not supported
rank=0
2024-03-19T13:06:09.690182Z ERROR lorax_launcher: Shard 0 failed to start
2024-03-19T13:06:09.690245Z INFO lorax_launcher: Shutting down shards
Error: ShardCannotStart
And if I use the docker emulator, I always get the error "The installed version of bitsandbytes was compiled without GPU support.":
docker run --gpus all --shm-size 1g -p 8080:80 -v /opt/lorax/data:/data --tmpdir=/opt/lorax/tmp ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
2024-03-19T13:08:07.100243Z INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "1835cc35f075", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-03-19T13:08:07.100396Z INFO download: lorax_launcher: Starting download process.
2024-03-19T13:08:11.434525Z INFO lorax_launcher: cli.py:110 Files are already present on the host. Skipping download.
2024-03-19T13:08:12.305634Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-03-19T13:08:12.306139Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-03-19T13:08:17.173901Z ERROR lorax_launcher: server.py:273 Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
from lorax_server.models.flash_mistral import FlashMistral
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
from lorax_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 10, in <module>
raise ImportError("CUDA is not available")
ImportError: CUDA is not available
2024-03-19T13:08:18.316835Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 317, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 269, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
from lorax_server.models.flash_mistral import FlashMistral
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 21, in <module>
from lorax_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 32, in <module>
from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/flash_attn.py", line 10, in <module>
raise ImportError("CUDA is not available")
ImportError: CUDA is not available
rank=0
2024-03-19T13:08:18.410775Z ERROR lorax_launcher: Shard 0 failed to start
2024-03-19T13:08:18.410828Z INFO lorax_launcher: Shutting down shards
Error: ShardCannotStart
but it appears when checking Attention V2:
from lorax_server.utils.flash_attn import HAS_FLASH_ATTN_V2
Any clue about the CUDA setup is welcome.
BTW:
> python -c "import torch; print(torch.cuda.is_available())"
True
> python -c "import torch; print(torch.cuda.device_count())"
1
so I have cuda working
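The number in "CUDA capability 7 0" is the GPU's compute capability (7.0 for the Tesla V100 shown above). It is a property of the hardware and is separate from the "CUDA Version: 12.4" that nvidia-smi reports. A minimal sketch for reading it with PyTorch follows. The thresholds are assumptions taken from the upstream FlashAttention project's stated hardware requirements (Ampere or newer for v2, Turing or newer for v1), not from the LoRAX source, and the function name is made up for illustration.

```python
"""Map a CUDA compute capability to FlashAttention support (illustrative sketch)."""

# Assumed minimum compute capabilities, based on the upstream FlashAttention
# hardware notes rather than on the LoRAX source code.
MIN_FLASH_ATTN_V2 = (8, 0)  # Ampere and newer
MIN_FLASH_ATTN_V1 = (7, 5)  # Turing and newer


def flash_attn_support(major, minor):
    """Return which FlashAttention generations a (major, minor) capability can run."""
    capability = (major, minor)
    return {
        "v1": capability >= MIN_FLASH_ATTN_V1,
        "v2": capability >= MIN_FLASH_ATTN_V2,
    }


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"Compute capability {major}.{minor}: {flash_attn_support(major, minor)}")
    else:
        print("PyTorch cannot see a CUDA device in this environment.")
```

Under these assumptions, `flash_attn_support(7, 0)` reports no support for either generation, which is consistent with the two ImportErrors in the podman log above.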
System Info:
Python - 3.11.5
Cuda - 12.2
GPU: A100, Driver Version: 535.104.05
GPU - 2
Command used
model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data
sudo podman run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data:Z ghcr.io/predibase/lorax:latest --model-id $model
Error
2024-02-21T17:13:58.822983Z INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "151fb0325992", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-02-21T17:13:58.823128Z INFO download: lorax_launcher: Starting download process.
2024-02-21T17:14:02.413861Z INFO lorax_launcher: cli.py:109 Files are already present on the host. Skipping download.
2024-02-21T17:14:03.328414Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-02-21T17:14:03.329664Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-02-21T17:14:06.758870Z WARN lorax_launcher: init.py:61 Could not import Flash Attention enabled models: CUDA is not available
2024-02-21T17:14:06.761425Z WARN lorax_launcher: init.py:77 Could not import Mistral model: CUDA is not available
2024-02-21T17:14:06.805670Z WARN lorax_launcher: init.py:84 Could not import Mixtral model: CUDA is not available
2024-02-21T17:14:07.178398Z ERROR lorax_launcher: server.py:255 Error when initializing model Traceback (most recent call last): File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 299, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
2024-02-21T17:14:08.236535Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 299, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 251, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 321, in get_model
raise NotImplementedError("Mistral model requires flash attention v2")
NotImplementedError: Mistral model requires flash attention v2
rank=0
2024-02-21T17:14:08.334432Z ERROR lorax_launcher: Shard 0 failed to start
2024-02-21T17:14:08.334467Z INFO lorax_launcher: Shutting down shards
Error: ShardCannotStart
Questions
Please let me know if any additional setup is needed before running this container. I have checked that nvcc works properly on my system. If no other setup is needed, please help me fix this issue. I have tested the same commands with different models and images, and I get the same error every time. If you need any more information, please let me know. Thank you!
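Since every warning in the log above comes down to "CUDA is not available", it may help to look at what a process inside the container can actually see. The following is only a minimal sketch of such a check, not part of LoRAX: the function name `gpu_visibility_report` is hypothetical, and it assumes that NVIDIA device nodes appear under `/dev/nvidia*` and that `NVIDIA_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES` are the relevant environment variables. It queries PyTorch only if it is importable.

```python
import glob
import importlib.util
import os
import shutil


def gpu_visibility_report():
    """Collect basic facts about the GPU resources visible to this process.

    Hypothetical helper for illustration; not a LoRAX or PyTorch API.
    """
    report = {
        # Device nodes are normally injected by the container runtime (assumption).
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # Full path to nvidia-smi if it is on PATH, otherwise None.
        "nvidia_smi": shutil.which("nvidia-smi"),
        # Environment variables that commonly control GPU visibility.
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Filled in below only when torch can be imported.
        "torch_cuda_available": None,
        "torch_device_count": None,
    }
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch
        except ImportError:
            # torch is present but broken; leave the torch fields as None.
            return report
        report["torch_cuda_available"] = torch.cuda.is_available()
        report["torch_device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If `device_nodes` comes back empty inside the container while the host lists the GPU in `nvidia-smi`, that would point at the container runtime not passing the device through rather than at the model or image.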