I don't know if bitsandbytes will work with Mixtral. That would be a question for the bitsandbytes team.
I recommend trying AWQ instead once it's ready: https://github.com/casper-hansen/AutoAWQ/issues/259
Or you can try GPTQ: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
Or GGUF: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
So there are many options.
For bitsandbytes, try 4-bit instead of 8-bit:
https://huggingface.co/blog/mixtral#load-mixtral-with-4-bit-quantization
And you can try the different low_bit_mode values to see what works best. The equivalent to the blog above would be --low_bit_mode=1 --load_4bit=True
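For concreteness, a minimal sketch of how those flags would slot into the same kind of docker command used in this thread (same image, mounts trimmed to the essentials; adjust to match your own setup):

# 4-bit bitsandbytes load of Mixtral, roughly equivalent to the HF blog settings
docker run \
--gpus all \
--runtime=nvidia \
--shm-size=2g \
--rm --init \
--network host \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:/workspace/.cache \
-v "${HOME}"/save:/workspace/save \
gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
--base_model=mistralai/Mixtral-8x7B-Instruct-v0.1 \
--prompt_type=mistral \
--save_dir='/workspace/save/' \
--use_gpu_id=False \
--low_bit_mode=1 \
--load_4bit=True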
low_bit_mode=1 and load_4bit=True fixed my issue, thank you!!
That said, when I try to load the GPTQ model it just fills up a single one of my 4 GPUs. What are the correct docker run settings to balance it, or can I specify the amount of VRAM per GPU?
(Sorry for the dumb questions, and thank you for all of the suggestions, hard work, and help.)
I want to try the GPTQ or GGUF versions but I can't figure out how to get them to load. The GGUF errors out as below, and the GPTQ fills up one of my 4 GPUs and errors with OOM.
Command:
sudo docker run \
--gpus all \
--runtime=nvidia \
--shm-size=2g \
--rm --init \
--network host \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:/workspace/.cache \
-v "${HOME}"/save:/workspace/save \
-v "${HOME}"/user_path:/workspace/user_path \
-v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
-v "${HOME}"/users:/workspace/users \
-v "${HOME}"/db_nonusers:/workspace/db_nonusers \
-v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
-v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
--base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF \
--use_safetensors=True \
--prompt_type=mistral \
--save_dir='/workspace/save/' \
--auth_filename='/workspace/h2ogpt_auth/auth.json' \
--use_gpu_id=False
Traceback (most recent call last):
File "/workspace/generate.py", line 16, in
FYI, --use_gpu_id has no effect for GGUF (llama.cpp) models, as they have no multi-GPU control. The only way to specify which GPUs to use is to set CUDA_VISIBLE_DEVICES, as described in FAQ.md, the README for Windows, and in the UI side panel for the Models tab. This is not great, because then the entire h2oGPT is limited like that, but there's no work-around AFAIK until they add that feature.
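For example, a sketch of that workaround (the GPU indices here are just an example):

# limit the entire h2oGPT process (not just the GGUF model) to GPUs 0 and 1
export CUDA_VISIBLE_DEVICES=0,1
python generate.py --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF --prompt_type=mistral

For the docker runs below, the same thing can be done by adding -e CUDA_VISIBLE_DEVICES=0,1 to the docker run line.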
I don't have any issue with Mixtral using GGUF in docker with latest image:
mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
export GRADIO_SERVER_PORT=7860
docker run \
--gpus all \
--runtime=nvidia \
--shm-size=2g \
-p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
--rm --init \
--network host \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:/workspace/.cache \
-v "${HOME}"/save:/workspace/save \
-v "${HOME}"/user_path:/workspace/user_path \
-v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
-v "${HOME}"/users:/workspace/users \
-v "${HOME}"/db_nonusers:/workspace/db_nonusers \
-v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
-v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
--base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF \
--prompt_type=mistral \
--save_dir='/workspace/save/' \
--use_gpu_id=False \
--user_path=/workspace/user_path \
--langchain_mode="LLM" \
--langchain_modes="['UserData', 'LLM']" \
--score_model=None \
--max_max_new_tokens=2048 \
--max_new_tokens=1024
With HF GPTQ, using the latest transformers==4.36.1 and the latest auto-gptq 0.6.0:
mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
export GRADIO_SERVER_PORT=7860
docker run \
--gpus all \
--runtime=nvidia \
--shm-size=2g \
-p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
--rm --init \
--network host \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:/workspace/.cache \
-v "${HOME}"/save:/workspace/save \
-v "${HOME}"/user_path:/workspace/user_path \
-v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
-v "${HOME}"/users:/workspace/users \
-v "${HOME}"/db_nonusers:/workspace/db_nonusers \
-v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
-v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
--base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
--prompt_type=mistral \
--save_dir='/workspace/save/' \
--use_gpu_id=False \
--user_path=/workspace/user_path \
--langchain_mode="LLM" \
--langchain_modes="['UserData', 'LLM']" \
--score_model=None \
--max_max_new_tokens=2048 \
--max_new_tokens=1024
I hit:
File "/workspace/src/gen.py", line 2787, in get_hf_model
model = model_loader(
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3523, in from_pretrained
model = quantizer.convert_model(model)
File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 229, in convert_model
self._replace_by_quant_layers(model, layers_to_be_replaced)
File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
[Previous line repeated 1 more time]
File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 282, in _replace_by_quant_layers
new_layer = QuantLinear(
File "/h2ogpt_conda/lib/python3.10/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 68, in __init__
assert outfeatures % 32 == 0
AssertionError
https://github.com/PanQiWei/AutoGPTQ/issues/486
https://github.com/PanQiWei/AutoGPTQ/issues/486#issuecomment-1859007200
Need to use the latest transformers off the main branch or wait for 4.36.2.
I'll close this issue once 4.36.2 is added.
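In the meantime, one option (a sketch, untested here) is to pull transformers from the main branch inside the environment:

pip install git+https://github.com/huggingface/transformers.git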
You can use the non-transformers way to load GPTQ, like this:
mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
export GRADIO_SERVER_PORT=7860
docker run \
--gpus all \
--runtime=nvidia \
--shm-size=2g \
-p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
--rm --init \
--network host \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:/workspace/.cache \
-v "${HOME}"/save:/workspace/save \
-v "${HOME}"/user_path:/workspace/user_path \
-v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
-v "${HOME}"/users:/workspace/users \
-v "${HOME}"/db_nonusers:/workspace/db_nonusers \
-v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
-v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
--base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
--prompt_type=mistral \
--save_dir='/workspace/save/' \
--use_gpu_id=False \
--user_path=/workspace/user_path \
--langchain_mode="LLM" \
--langchain_modes="['UserData', 'LLM']" \
--score_model=None \
--max_max_new_tokens=2048 \
--max_new_tokens=1024 \
--use_autogptq=True \
--load_gptq=model \
--use_safetensors=True
It takes a while to load, for whatever reason, like about 4 minutes after the quantization steps, but it does eventually come up.
sudo docker run \
--gpus all \
--runtime=nvidia \
--shm-size=2g \
--rm --init \
--network host \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:/workspace/.cache \
-v "${HOME}"/save:/workspace/save \
-v "${HOME}"/user_path:/workspace/user_path \
-v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
-v "${HOME}"/users:/workspace/users \
-v "${HOME}"/db_nonusers:/workspace/db_nonusers \
-v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
-v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
--base_model=mistralai/Mixtral-8x7B-Instruct-v0.1 \
--use_safetensors=True \
--prompt_type=mistral \
--save_dir='/workspace/save/' \
--auth_filename='/workspace/h2ogpt_auth/auth.json' \
--load_8bit=True \
--use_gpu_id=False \
--max_seq_len=4096
Loads the model, but gives me the following error when I ask a question:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:2 for open-end generation.
A: torch.Size([13, 4096]), B: torch.Size([4096, 4096]), C: (13, 4096); (lda, ldb, ldc): (c_int(416), c_int(131072), c_int(416)); (m, n, k): (c_int(13), c_int(4096), c_int(4096))
GPU Error: exception: cublasLt ran into an error!
cuBLAS API failed with status 15
error detected
Traceback (most recent call last):
File "/workspace/src/gen.py", line 4435, in generate_with_exceptions
func(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
return self.greedy_search(
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
outputs = self(
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1212, in forward
outputs = self.model(
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1080, in forward
layer_outputs = decoder_layer(
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 796, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 304, in forward
query_states = self.q_proj(hidden_states)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 441, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 563, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 401, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1792, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
thread exception: (<class 'Exception'>, Exception('cublasLt ran into an error!'), <traceback object at 0x7f4338160040>)
make stop: (<class 'Exception'>, Exception('cublasLt ran into an error!'), <traceback object at 0x7f4338160040>)
hit stop
Traceback (most recent call last):
File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/queueing.py", line 407, in call_prediction
output = await route_utils.call_process_api(
File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/blocks.py", line 1550, in process_api
result = await self.call_function(
File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/blocks.py", line 1199, in call_function
prediction = await utils.async_iteration(iterator)
File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 519, in async_iteration
return await iterator.__anext__()
File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 512, in __anext__
return await anyio.to_thread.run_sync(
File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 495, in run_sync_iterator_async
return next(iterator)
File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 649, in gen_wrapper
yield from f(*args, **kwargs)
File "/workspace/src/gradio_runner.py", line 4166, in bot
for res in get_response(fun1, history, chatbot_role1, speaker1, tts_language1, roles_state1,
File "/workspace/src/gradio_runner.py", line 4068, in get_response
for output_fun in fun1():
File "/workspace/src/gen.py", line 4278, in evaluate
raise thread.exc
File "/workspace/src/utils.py", line 451, in run
self._return = self._target(*self._args, **self._kwargs)
File "/workspace/src/gen.py", line 4435, in generate_with_exceptions
func(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
return self.greedy_search(
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
outputs = self(
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1212, in forward
outputs = self.model(
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1080, in forward
layer_outputs = decoder_layer(
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 796, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 304, in forward
query_states = self.q_proj(hidden_states)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 441, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 563, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/h2ogpt_conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 401, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1792, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!