h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

Mixtral in docker #1216

Closed alexg711 closed 10 months ago

alexg711 commented 11 months ago

sudo docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=mistralai/Mixtral-8x7B-Instruct-v0.1 \
          --use_safetensors=True \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --auth_filename='/workspace/h2ogpt_auth/auth.json' \
          --load_8bit=True \
          --use_gpu_id=False \
          --max_seq_len=4096

It loads the model, but gives the following error when I ask a question:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:2 for open-end generation.
A: torch.Size([13, 4096]), B: torch.Size([4096, 4096]), C: (13, 4096); (lda, ldb, ldc): (c_int(416), c_int(131072), c_int(416)); (m, n, k): (c_int(13), c_int(4096), c_int(4096))
GPU Error: exception: cublasLt ran into an error!
cuBLAS API failed with status 15
error detected
Traceback (most recent call last):
  File "/workspace/src/gen.py", line 4435, in generate_with_exceptions
    func(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
    return self.greedy_search(
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
    outputs = self(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1212, in forward
    outputs = self.model(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1080, in forward
    layer_outputs = decoder_layer(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 796, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 304, in forward
    query_states = self.q_proj(hidden_states)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 441, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 563, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 401, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1792, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
thread exception: (<class 'Exception'>, Exception('cublasLt ran into an error!'), <traceback object at 0x7f4338160040>)
make stop: (<class 'Exception'>, Exception('cublasLt ran into an error!'), <traceback object at 0x7f4338160040>)
hit stop
Traceback (most recent call last):
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/blocks.py", line 1199, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 519, in async_iteration
    return await iterator.__anext__()
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 512, in __anext__
    return await anyio.to_thread.run_sync(
  File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 495, in run_sync_iterator_async
    return next(iterator)
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 649, in gen_wrapper
    yield from f(*args, **kwargs)
  File "/workspace/src/gradio_runner.py", line 4166, in bot
    for res in get_response(fun1, history, chatbot_role1, speaker1, tts_language1, roles_state1,
  File "/workspace/src/gradio_runner.py", line 4068, in get_response
    for output_fun in fun1():
  File "/workspace/src/gen.py", line 4278, in evaluate
    raise thread.exc
  File "/workspace/src/utils.py", line 451, in run
    self._return = self._target(*self._args, **self._kwargs)
  File "/workspace/src/gen.py", line 4435, in generate_with_exceptions
    func(*args, **kwargs)
  [... same frames as the first traceback above, through transformers, accelerate, and bitsandbytes ...]
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1792, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

pseudotensor commented 11 months ago

I don't know if bitsandbytes will work with Mixtral. That would be a question for the bitsandbytes team.

pseudotensor commented 11 months ago

I recommend trying AWQ instead once it's ready: https://github.com/casper-hansen/AutoAWQ/issues/259

pseudotensor commented 11 months ago

Or you can try GPTQ: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ

pseudotensor commented 11 months ago

Or GGUF: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF

So there are many options.

pseudotensor commented 11 months ago

For bitsandbytes, try 4-bit instead of 8-bit:

https://huggingface.co/blog/mixtral#load-mixtral-with-4-bit-quantization

You can also try the different low_bit_mode values to see which works best. The equivalent of the blog above would be --low_bit_mode=1 --load_4bit=True
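
Applied to the docker command from the original post, that amounts to swapping --load_8bit=True for the 4-bit flags. A sketch showing only the generate.py arguments (the docker flags and volume mounts stay exactly as in the original post):

sudo docker run \
       <same --gpus/--runtime/--shm-size/--network/-v/-u flags as in the original post> \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=mistralai/Mixtral-8x7B-Instruct-v0.1 \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --auth_filename='/workspace/h2ogpt_auth/auth.json' \
          --load_4bit=True \
          --low_bit_mode=1 \
          --use_gpu_id=False \
          --max_seq_len=4096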

alexg711 commented 10 months ago

low_bit_mode=1 and load_4bit=True fixed my issue, thank you!!

That said, when I try to load the GPTQ it just fills up a single one of my 4 GPUs. What are the correct docker run settings to balance it, or can I specify the amount of VRAM per GPU?

(Sorry for the dumb questions and thank you for all of the suggestions, hard work and help)

I want to try the GPTQ or GGUF versions, but I can't figure out how to get them to load. The GGUF errors out as below, and the GPTQ fills up one of my 4 GPUs and errors with OOM.

Command:

sudo docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF \
          --use_safetensors=True \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --auth_filename='/workspace/h2ogpt_auth/auth.json' \
          --use_gpu_id=False

Traceback (most recent call last):
  File "/workspace/generate.py", line 16, in <module>
    entrypoint_main()
  File "/workspace/generate.py", line 12, in entrypoint_main
    H2O_Fire(main)
  File "/workspace/src/utils.py", line 65, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/src/gen.py", line 1896, in main
    model0, tokenizer0, device = get_model_retry(reward_type=False,
  File "/workspace/src/gen.py", line 2220, in get_model_retry
    model1, tokenizer1, device1 = get_model(**kwargs)
  File "/workspace/src/gen.py", line 2573, in get_model
    model, tokenizer, device = get_model_tokenizer_gpt4all(base_model,
  File "/workspace/src/gpt4all_llm.py", line 30, in get_model_tokenizer_gpt4all
    model, tokenizer, redo, max_seq_len = get_llm_gpt4all(**llama_kwargs)
  File "/workspace/src/gpt4all_llm.py", line 184, in get_llm_gpt4all
    llm = cls(**model_kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain_core/load/serializable.py", line 97, in __init__
    super().__init__(**kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/pydantic/v1/main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for H2OLlamaCpp
__root__
  Could not load Llama model from path: llamacpp_path/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf. Received error (type=value_error)

pseudotensor commented 10 months ago

FYI, --use_gpu_id has no effect for GGUF (llama.cpp) models, since they have no multi-GPU control. The only way to specify which GPUs to use is to set CUDA_VISIBLE_DEVICES, as described in FAQ.md, in the README for Windows, and in the UI side panel of the Models tab. This is not great, because the entire h2oGPT instance is then limited to those GPUs, but there's no workaround AFAIK until they add that feature.
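
For example, a sketch of how that could look with docker run (assuming GPUs 0 and 1 are the ones you want llama.cpp to see; adjust the indices for your setup): pass the environment variable into the container so only those devices are visible to CUDA, keeping the rest of the command as above:

docker run \
       --gpus all \
       --runtime=nvidia \
       -e CUDA_VISIBLE_DEVICES=0,1 \
       <other flags and volume mounts as above> \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF \
          --prompt_type=mistral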

pseudotensor commented 10 months ago

I don't have any issue with Mixtral using GGUF in docker with the latest image:

mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
export GRADIO_SERVER_PORT=7860
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --user_path=/workspace/user_path \
          --langchain_mode="LLM" \
          --langchain_modes="['UserData', 'LLM']" \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024


pseudotensor commented 10 months ago

With HF GPTQ, using the latest transformers==4.36.1 and the latest auto-gptq 0.6.0:

mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
export GRADIO_SERVER_PORT=7860
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --user_path=/workspace/user_path \
          --langchain_mode="LLM" \
          --langchain_modes="['UserData', 'LLM']" \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024

I hit:

  File "/workspace/src/gen.py", line 2787, in get_hf_model
    model = model_loader(
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3523, in from_pretrained
    model = quantizer.convert_model(model)
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 229, in convert_model
    self._replace_by_quant_layers(model, layers_to_be_replaced)
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
    self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
    self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
    self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
  [Previous line repeated 1 more time]
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 282, in _replace_by_quant_layers
    new_layer = QuantLinear(
  File "/h2ogpt_conda/lib/python3.10/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 68, in __init__
    assert outfeatures % 32 == 0
AssertionError

https://github.com/PanQiWei/AutoGPTQ/issues/486

https://github.com/PanQiWei/AutoGPTQ/issues/486#issuecomment-1859007200

Need to use the latest transformers from the main branch, or wait for 4.36.2.
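
If you don't want to wait for the 4.36.2 release, installing transformers from the main branch should pick up the fix; roughly (not verified inside this image):

pip install --upgrade git+https://github.com/huggingface/transformers.git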

pseudotensor commented 10 months ago

I'll close this issue once 4.36.2 is added.

pseudotensor commented 10 months ago

You can use the non-transformers way to load GPTQ, like this:

mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
export GRADIO_SERVER_PORT=7860
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --user_path=/workspace/user_path \
          --langchain_mode="LLM" \
          --langchain_modes="['UserData', 'LLM']" \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024 \
          --use_autogptq=True \
          --load_gptq=model \
          --use_safetensors=True

It takes a while to load, for whatever reason (about 4 minutes after the quantization step), but it does eventually come up.
