h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

Mixtral in docker #1216

Closed alexg711 closed 10 months ago

alexg711 commented 11 months ago

sudo docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=mistralai/Mixtral-8x7B-Instruct-v0.1 \
          --use_safetensors=True \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --auth_filename='/workspace/h2ogpt_auth/auth.json' \
          --load_8bit=True \
          --use_gpu_id=False \
          --max_seq_len=4096

It loads the model, but gives the following error when I ask a question:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:2 for open-end generation.
A: torch.Size([13, 4096]), B: torch.Size([4096, 4096]), C: (13, 4096); (lda, ldb, ldc): (c_int(416), c_int(131072), c_int(416)); (m, n, k): (c_int(13), c_int(4096), c_int(4096))
GPU Error: exception: cublasLt ran into an error!
cuBLAS API failed with status 15
error detected
Traceback (most recent call last):
  File "/workspace/src/gen.py", line 4435, in generate_with_exceptions
    func(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
    return self.greedy_search(
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
    outputs = self(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1212, in forward
    outputs = self.model(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1080, in forward
    layer_outputs = decoder_layer(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 796, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 304, in forward
    query_states = self.q_proj(hidden_states)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 441, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 563, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 401, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1792, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
thread exception: (<class 'Exception'>, Exception('cublasLt ran into an error!'), <traceback object at 0x7f4338160040>)
make stop: (<class 'Exception'>, Exception('cublasLt ran into an error!'), <traceback object at 0x7f4338160040>)
hit stop
Traceback (most recent call last):
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/blocks.py", line 1199, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 519, in async_iteration
    return await iterator.__anext__()
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 512, in __anext__
    return await anyio.to_thread.run_sync(
  File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 495, in run_sync_iterator_async
    return next(iterator)
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 649, in gen_wrapper
    yield from f(*args, **kwargs)
  File "/workspace/src/gradio_runner.py", line 4166, in bot
    for res in get_response(fun1, history, chatbot_role1, speaker1, tts_language1, roles_state1,
  File "/workspace/src/gradio_runner.py", line 4068, in get_response
    for output_fun in fun1():
  File "/workspace/src/gen.py", line 4278, in evaluate
    raise thread.exc
  File "/workspace/src/utils.py", line 451, in run
    self._return = self._target(*self._args, **self._kwargs)
  File "/workspace/src/gen.py", line 4435, in generate_with_exceptions
    func(*args, **kwargs)
  [... same frames as the first traceback above, through transformers, accelerate, and bitsandbytes ...]
  File "/h2ogpt_conda/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1792, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

pseudotensor commented 11 months ago

I don't know if bitsandbytes will work with Mixtral. That would be a question for the bitsandbytes team.

pseudotensor commented 11 months ago

I recommend trying AWQ instead once it's ready: https://github.com/casper-hansen/AutoAWQ/issues/259

pseudotensor commented 11 months ago

Or you can try GPTQ: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ

pseudotensor commented 11 months ago

Or GGUF: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF

So there are many options.

pseudotensor commented 11 months ago

For bitsandbytes, try 4-bit instead of 8-bit:

https://huggingface.co/blog/mixtral#load-mixtral-with-4-bit-quantization

You can also try the different low_bit_mode values to see which works best. The equivalent of the blog above would be --low_bit_mode=1 --load_4bit=True
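
Applied to the docker command from the original post, that amounts to swapping --load_8bit=True for the 4-bit flags. A sketch showing only the generate.py arguments (the docker flags and volume mounts stay exactly as in the original post):

sudo docker run \
       <same --gpus/--runtime/--shm-size/--network/-v/-u flags as in the original post> \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=mistralai/Mixtral-8x7B-Instruct-v0.1 \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --auth_filename='/workspace/h2ogpt_auth/auth.json' \
          --load_4bit=True \
          --low_bit_mode=1 \
          --use_gpu_id=False \
          --max_seq_len=4096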

alexg711 commented 10 months ago

low_bit_mode=1 and load_4bit=True fixed my issue, thank you!!

That said, when I try to load the GPTQ it just fills up a single one of my 4 GPUs. What are the correct docker run settings to balance it, or can I specify the amount of VRAM per GPU?

(Sorry for the dumb questions and thank you for all of the suggestions, hard work and help)

I want to try the GPTQ or GGUF versions, but I can't figure out how to get them to load. The GGUF errors out as below, and the GPTQ fills up one of my 4 GPUs and errors with OOM.

Command:

sudo docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF \
          --use_safetensors=True \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --auth_filename='/workspace/h2ogpt_auth/auth.json' \
          --use_gpu_id=False

Traceback (most recent call last):
  File "/workspace/generate.py", line 16, in <module>
    entrypoint_main()
  File "/workspace/generate.py", line 12, in entrypoint_main
    H2O_Fire(main)
  File "/workspace/src/utils.py", line 65, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/src/gen.py", line 1896, in main
    model0, tokenizer0, device = get_model_retry(reward_type=False,
  File "/workspace/src/gen.py", line 2220, in get_model_retry
    model1, tokenizer1, device1 = get_model(**kwargs)
  File "/workspace/src/gen.py", line 2573, in get_model
    model, tokenizer, device = get_model_tokenizer_gpt4all(base_model,
  File "/workspace/src/gpt4all_llm.py", line 30, in get_model_tokenizer_gpt4all
    model, tokenizer, redo, max_seq_len = get_llm_gpt4all(**llama_kwargs)
  File "/workspace/src/gpt4all_llm.py", line 184, in get_llm_gpt4all
    llm = cls(**model_kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain_core/load/serializable.py", line 97, in __init__
    super().__init__(**kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/pydantic/v1/main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for H2OLlamaCpp
__root__
  Could not load Llama model from path: llamacpp_path/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf. Received error (type=value_error)

pseudotensor commented 10 months ago

FYI, --use_gpu_id has no effect for GGUF (llama.cpp) models, since they have no multi-GPU control. The only way to specify which GPUs to use is to set CUDA_VISIBLE_DEVICES, as described in FAQ.md, in the README for Windows, and in the UI side panel of the Models tab. This is not great, because the entire h2oGPT instance is then limited to those GPUs, but there's no workaround AFAIK until they add that feature.
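
For example, a sketch of how that could look with docker run (assuming GPUs 0 and 1 are the ones you want llama.cpp to see; adjust the indices for your setup): pass the environment variable into the container so only those devices are visible to CUDA, keeping the rest of the command as above:

docker run \
       --gpus all \
       --runtime=nvidia \
       -e CUDA_VISIBLE_DEVICES=0,1 \
       <other flags and volume mounts as above> \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF \
          --prompt_type=mistral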

pseudotensor commented 10 months ago

I don't have any issue with Mixtral using GGUF in docker with the latest image:

mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
export GRADIO_SERVER_PORT=7860
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --user_path=/workspace/user_path \
          --langchain_mode="LLM" \
          --langchain_modes="['UserData', 'LLM']" \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024


pseudotensor commented 10 months ago

With HF GPTQ, using the latest transformers==4.36.1 and the latest auto-gptq 0.6.0:

mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
export GRADIO_SERVER_PORT=7860
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --user_path=/workspace/user_path \
          --langchain_mode="LLM" \
          --langchain_modes="['UserData', 'LLM']" \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024

I hit:

  File "/workspace/src/gen.py", line 2787, in get_hf_model
    model = model_loader(
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3523, in from_pretrained
    model = quantizer.convert_model(model)
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 229, in convert_model
    self._replace_by_quant_layers(model, layers_to_be_replaced)
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
    self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
    self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
    self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
  [Previous line repeated 1 more time]
  File "/h2ogpt_conda/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 282, in _replace_by_quant_layers
    new_layer = QuantLinear(
  File "/h2ogpt_conda/lib/python3.10/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 68, in __init__
    assert outfeatures % 32 == 0
AssertionError

https://github.com/PanQiWei/AutoGPTQ/issues/486

https://github.com/PanQiWei/AutoGPTQ/issues/486#issuecomment-1859007200

Need to use the latest transformers from the main branch, or wait for 4.36.2.
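
If you don't want to wait for the 4.36.2 release, installing transformers from the main branch should pick up the fix; roughly (not verified inside this image):

pip install --upgrade git+https://github.com/huggingface/transformers.git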

pseudotensor commented 10 months ago

I'll close this issue once 4.36.2 is added.

pseudotensor commented 10 months ago

You can use the non-transformers way to load GPTQ, like this:

mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
export GRADIO_SERVER_PORT=7860
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -v "${HOME}"/user_path:/workspace/user_path \
       -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
       -v "${HOME}"/users:/workspace/users \
       -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
       -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
       -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
          --prompt_type=mistral \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --user_path=/workspace/user_path \
          --langchain_mode="LLM" \
          --langchain_modes="['UserData', 'LLM']" \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024 \
          --use_autogptq=True \
          --load_gptq=model \
          --use_safetensors=True

It takes a while to load, for whatever reason (about 4 minutes after the quantization step), but it does eventually come up.
