huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Gemma model does not work on CPU #1635

Closed truskovskiyk closed 5 months ago

truskovskiyk commented 6 months ago

System Info

Docker: ghcr.io/huggingface/text-generation-inference:1.4

Platform:
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Reproduction

  1. Run the Docker container with GPU and the gemma-2b model:
docker run --gpus all --shm-size 1g -p 8080:80 -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN ghcr.io/huggingface/text-generation-inference:1.4.3 --model-id google/gemma-2b
  2. Call the server:
curl 0.0.0.0:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

Output:

{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\n"}%  
  3. Now start the Docker container with CPU only and the gemma-2b model:
docker run -it --shm-size 1g -p 8080:80 -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN ghcr.io/huggingface/text-generation-inference:1.4 --model-id google/gemma-2b
  4. The shard fails to start with the following error:
2024-03-09T02:40:58.170351Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 422, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type gemma
 rank=0
2024-03-09T02:40:58.269695Z ERROR text_generation_launcher: Shard 0 failed to start
2024-03-09T02:40:58.269721Z  INFO text_generation_launcher: Shutting down shards
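
For context: the traceback ends in get_model in text_generation_server/models/__init__.py, where dispatch falls through to the catch-all ValueError. A minimal sketch of what that fall-through looks like (an assumption inferred from the traceback and from how TGI gates other flash-only architectures; FLASH_ATTENTION and FlashGemma here are illustrative stand-ins, not verified source):

import torch

# Illustrative sketch only: FLASH_ATTENTION and FlashGemma are assumed
# stand-ins for TGI internals, inferred from the traceback above.
FLASH_ATTENTION = torch.cuda.is_available()  # flash kernels need a CUDA GPU

class FlashGemma:
    """Stand-in for a flash-attention Gemma implementation."""
    def __init__(self, model_id: str):
        self.model_id = model_id

def get_model(model_id: str, model_type: str):
    if model_type == "gemma":
        if FLASH_ATTENTION:
            return FlashGemma(model_id)
        # No non-flash fallback for gemma, so CPU execution falls through
    # ... branches for other architectures elided ...
    raise ValueError(f"Unsupported model type {model_type}")

try:
    get_model("google/gemma-2b", "gemma")
except ValueError as e:
    print(e)  # prints "Unsupported model type gemma" on a CPU-only machine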

Expected behavior

TGI should be able to run Gemma models on CPU just as it does on GPU.
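
In the meantime, a possible workaround is to run the model on CPU directly with transformers. This is a sketch, assuming transformers >= 4.38 (which added Gemma support) and that access to the gated checkpoint has already been granted on the Hub:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Requires a Hub token (e.g. HUGGING_FACE_HUB_TOKEN or huggingface-cli login),
# since google/gemma-2b is a gated model.
model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # defaults to CPU, float32

inputs = tokenizer("What is Deep Learning?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))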

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.