Error with sharded Mixtral

System Info

128 gb RAM. On-premise machine with 2 GPUs.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   35C    P0    51W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   35C    P0    52W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Information

[X] Docker
[ ] The CLI directly

Tasks

[ ] An officially supported command
[X] My own modifications

Reproduction

docker run --gpus all \
    -e HF_HOME=/data \
    -e CUDA_VISIBLE_DEVICES=all \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e NCCL_P2P_DISABLE=1 \
    -e TRITON_LIBCUDA_PATH=/usr/local/cuda-12.1/compat/ \
    -v /storage/tf_cache/:/data \
    -p 8000:8000 \
    --rm \
    --ipc=host \
    --shm-size=100Gb \
    --name tgi_test \
    ghcr.io/huggingface/text-generation-inference:2.1.0 \
    --model-id OpenBuddy/openbuddy-mixtral-7bx8-v18.1-32k --trust-remote-code --port 8000 --hostname 0.0.0.0 --num-shard 2

Expected behavior

logs:

2024-06-28T11:20:19.574965Z  INFO text_generation_launcher: Args {
    model_id: "OpenBuddy/openbuddy-mixtral-7bx8-v18.1-32k",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: Some(
        2,
    ),
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: true,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "0.0.0.0",
    port: 8000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
}
2024-06-28T11:20:19.575029Z  INFO hf_hub: Token file not found "/data/token"    
2024-06-28T11:20:19.576535Z  INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
2024-06-28T11:20:19.576541Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-06-28T11:20:19.576544Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-06-28T11:20:19.576546Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-06-28T11:20:19.576548Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-28T11:20:19.576551Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `OpenBuddy/openbuddy-mixtral-7bx8-v18.1-32k` do not contain malicious code.
2024-06-28T11:20:19.576553Z  INFO text_generation_launcher: Sharding model on 2 processes
2024-06-28T11:20:19.576633Z  INFO download: text_generation_launcher: Starting download process.
2024-06-28T11:20:21.398471Z  INFO text_generation_launcher: Detected system cpu
2024-06-28T11:20:23.045399Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-28T11:20:23.697443Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-28T11:20:23.697672Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-28T11:20:23.697685Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-06-28T11:20:25.649885Z  INFO text_generation_launcher: Detected system cpu
2024-06-28T11:20:25.654759Z  INFO text_generation_launcher: Detected system cpu
2024-06-28T11:20:26.950985Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)
2024-06-28T11:20:26.956319Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)
2024-06-28T11:20:27.276862Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 775, in get_model
    raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Mixtral"))
NotImplementedError: Sharded Mixtral requires Flash Attention enabled models.
2024-06-28T11:20:27.281308Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 775, in get_model
    raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Mixtral"))
NotImplementedError: Sharded Mixtral requires Flash Attention enabled models.
2024-06-28T11:20:27.901622Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 775, in get_model
    raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Mixtral"))

NotImplementedError: Sharded Mixtral requires Flash Attention enabled models.
 rank=0
2024-06-28T11:20:27.901921Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 775, in get_model
    raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Mixtral"))

NotImplementedError: Sharded Mixtral requires Flash Attention enabled models.
 rank=1
2024-06-28T11:20:28.001594Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-28T11:20:28.001605Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

I understand that this error is due to the fact that this module cannot be imported: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/__init__.py#L53-L92 WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)

i found out that it is my mistake. in env -e CUDA_VISIBLE_DEVICES=0,1. But now I have another issue

docker run --gpus all \
    -e HF_HOME=/data \
    -e CUDA_VISIBLE_DEVICES=0,1 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e NCCL_P2P_DISABLE=1 \
    -e TRITON_LIBCUDA_PATH=/usr/local/cuda-12.1/compat/ \
    -v /storage/tf_cache/:/data \
    -p 8000:8000 \
    --rm \
    --ipc=host \
    --shm-size=100Gb \
    --name tgi_test \
    ghcr.io/huggingface/text-generation-inference:2.1.0 \
    --model-id OpenBuddy/openbuddy-mixtral-7bx8-v18.1-32k --trust-remote-code --port 8000 --hostname 0.0.0.0 --num-shard 2

LOGS:

2024-06-28T11:27:02.183907Z  INFO text_generation_launcher: Args {
    model_id: "OpenBuddy/openbuddy-mixtral-7bx8-v18.1-32k",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: Some(
        2,
    ),
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: true,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "0.0.0.0",
    port: 8000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
}
2024-06-28T11:27:02.183972Z  INFO hf_hub: Token file not found "/data/token"    
2024-06-28T11:27:02.185611Z  INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
2024-06-28T11:27:02.185618Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-06-28T11:27:02.185620Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-06-28T11:27:02.185622Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-06-28T11:27:02.185624Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-28T11:27:02.185628Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `OpenBuddy/openbuddy-mixtral-7bx8-v18.1-32k` do not contain malicious code.
2024-06-28T11:27:02.185630Z  INFO text_generation_launcher: Sharding model on 2 processes
2024-06-28T11:27:02.185716Z  INFO download: text_generation_launcher: Starting download process.
2024-06-28T11:27:05.000830Z  INFO text_generation_launcher: Detected system cuda
2024-06-28T11:27:07.021516Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-28T11:27:07.698392Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-28T11:27:07.698638Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-28T11:27:07.698639Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-06-28T11:27:10.655958Z  INFO text_generation_launcher: Detected system cuda
2024-06-28T11:27:10.749024Z  INFO text_generation_launcher: Detected system cuda
2024-06-28T11:27:17.706926Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:27:17.707165Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:27:27.720205Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:27:27.740447Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:27:37.806208Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:27:37.812606Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:27:47.900818Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:27:47.904761Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:27:57.915449Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:27:57.916446Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:28:07.924304Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:28:07.939241Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:28:17.995423Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:28:18.012241Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:28:28.012532Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:28:28.102771Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:28:38.027759Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:28:38.117450Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:28:48.100639Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:28:48.207074Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:28:58.114913Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:28:58.220436Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:29:08.131400Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:29:08.239067Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:29:18.218077Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:29:18.319564Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:29:28.227565Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:29:28.329419Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:29:38.307420Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:29:38.342966Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:29:48.402267Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:29:48.412476Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:29:58.422970Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:29:58.423067Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:30:08.501083Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:30:08.525819Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:30:18.526233Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:30:18.619383Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:30:28.606955Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:30:28.711349Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:30:38.666041Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:30:38.771558Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:30:48.717671Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:30:48.883301Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:30:58.769347Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-28T11:30:58.887333Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T11:31:04.854994Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 766, in get_model
    return FlashMixtral(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mixtral.py", line 22, in __init__
    super(FlashMixtral, self).__init__(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 97, in __init__
    super().__init__(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 822, in __init__
    super(FlashCausalLM, self).__init__(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/model.py", line 63, in __init__
    self.target_to_layer = self.adapter_target_to_layer()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 156, in adapter_target_to_layer
    if hasattr(layer.mlp, "gate_up_proj"):
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'MixtralLayer' object has no attribute 'mlp'
2024-06-28T11:31:04.857312Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 766, in get_model
    return FlashMixtral(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mixtral.py", line 22, in __init__
    super(FlashMixtral, self).__init__(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 97, in __init__
    super().__init__(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 822, in __init__
    super(FlashCausalLM, self).__init__(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/model.py", line 63, in __init__
    self.target_to_layer = self.adapter_target_to_layer()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 156, in adapter_target_to_layer
    if hasattr(layer.mlp, "gate_up_proj"):
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'MixtralLayer' object has no attribute 'mlp'
2024-06-28T11:31:08.060487Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[rank0]: Traceback (most recent call last):

[rank0]:   File "/opt/conda/bin/text-generation-server", line 8, in <module>
[rank0]:     sys.exit(app())

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
[rank0]:     server.serve(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
[rank0]:     asyncio.run(

[rank0]:   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
[rank0]:     return loop.run_until_complete(main)

[rank0]:   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank0]:     return future.result()

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
[rank0]:     model = get_model(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 766, in get_model
[rank0]:     return FlashMixtral(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mixtral.py", line 22, in __init__
[rank0]:     super(FlashMixtral, self).__init__(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 97, in __init__
[rank0]:     super().__init__(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 822, in __init__
[rank0]:     super(FlashCausalLM, self).__init__(

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/model.py", line 63, in __init__
[rank0]:     self.target_to_layer = self.adapter_target_to_layer()

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 156, in adapter_target_to_layer
[rank0]:     if hasattr(layer.mlp, "gate_up_proj"):

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

[rank0]: AttributeError: 'MixtralLayer' object has no attribute 'mlp'
 rank=0
2024-06-28T11:31:08.079185Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-28T11:31:08.079210Z  INFO text_generation_launcher: Shutting down shards
2024-06-28T11:31:08.159265Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-06-28T11:31:08.159383Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-06-28T11:31:08.860057Z  INFO shard-manager: text_generation_launcher: shard terminated rank=1
Error: ShardCannotStart

huggingface / text-generation-inference