icyxp closed this issue 5 months ago.
Ditto, we see this once we hit a certain concurrency (~20 requests on an A100). Except we are using Mistral, so the error is 'FlashMistral' object has no attribute 'compiled_model'.
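For reference, a rough load script along these lines is enough concurrency to surface it on our side. This is only a sketch: the prompt is a placeholder, port 8001 matches the launcher args further down, and the request body follows TGI's standard /generate schema.

```python
import asyncio
import httpx

TGI_URL = "http://localhost:8001/generate"  # port taken from the launcher args below
CONCURRENCY = 20  # roughly where the AttributeError starts appearing for us

async def one_request(client: httpx.AsyncClient, i: int) -> None:
    payload = {
        "inputs": f"Request {i}: write a short poem about GPUs.",  # placeholder prompt
        "parameters": {"max_new_tokens": 128},
    }
    resp = await client.post(TGI_URL, json=payload, timeout=120)
    print(i, resp.status_code, resp.text[:80])

async def main() -> None:
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(one_request(client, i) for i in range(CONCURRENCY)))

asyncio.run(main())
```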
The full command line used that causes issues: see below
OS version: Ubuntu
Hardware used (GPUs, how many, on which cloud) (nvidia-smi): A100 80GB
Deployment specificities (Kubernetes, EKS, AKS, any particular deployments): GKE
The current version being used: ghcr.io/huggingface/text-generation-inference@sha256:deb8ab8e39c8407386c5430c29b725a0fc997444b478a493be3d5218333788c5
Error:
{"timestamp":"2024-04-25T00:59:06.353275Z","level":"ERROR","fields":{"message":"Method Decode encountered an error.\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 240, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n File \"/otel-auto-instrumentation-python/opentelemetry/instrumentation/grpc/_aio_server.py\", line 123, in _unary_interceptor\n return await behavior(request_or_iterator, context)\n File \"/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py\", line 165, in invoke_intercept_method\n return await self.intercept(\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py\", line 21, in intercept\n return await response\n File \"/otel-auto-instrumentation-python/opentelemetry/instrumentation/grpc/_aio_server.py\", line 132, in _unary_interceptor\n raise error\n File \"/otel-auto-instrumentation-python/opentelemetry/instrumentation/grpc/_aio_server.py\", line 123, in _unary_interceptor\n return await behavior(request_or_iterator, context)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 157, in Decode\n generations, next_batch, timings = self.model.generate_token(batch)\n File \"/opt/conda/lib/python3.10/contextlib.py\", line 79, in inner\n return func(*args, **kwds)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 947, in generate_token\n raise e\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 944, in generate_token\n out, speculative_logits = self.forward(batch)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py\", line 516, in forward\n logits, 
speculative_logits = self.compiled_model(\nAttributeError: 'FlashMistral' object has no attribute 'compiled_model'\n"},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T00:59:06.354656Z","level":"ERROR","message":"Server error: 'FlashMistral' object has no attribute
CMD
- args:
    - --json-output
    - --port=8001
    - --max-input-length=4096
    - --max-batch-prefill-tokens=4096
    - --max-total-tokens=8192
    - --cuda-memory-fraction=0.5
    - --otlp-endpoint
    - localhost:4317
  command:
    - text-generation-launcher
  env:
    - name: MODEL_ID
      value: mistralai/Mistral-7B-Instruct-v0.1
    - name: HF_HUB_ENABLE_HF_TRANSFER
      value: "1"
    - name: CUDA_LAUNCH_BLOCKING
      value: "1"
ENV
{"timestamp":"2024-04-25T14:14:04.332439Z","level":"INFO","fields":{"message":"Runtime environment:\nTarget: x86_64-unknown-linux-gnu\nCargo version: 1.75.0\nCommit sha: fccf5edf45836491d8cdd9e2c98d5cde9bae76ab\nDocker label: sha-fccf5ed\nnvidia-smi:\nThu Apr 25 14:14:03 2024 \n +---------------------------------------------------------------------------------------+\n | NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |\n |-----------------------------------------+----------------------+----------------------+\n | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n | | | MIG M. |\n |=========================================+======================+======================|\n | 0 NVIDIA A100-SXM4-80GB On | 00000000:00:06.0 Off | 0 |\n | N/A 34C P0 69W / 400W | 4MiB / 81920MiB | 0% Default |\n | | | Disabled |\n +-----------------------------------------+----------------------+----------------------+\n \n +---------------------------------------------------------------------------------------+\n | Processes: |\n | GPU GI CI PID Type Process name GPU Memory |\n | ID ID Usage |\n |=======================================================================================|\n | No running processes found |\n +---------------------------------------------------------------------------------------+"},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T14:14:04.332506Z","level":"INFO","fields":{"message":"Args { model_id: \"/data/models/mistralai/Mistral-7B-Instruct-v0.1/main\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(4096), max_total_tokens: Some(8192), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(4096), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: \"gcs-fuse-poc-6f8c67d77f-25mn9\", port: 8001, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.5, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: Some(\"localhost:4317\"), cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4 }"},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T14:14:04.333464Z","level":"INFO","fields":{"message":"Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`."},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T14:14:04.333485Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T14:14:04.333655Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2024-04-25T14:14:08.501393Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
Version 1.4.5 doesn't have this issue, as far as I can tell.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
System Info
Request failed during generation: Server error: 'FlashMixtral' object has no attribute 'compiled_model'
server/text_generation_server/models/flash_mistral.py, line 516
Information
Tasks
Reproduction
1
Expected behavior
1