huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Request failed during generation: Server error: 'FlashMixtral' object has no attribute 'compiled_model' #1803

Closed. icyxp closed this issue 3 months ago.

icyxp commented 4 months ago

System Info

Request failed during generation: Server error: 'FlashMixtral' object has no attribute 'compiled_model'

server/text_generation_server/models/flash_mistral.py, line 516
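The failing call (flash_mistral.py, line 516, shown in full in the traceback further down this thread) is logits, speculative_logits = self.compiled_model(...), so the crash is an ordinary Python AttributeError: forward reaches that call on an instance that never had compiled_model assigned. The snippet below is a minimal, self-contained sketch of that general pattern; the class name and the use_compile flag are invented for illustration and are not TGI's actual code.

# Minimal sketch of the failure pattern (hypothetical class, not TGI's code):
# an attribute is only assigned on a conditional setup path, but a later
# code path assumes it always exists.
import torch


class ToyFlashModel:
    def __init__(self, model: torch.nn.Module, use_compile: bool = False):
        self.model = model
        if use_compile:
            # Only this branch ever creates the attribute.
            self.compiled_model = torch.compile(model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # If __init__ never took the use_compile branch, this raises:
        #   AttributeError: 'ToyFlashModel' object has no attribute 'compiled_model'
        return self.compiled_model(x)


toy = ToyFlashModel(torch.nn.Linear(4, 4), use_compile=False)
try:
    toy.forward(torch.randn(1, 4))
except AttributeError as err:
    print(err)  # same shape of message as the server error above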

Information

Tasks

Reproduction


Expected behavior


dr3s commented 4 months ago

Ditto, we see this once we hit a certain concurrency (~20 requests on an A100). Except we are using Mistral, so the error is 'FlashMistral' object has no attribute 'compiled_model'.
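For reference, a load script along the following lines is enough to reach that level of concurrency. This is a hypothetical client sketch, assuming TGI's /generate JSON endpoint on the port used in the deployment args further down; adjust the URL and payload for your setup.

# Hypothetical reproduction sketch: fire ~20 concurrent /generate requests
# at a running TGI instance. URL and port are assumptions taken from the
# deployment args shown later in this thread.
from concurrent.futures import ThreadPoolExecutor

import requests

TGI_URL = "http://localhost:8001/generate"  # assumed port from the k8s args below


def one_request(i: int) -> int:
    payload = {
        "inputs": f"Request {i}: write a short paragraph about GPUs.",
        "parameters": {"max_new_tokens": 128},
    }
    resp = requests.post(TGI_URL, json=payload, timeout=120)
    return resp.status_code


with ThreadPoolExecutor(max_workers=20) as pool:
    for status in pool.map(one_request, range(20)):
        print(status)  # 200 on success; non-2xx once generation starts failing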

dr3s commented 4 months ago

The full command line used that causes issues: see below
OS version: Ubuntu
Hardware used (GPUs, how many, on which cloud) (nvidia-smi): A100 80GB
Deployment specificities (Kubernetes, EKS, AKS, any particular deployments): GKE
The current version being used: ghcr.io/huggingface/text-generation-inference@sha256:deb8ab8e39c8407386c5430c29b725a0fc997444b478a493be3d5218333788c5

Error:

{"timestamp":"2024-04-25T00:59:06.353275Z","level":"ERROR","fields":{"message":"Method Decode encountered an error.\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n  File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n    return get_command(self)(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n    return self.main(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n    return _main(\n  File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n    rv = self.invoke(ctx)\n  File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n    return _process_result(sub_ctx.command.invoke(sub_ctx))\n  File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n    return __callback(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n    return callback(**use_params)  # type: ignore\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n    server.serve(\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 240, in serve\n    asyncio.run(\n  File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n  File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n    self.run_forever()\n  File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n    self._run_once()\n  File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n    handle._run()\n  File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n    self._context.run(self._callback, *self._args)\n  File \"/otel-auto-instrumentation-python/opentelemetry/instrumentation/grpc/_aio_server.py\", line 123, in _unary_interceptor\n    return await behavior(request_or_iterator, context)\n  File \"/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py\", line 165, in invoke_intercept_method\n    return await self.intercept(\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py\", line 21, in intercept\n    return await response\n  File \"/otel-auto-instrumentation-python/opentelemetry/instrumentation/grpc/_aio_server.py\", line 132, in _unary_interceptor\n    raise error\n  File \"/otel-auto-instrumentation-python/opentelemetry/instrumentation/grpc/_aio_server.py\", line 123, in _unary_interceptor\n    return await behavior(request_or_iterator, context)\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 157, in Decode\n    generations, next_batch, timings = self.model.generate_token(batch)\n  File \"/opt/conda/lib/python3.10/contextlib.py\", line 79, in inner\n    return func(*args, **kwds)\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 947, in generate_token\n    raise e\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 944, in generate_token\n    out, speculative_logits = self.forward(batch)\n  File 
\"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py\", line 516, in forward\n    logits, speculative_logits = self.compiled_model(\nAttributeError: 'FlashMistral' object has no attribute 'compiled_model'\n"},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T00:59:06.354656Z","level":"ERROR","message":"Server error: 'FlashMistral' object has no attribute 

CMD

- args:
    - --json-output
    - --port=8001
    - --max-input-length=4096
    - --max-batch-prefill-tokens=4096
    - --max-total-tokens=8192
    - --cuda-memory-fraction=0.5
    - --otlp-endpoint
    - localhost:4317
  command:
    - text-generation-launcher
  env:
    - name: MODEL_ID
      value: mistralai/Mistral-7B-Instruct-v0.1
    - name: HF_HUB_ENABLE_HF_TRANSFER
      value: "1"
    - name: CUDA_LAUNCH_BLOCKING
      value: "1"

ENV

{"timestamp":"2024-04-25T14:14:04.332439Z","level":"INFO","fields":{"message":"Runtime environment:\nTarget: x86_64-unknown-linux-gnu\nCargo version: 1.75.0\nCommit sha: fccf5edf45836491d8cdd9e2c98d5cde9bae76ab\nDocker label: sha-fccf5ed\nnvidia-smi:\nThu Apr 25 14:14:03 2024       \n   +---------------------------------------------------------------------------------------+\n   | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |\n   |-----------------------------------------+----------------------+----------------------+\n   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |\n   |                                         |                      |               MIG M. |\n   |=========================================+======================+======================|\n   |   0  NVIDIA A100-SXM4-80GB          On  | 00000000:00:06.0 Off |                    0 |\n   | N/A   34C    P0              69W / 400W |      4MiB / 81920MiB |      0%      Default |\n   |                                         |                      |             Disabled |\n   +-----------------------------------------+----------------------+----------------------+\n                                                                                            \n   +---------------------------------------------------------------------------------------+\n   | Processes:                                                                            |\n   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |\n   |        ID   ID                                                             Usage      |\n   |=======================================================================================|\n   |  No running processes found                                                           |\n   +---------------------------------------------------------------------------------------+"},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T14:14:04.332506Z","level":"INFO","fields":{"message":"Args { model_id: \"/data/models/mistralai/Mistral-7B-Instruct-v0.1/main\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(4096), max_total_tokens: Some(8192), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(4096), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: \"gcs-fuse-poc-6f8c67d77f-25mn9\", port: 8001, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.5, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: Some(\"localhost:4317\"), cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4 }"},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T14:14:04.333464Z","level":"INFO","fields":{"message":"Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`."},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T14:14:04.333485Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
{"timestamp":"2024-04-25T14:14:04.333655Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2024-04-25T14:14:08.501393Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
dr3s commented 4 months ago

Version 1.4.5 doesn't have this issue, as far as I can tell.
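When comparing versions it can help to confirm which build is actually serving traffic. The router's /info endpoint reports version, sha, and docker_label (the same fields that appear in the runtime environment dump above); a quick check, assuming that endpoint is reachable in your deployment, looks like this:

# Sketch: query a running TGI router for its build metadata.
import requests

info = requests.get("http://localhost:8001/info", timeout=10).json()
print(info.get("version"), info.get("sha"), info.get("docker_label"))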

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.