opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples, such as ChatQnA and Copilot, that illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

The latest TGI cannot be brought up with "Intel/neural-chat-7b-v3-3" #636

Closed KfreeZ closed 2 days ago

KfreeZ commented 3 weeks ago

The TGI image tagged "text-generation-inference:latest-intel-cpu" fails to start with "Intel/neural-chat-7b-v3-3" after the image was upgraded to the build with "Created": "2024-08-20T20:17:15.742628941Z"; the previous build from 8/16 works fine.
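For reference, the failure should also be reproducible outside Kubernetes with a plain docker run (a minimal sketch; the host cache directory and published port are placeholders, and the container listens on port 80 per the image's PORT env):

$ # start the 8/20 build of the Intel CPU image with the same model;
$ # warmup should fail with the attention() TypeError shown below
$ docker run --rm -p 8080:80 -v $PWD/tgi-data:/data \
      ghcr.io/huggingface/text-generation-inference:latest-intel-cpu \
      --model-id Intel/neural-chat-7b-v3-3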

The error message:

$ kubectl logs -n opea-app-audioqa     tgi-svc-deployment-7b6bcf886-bdpgs
{"timestamp":"2024-08-21T00:49:11.847340Z","level":"INFO","fields":{"message":"Args {\n    model_id: \"Intel/neural-chat-7b-v3-3\",\n    revision: None,\n    validation_workers: 2,\n    sharded: None,\n    num_shard: None,\n    quantize: None,\n    speculate: None,\n    dtype: None,\n    trust_remote_code: false,\n    max_concurrent_requests: 128,\n    max_best_of: 2,\n    max_stop_sequences: 4,\n    max_top_n_tokens: 5,\n    max_input_tokens: None,\n    max_input_length: None,\n    max_total_tokens: None,\n    waiting_served_ratio: 0.3,\n    max_batch_prefill_tokens: None,\n    max_batch_total_tokens: None,\n    max_waiting_tokens: 20,\n    max_batch_size: None,\n    cuda_graphs: Some(\n        [\n            0,\n        ],\n    ),\n    hostname: \"tgi-svc-deployment-7b6bcf886-bdpgs\",\n    port: 2080,\n    shard_uds_path: \"/tmp/text-generation-server\",\n    master_addr: \"localhost\",\n    master_port: 29500,\n    huggingface_hub_cache: Some(\n        \"/data\",\n    ),\n    weights_cache_override: None,\n    disable_custom_kernels: false,\n    cuda_memory_fraction: 1.0,\n    rope_scaling: None,\n    rope_factor: None,\n    json_output: true,\n    otlp_endpoint: None,\n    otlp_service_name: \"text-generation-inference.router\",\n    cors_allow_origin: [],\n    api_key: None,\n    watermark_gamma: None,\n    watermark_delta: None,\n    ngrok: false,\n    ngrok_authtoken: None,\n    ngrok_edge: None,\n    tokenizer_config_path: None,\n    disable_grammar_support: false,\n    env: false,\n    max_client_batch_size: 4,\n    lora_adapters: None,\n    usage_stats: On,\n}"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:11.847563Z","level":"INFO","fields":{"message":"Token file not found \"/tmp/.cache/huggingface/token\"","log.target":"hf_hub","log.module_path":"hf_hub","log.file":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs","log.line":55},"target":"hf_hub"}
{"timestamp":"2024-08-21T00:49:11.847710Z","level":"INFO","fields":{"message":"Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`."},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:11.847721Z","level":"INFO","fields":{"message":"Default `max_input_tokens` to 4095"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:11.847730Z","level":"INFO","fields":{"message":"Default `max_total_tokens` to 4096"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:11.847736Z","level":"INFO","fields":{"message":"Default `max_batch_prefill_tokens` to 4145"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:11.848049Z","level":"INFO","fields":{"message":"Starting check and download process for Intel/neural-chat-7b-v3-3"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2024-08-21T00:49:15.971377Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download."},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:16.656881Z","level":"INFO","fields":{"message":"Successfully downloaded weights for Intel/neural-chat-7b-v3-3"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2024-08-21T00:49:16.657558Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-21T00:49:19.436194Z","level":"WARN","fields":{"message":"FBGEMM fp8 kernels are not installed."},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:19.454895Z","level":"INFO","fields":{"message":"Using prefix caching = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:19.454937Z","level":"INFO","fields":{"message":"Using Attention = paged"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:19.507610Z","level":"WARN","fields":{"message":"Could not import Mamba: No module named 'mamba_ssm'"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:19.669806Z","level":"INFO","fields":{"message":"affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47}, membind = {0}"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:23.081253Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-0"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:23.168023Z","level":"INFO","fields":{"message":"Shard ready in 6.508041365s"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-21T00:49:23.265279Z","level":"INFO","fields":{"message":"Starting Webserver"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:23.311192Z","level":"INFO","message":"Warming up model","target":"text_generation_router_v3","filename":"backends/v3/src/lib.rs","line_number":90}
{"timestamp":"2024-08-21T00:49:23.401568Z","level":"ERROR","fields":{"message":"Method Warmup encountered an error.\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n  File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n    return get_command(self)(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n    return self.main(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n    return _main(\n  File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n    rv = self.invoke(ctx)\n  File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n    return _process_result(sub_ctx.command.invoke(sub_ctx))\n  File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n    return __callback(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n    return callback(**use_params)  # type: ignore\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 109, in serve\n    server.serve(\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 274, in serve\n    asyncio.run(\n  File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n  File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n    self.run_forever()\n  File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n    self._run_once()\n  File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n    handle._run()\n  File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n    self._context.run(self._callback, *self._args)\n  File \"/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py\", line 165, in invoke_intercept_method\n    return await self.intercept(\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py\", line 21, in intercept\n    return await response\n  File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 120, in _unary_interceptor\n    raise error\n  File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 111, in _unary_interceptor\n    return await behavior(request_or_iterator, context)\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 123, in Warmup\n    max_supported_total_tokens = self.model.warmup(batch)\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 1251, in warmup\n    _, batch, _ = self.generate_token(batch)\n  File \"/opt/conda/lib/python3.10/contextlib.py\", line 79, in inner\n    return func(*args, **kwds)\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 1565, in generate_token\n    out, speculative_logits = self.forward(batch, adapter_data)\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 1470, in forward\n    logits, speculative_logits = self.model.forward(\n  File 
\"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 518, in forward\n    hidden_states = self.model(\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1553, in _wrapped_call_impl\n    return self._call_impl(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1562, in _call_impl\n    return forward_call(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 442, in forward\n    hidden_states, residual = layer(\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1553, in _wrapped_call_impl\n    return self._call_impl(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1562, in _call_impl\n    return forward_call(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 367, in forward\n    attn_output = self.self_attn(\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1553, in _wrapped_call_impl\n    return self._call_impl(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1562, in _call_impl\n    return forward_call(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 218, in forward\n    attn_output = attention(\nTypeError: attention() got multiple values for argument 'window_size_left'"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:23.402174Z","level":"ERROR","message":"Server error: attention() got multiple values for argument 'window_size_left'","target":"text_generation_router_v3::client","filename":"backends/v3/src/client/mod.rs","line_number":54,"span":{"name":"warmup"},"spans":[{"max_batch_size":"None","max_input_length":4095,"max_prefill_tokens":4145,"max_total_tokens":4096,"name":"warmup"},{"name":"warmup"}]}
Error: Backend(Warmup(Generation("attention() got multiple values for argument 'window_size_left'")))
{"timestamp":"2024-08-21T00:49:23.470388Z","level":"ERROR","fields":{"message":"Webserver Crashed"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:23.470429Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-21T00:49:23.568692Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-21T00:49:23.568821Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: WebserverFailed
{"timestamp":"2024-08-21T00:49:25.972331Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}

The image inspect output:

/# sudo nerdctl -n k8s.io image inspect ghcr.io/huggingface/text-generation-inference:latest-intel-cpu
[
    {
        "Id": "sha256:359b28ca97a358eb3a9271b0acce28f4f598ea21fc1e650a5988e6c04729e11c",
        "RepoTags": [
            "ghcr.io/huggingface/text-generation-inference:latest-intel-cpu"
        ],
        "RepoDigests": [
            "ghcr.io/huggingface/text-generation-inference@sha256:1770fee18d98272d991d26b48884c52917a62875e8276e3e75bacd133be60903"
        ],
        "Comment": "buildkit.dockerfile.v0",
        "Created": "2024-08-20T20:17:15.742628941Z",
        "Author": "",
        "Config": {
            "AttachStdin": false,
            "Env": [
                "PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "HUGGINGFACE_HUB_CACHE=/data",
                "HF_HUB_ENABLE_HF_TRANSFER=1",
                "PORT=80",
                "LD_PRELOAD=/opt/conda/lib/libtcmalloc.so",
                "CCL_ROOT=/opt/conda/lib/python3.10/site-packages/oneccl_bindings_for_pytorch",
                "I_MPI_ROOT=/opt/conda/lib/python3.10/site-packages/oneccl_bindings_for_pytorch",
                "FI_PROVIDER_PATH=/opt/conda/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/opt/mpi/libfabric/lib/prov:/usr/lib64/libfabric",
                "LD_LIBRARY_PATH=/opt/conda/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/opt/mpi/libfabric/lib:/opt/conda/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib"
            ],
            "Cmd": [
                "--json-output"
            ],
            "WorkingDir": "/usr/src",
            "Entrypoint": [
                "text-generation-launcher"
            ],
            "Labels": {
                "org.opencontainers.image.created": "2024-08-20T20:12:46.746Z",
                "org.opencontainers.image.description": "Large Language Model Text Generation Inference",
                "org.opencontainers.image.licenses": "Apache-2.0",
                "org.opencontainers.image.ref.name": "ubuntu",
                "org.opencontainers.image.revision": "f5f11b797e70b2232632d410273c5c4418475dd1",
                "org.opencontainers.image.source": "https://github.com/huggingface/text-generation-inference",
                "org.opencontainers.image.title": "text-generation-inference",
                "org.opencontainers.image.url": "https://github.com/huggingface/text-generation-inference",
                "org.opencontainers.image.version": "latest-intel-cpu"
            }
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 10394152960,
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:1b9b7346fee7abbc7f5538eaa23548bd05a45abe8daf6794024be0c8ad7d60bb",
                "sha256:88ed08d8e41dba88f5e0cd5d53ee786f7bc423cf272dea82e18740528be71a6a",
                "sha256:5a01ddd7030496b7266e17b5a226251c2d8a72e5ff36b9624e73e722ae62eb69",
                "sha256:66ed3b9200bcb5281ed7da0165da4da7e9860fc217e3960a886d7b3eb2fea0e3",
                "sha256:02a41d18ee51f951248c55f62e1552991cfffca1e11a4b504a90989c94e5260d",
                "sha256:2b2f1304fe01f24ab2851d4ece7f0e09722ba953597421831db7fe3d6d30a346",
                "sha256:b7830aeda0e9892ad48b0dc448f5996b540525ff232612898e6ba85b087eba2c",
                "sha256:9f38df70ad89b7a61fd17c473faf83ffe1dc3c0408a6944c446ff0513c97caa7",
                "sha256:ca580947c18aaa7e603aa532893c63d95903785cdf0ec970793df93957adacf4",
                "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
                "sha256:3044d89cfe229e1d6265056517a6802943c20894bebd41dc5cd4820fa7415dae",
                "sha256:066bd85fa9241b90cb00d1a04d6571b1d334dedd3568c572e634c688461a7358",
                "sha256:b6c9cb007a59a64b117a1d2685ed962bbf59b7c6dd474bbb2c0789d2ab3f988e",
                "sha256:89385793823386efaf9a692e90b6aa34609af3d6c83c3ebf151389378400b133",
                "sha256:7a502fc83ed193dba6c3f8d5644159fef776e42413fc7be1340b4d18ab4a2cd2",
                "sha256:cca3537968cc5d529d23b8a24e6c48dff4c9d4d5f979632fefda210d741666b7",
                "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
                "sha256:655ddb8f8a002a9015ebafc863ff7518b8f476153d6f7421ef1c7b25dad40887",
                "sha256:5f3668e73284c5a328747ec47dae6acd3f92b20a348c61b62ac883f3b4a221b9",
                "sha256:ed4aa8f60efb1c4b7cca78768a92a129556d764e1960b6c45c376d23dde266cf",
                "sha256:83622ddbe858761456bbc152ced9558d312bb65a779554167b94d7eb2e240be1"
            ]
        },
        "Metadata": {
            "LastTagTime": "0001-01-01T00:00:00Z"
        }
    }
]
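If the 8/16 build is still cached on the node, its digest can be listed the same way and compared with the broken 8/20 digest above (assuming nerdctl supports the Docker-compatible --digests flag on this host):

/# # list cached TGI images with their repo digests to recover the last known-good one
/# sudo nerdctl -n k8s.io images --digests | grep text-generation-inference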
yongfengdu commented 3 weeks ago

Maybe we should consider a stable version for milestone releases? See also #625

KfreeZ commented 3 weeks ago

> Maybe we should consider a stable version for milestone releases? See also #625

This happened in GenAIInfra's CI/CD: all tests failed overnight after the TGI image was bumped.
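Until a stable tag is adopted, one stopgap would be to pin the deployment to the digest of the last working build instead of the floating latest-intel-cpu tag, e.g. (a sketch only; the container name tgi and the known-good digest are placeholders, not values from the GenAIInfra manifests):

$ kubectl -n opea-app-audioqa set image deployment/tgi-svc-deployment \
      tgi=ghcr.io/huggingface/text-generation-inference@sha256:<digest-of-the-8/16-build>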

eero-t commented 3 weeks ago

Is this reported in the upstream TGI project? A ticket link is needed so we know which upstream release(s) will provide a fix for it.

lvliang-intel commented 2 days ago

Fixed by https://github.com/opea-project/GenAIExamples/pull/641