huggingface / optimum-tpu

Google TPU optimizations for transformers models
Apache License 2.0
66 stars 17 forks source link

Issue running mistral-7b-instruct-v0-3 on Inference Endpoint #76

Closed pagezyhf closed 2 months ago

pagezyhf commented 2 months ago

Hello, I am using the HF UI for testing the model and once the endpoint is running I get an error when testing it.

Logs: 2024/07/12 18:12:41 {"timestamp":"2024-07-12T22:12:41.828554Z","level":"ERROR","fields":{"message":"Method Prefill encountered an error.\nTraceback (most recent call last):\n File \"/usr/local/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/usr/local/lib/python3.10/dist-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/usr/local/lib/python3.10/dist-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/usr/local/lib/python3.10/dist-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/usr/local/lib/python3.10/dist-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/usr/local/lib/python3.10/dist-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py\", line 69, in serve\n serve(\n File \"/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py\", line 107, in serve\n asyncio.run(serve_inner(model_path))\n File \"/usr/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/usr/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/usr/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/usr/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/usr/lib/python3.10/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n File \"/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py\", line 159, in invoke_intercept_method\n return await self.intercept(\n> File \"/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py\", line 20, in intercept\n return await response\n File \"/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py\", line 54, in Prefill\n generations, batch = self.generator.prefill(request.batch)\n File \"/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py\", line 897, in prefill\n s_generations, s_cached_batch = self.mailbox.send(GeneratorCommand.PREFILL, batch.SerializeToString())\n File \"/opt/optimum-tpu/optimum/tpu/xla_mp_comm.py\", line 36, in send\n raise RuntimeError(\"Error on one of threads, stopping.\")\nRuntimeError: Error on one of threads, stopping.\n"},"target":"text_generation_launcher"} 2024/07/12 18:12:41 {"timestamp":"2024-07-12T22:12:41.828724Z","level":"ERROR","message":"Server error: Error on one of threads, stopping.","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":33,"span":{"id":0,"size":1,"name":"prefill"},"spans":[{"batch_size":1,"name":"batch"},{"name":"prefill"},{"id":0,"size":1,"name":"prefill"},{"id":0,"size":1,"name":"prefill"}]}

The base version works well.

Neal de Buhr comments: "the Instruct versions are a multi-turn chat interface whereas the base models are a single text generation/completion text box".

Simon

tengomucho commented 2 months ago

So the problem seems related with the client call. I haven't been able to reproduce this with a curl request, but I have been able to reproduce it using the web client. I will try to understand where the difference is.