huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

TGI hard crashes after 1 OOM error #1960

Closed: pranavthombare closed this issue 2 months ago

pranavthombare commented 4 months ago

System Info

TGI Docker image on GCP
GPU: A100
Model: Phi-3

Reproduction

  1. Load the Phi-3 model (pranavthombare/Phi-3-mini-4k-construct).
  2. Run the benchmark command: text-generation-benchmark -t pranavthombare/Phi-3-mini-4k-construct -s 512
  3. After the server runs out of memory, run the same command again (a client-side equivalent is sketched below).
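
For reference, the same failure can also be driven from a plain HTTP client instead of the benchmark tool. The sketch below is a rough, untested approximation: the port, prompt length, and token count are assumptions, and whether a single request is enough to OOM depends on the GPU, model, and server settings.

    import requests

    # Assumption: TGI is serving pranavthombare/Phi-3-mini-4k-construct locally on port 8080.
    TGI_URL = "http://localhost:8080/generate"

    def fire_request(prompt: str, max_new_tokens: int = 512) -> int:
        """Send one /generate request and return the HTTP status code."""
        resp = requests.post(
            TGI_URL,
            json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
            timeout=600,
        )
        return resp.status_code

    if __name__ == "__main__":
        # A deliberately long prompt to push memory usage up; tune it for your setup.
        long_prompt = "hello " * 2000
        for attempt in (1, 2):
            try:
                print(f"attempt {attempt}:", fire_request(long_prompt))
            except requests.exceptions.ConnectionError:
                # Once a shard hits CUDA OOM, the whole launcher exits, so later
                # attempts fail at the connection level rather than getting an
                # error response for just the offending request.
                print(f"attempt {attempt}: connection refused, launcher has exited")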

Expected behavior

A single CUDA OOM error should fail only the offending request; the TGI launcher should not hard crash.

pranavthombare commented 4 months ago

Below is the error I'm getting:


    "timestamp": "2024-05-27T12:04:51.372064Z",
    "level": "ERROR",
    "fields": {
        "message": """'Shard complete standard error output:

        The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
        /opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
        warnings.warn(
        Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
        A new version of the following files was downloaded from https://huggingface.co/pranavthombare/Phi-3-mini-4k-construct:
        - configuration_phi3.py
        . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
        /opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class \'text_generation_server.utils.dist.FakeGroup\'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
        warnings.warn(
        Exception ignored in: <function Server.__del__ at 0x7c5ff5530550>
        Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 186, in __del__
            cygrpc.schedule_coro_threadsafe(
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
            self._check_closed()
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
            raise RuntimeError(\'Event loop is closed\')
        RuntimeError: Event loop is closed
        sys:1: RuntimeWarning: coroutine \'AioServer.shutdown\' was never awaited
        Task exception was never retrieved
        future: <Task finished name=\'Task-2218\' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
        Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
            return await response
        File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
            raise error
        File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
            return await behavior(request_or_iterator, context)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 144, in Prefill
            generations, next_batch, timings = self.model.generate_token(batch)
        File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
            return func(*args, **kwds)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 960, in generate_token
            raise e
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 957, in generate_token
            out, speculative_logits = self.forward(batch)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 900, in forward
            return self.model.forward(
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in forward
            hidden_states = self.model(
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
            return self._call_impl(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
            return forward_call(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 340, in forward
            hidden_states, residual = layer(
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
            return self._call_impl(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
            return forward_call(*args, **kwargs)
        File"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 279, in forward
            mlp_output = self.mlp(normed_attn_res_output)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
            return self._call_impl(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
            return forward_call(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 226, in forward
            return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
        torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
            return get_command(self)(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
            return self.main(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
            return _main(
        File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
            rv = self.invoke(ctx)
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
            return _process_result(sub_ctx.command.invoke(sub_ctx))
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
            return ctx.invoke(self.callback, **ctx.params)
        File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
            return __callback(*args, **kwargs)
        File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
            return callback(**use_params)  # type: ignore
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
            server.serve(
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
            asyncio.run(
        File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
            return loop.run_until_complete(main)
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
            self.run_forever()
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
            self._run_once()
        File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
            handle._run()
        File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
            self._context.run(self._callback, *self._args)
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
        File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
        File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
            return await self.intercept(
        File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
            exit(1)
        File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
            raise SystemExit(code)
        SystemExit: 1"
    },
    "target": "text_generation_launcher",
    "span": {"rank": 0, "name": "shard-manager"},
    "spans": [{"rank": 0, "name": "shard-manager"}],
}
pranavthombare commented 4 months ago

I don't think it's a model-specific issue. I still need to reproduce it with other models, although this never used to happen before TGI 2.0.

pranavthombare commented 3 months ago

I am able to reproduce it with Mistral and Llama models as well.

pranavthombare commented 3 months ago

https://github.com/huggingface/text-generation-inference/pull/1736/files#diff-d92dc83f92b9c93839931357ef40af2ba48f62e5598a59e7478beebce4e5688eR26

I think this is the reason why.
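
For context, here is a minimal sketch of the interceptor behaviour implied by the traceback and the linked diff. This is not the actual TGI source (names and the exact branch condition may differ); the point is that torch.cuda.OutOfMemoryError is a subclass of RuntimeError, so an OOM on a single request ends in exit(1), which takes the shard down and then the launcher with it, instead of failing only that request.

    import grpc
    import torch
    from grpc_interceptor.server import AsyncServerInterceptor

    class ExceptionInterceptor(AsyncServerInterceptor):
        """Simplified sketch only, not the actual TGI implementation."""

        async def intercept(self, method, request_or_iterator, context, method_name):
            try:
                response = method(request_or_iterator, context)
                return await response  # interceptor.py:21 in the traceback above
            except Exception as err:
                # torch.cuda.OutOfMemoryError inherits from RuntimeError, so a
                # per-request OOM lands in this branch and kills the shard process.
                if isinstance(err, RuntimeError):
                    exit(1)  # interceptor.py:28 -> the SystemExit(1) in the log

                # Assumed older behaviour: free the cache and fail only the
                # offending request, leaving the server running.
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                await context.abort(grpc.StatusCode.INTERNAL, str(err))

If that reading is right, handling OOM separately from other RuntimeErrors (clear the cache and abort only the offending request) would restore the pre-2.0 behaviour described above.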

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.