huggingface / tgi-gaudi

Large Language Model Text Generation Inference on Habana Gaudi
http://hf.co/docs/text-generation-inference
Apache License 2.0

When running Llama 2 7B, concurrent inference on several 2k-length prompts causes the TGI service to crash. #216

Closed: yao531441 closed this issue 3 days ago

yao531441 commented 3 months ago

System Info

[screenshot: system information]

Information

Tasks

Reproduction

docker run -p 18080:80 --runtime=habana -v /data/huggingface/hub:/data -e HABANA_VISIBLE_DEVICES=all -e HUGGING_FACE_HUB_TOKEN=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 -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true -e PREFILL_BATCH_BUCKET_SIZE=2 -e BATCH_BUCKET_SIZE=32 -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.4 --model-id meta-llama/Llama-2-7b-chat-hf --max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 2048 --max-batch-total-tokens 65536 --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64
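For reference, below is a minimal client sketch of the kind of concurrent load described in the title. It assumes the container launched above is reachable at localhost:18080 and uses TGI's standard /generate endpoint; the prompt text, request count, and timeout are illustrative placeholders, not values from the original report.

# concurrent_2k_prompts.py -- illustrative sketch only; prompt text and request count are assumptions.
import concurrent.futures
import requests

URL = "http://localhost:18080/generate"            # port mapped by `-p 18080:80` above
PROMPT = "word " * 2000                            # placeholder text roughly 2k tokens long
PAYLOAD = {"inputs": PROMPT, "parameters": {"max_new_tokens": 500}}

def send_request(i: int) -> int:
    # Each worker posts its own request so the server sees them concurrently.
    resp = requests.post(URL, json=PAYLOAD, timeout=600)
    return resp.status_code

if __name__ == "__main__":
    # 32 simultaneous requests, mirroring BATCH_BUCKET_SIZE=32 from the launch command.
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        for status in pool.map(send_request, range(32)):
            print(status)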

Error log

2024-08-30T02:09:44.146922Z  INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="52.558739096s" validation_time="2.976352ms" queue_time="23.703336184s" inference_time="28.852426791s" time_per_token="57.704853ms" seed="None"}: text_generation_router::server: router/src/server.rs:513: Success
2024-08-30T02:09:44.877111Z  INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="52.558665697s" validation_time="1.514834ms" queue_time="23.709660453s" inference_time="28.847490833s" time_per_token="57.694981ms" seed="None"}: text_generation_router::server: router/src/server.rs:513: Success
2024-08-30T02:09:45.863818Z ERROR text_generation_launcher: Method Decode encountered an error.
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 137, in serve
    server.serve(
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 256, in serve
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 154, in Decode
    generations, next_batch, timings = self.model.generate_token(batches)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 997, in generate_token
    batch.logits = self.forward(
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 870, in forward
    return self.model.forward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 724, in forward
    return wrapped_hpugraph_forward(
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 643, in wrapped_hpugraph_forward
    cached.graph.replayV3(input_tensor_list, cached.asynchronous)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 76, in replayV3
    _hpu_C.replayV3(self.hpu_graph, tlistI, asynchronous)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
Check $HABANA_LOGS/ for details[Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::1073741824 (1024)MB
[Rank:0] Habana exception raised from get_pointer at device_memory.cpp:1078
2024-08-30T02:09:46.039747Z ERROR batch{batch_size=16}:decode:decode{size=16}:decode{size=16}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
2024-08-30T02:09:47.968375Z ERROR batch{batch_size=16}:decode:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
2024-08-30T02:09:47.968553Z ERROR batch{batch_size=16}:decode:clear_cache{batch_id=Some(72)}:clear_cache{batch_id=Some(72)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.968584Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
2024-08-30T02:09:47.968613Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
2024-08-30T02:09:47.968632Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
2024-08-30T02:09:47.968649Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED

batch{batch_size=1}:prefill:clear_cache{batch_id=Some(74)}:clear_cache{batch_id=Some(74)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969441Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969517Z ERROR batch{batch_size=1}:prefill:prefill{id=75 size=1}:prefill{id=75 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969560Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(75)}:clear_cache{batch_id=Some(75)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969575Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969645Z ERROR batch{batch_size=1}:prefill:prefill{id=76 size=1}:prefill{id=76 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969705Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(76)}:clear_cache{batch_id=Some(76)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969720Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969791Z ERROR batch{batch_size=1}:prefill:prefill{id=77 size=1}:prefill{id=77 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969834Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(77)}:clear_cache{batch_id=Some(77)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969849Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969917Z ERROR batch{batch_size=1}:prefill:prefill{id=78 size=1}:prefill{id=78 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969955Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(78)}:clear_cache{batch_id=Some(78)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969970Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.970036Z ERROR batch{batch_size=1}:prefill:prefill{id=79 size=1}:prefill{id=79 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.970078Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(79)}:clear_cache{batch_id=Some(79)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.970094Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.970160Z ERROR batch{batch_size=1}:prefill:prefill{id=80 size=1}:prefill{id=80 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.970198Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(80)}:clear_cache{batch_id=Some(80)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.970213Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.970278Z ERROR batch{batch_size=1}:prefill:prefill{id=81 size=1}:prefill{id=81 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.970318Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(81)}:clear_cache{batch_id=Some(81)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.970334Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.000537Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 192
CPU RAM       : 2113389016 KB
------------------------------------------------------------------------------
Exception ignored in: <function Server.__del__ at 0x7f611e95c790>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/grpc/aio/_server.py", line 194, in __del__
    cygrpc.schedule_coro_threadsafe(
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
  File "/usr/lib/python3.10/asyncio/base_events.py", line 436, in create_task
    self._check_closed()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
Task exception was never retrieved
future: <Task finished name='HandleExceptions[/generate.v2.TextGenerationService/Decode]' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 154, in Decode
    generations, next_batch, timings = self.model.generate_token(batches)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 997, in generate_token
    batch.logits = self.forward(
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 870, in forward
    return self.model.forward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 724, in forward
    return wrapped_hpugraph_forward(
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 643, in wrapped_hpugraph_forward
    cached.graph.replayV3(input_tensor_list, cached.asynchronous)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 76, in replayV3
    _hpu_C.replayV3(self.hpu_graph, tlistI, asynchronous)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
Check $HABANA_LOGS/ for details[Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::1073741824 (1024)MB
[Rank:0] Habana exception raised from get_pointer at device_memory.cpp:1078

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 137, in serve
    server.serve(
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 256, in serve
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 831, in _handle_rpc
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
  File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 33, in intercept
    exit(1)
  File "/usr/lib/python3.10/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: 1 rank=0
2024-08-30T02:09:48.006377Z ERROR batch{batch_size=1}:prefill:prefill{id=82 size=1}:prefill{id=82 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.006461Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(82)}:clear_cache{batch_id=Some(82)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.006484Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)

2024-08-30T02:09:48.062231Z ERROR batch{batch_size=1}:prefill:prefill{id=118 size=1}:prefill{id=118 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062267Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(118)}:clear_cache{batch_id=Some(118)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062276Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062891Z ERROR batch{batch_size=1}:prefill:prefill{id=119 size=1}:prefill{id=119 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062914Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(119)}:clear_cache{batch_id=Some(119)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062921Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.063923Z ERROR batch{batch_size=1}:prefill:prefill{id=120 size=1}:prefill{id=120 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.063944Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(120)}:clear_cache{batch_id=Some(120)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.063951Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.065252Z ERROR batch{batch_size=1}:prefill:prefill{id=121 size=1}:prefill{id=121 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.065270Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(121)}:clear_cache{batch_id=Some(121)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.065275Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.076439Z ERROR batch{batch_size=1}:prefill:prefill{id=122 size=1}:prefill{id=122 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.076463Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(122)}:clear_cache{batch_id=Some(122)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.076470Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.157110Z  INFO text_generation_launcher: webserver terminated
2024-08-30T02:09:48.157132Z  INFO text_generation_launcher: Shutting down shards
Error: ShardFailed

Expected behavior

The TGI server should return correct output results.

yao-matrix commented 2 months ago

@yuanwu2017 is looking into it.

yuanwu2017 commented 2 months ago

I can reproduce this issue. It is an OOM issue. Debugging is in progress.

yuanwu2017 commented 3 days ago

In this case the warmup is not performed because of incorrect startup parameters. After I corrected the startup parameters, the OOM happened during the warmup process, which means Llama-2-7B cannot run with batch_size 32; that is what causes the OOM issue. PREFILL_BATCH_BUCKET_SIZE should be 1, because max_prefill_batch_size = max-batch-prefill-tokens / max-input-length = 2048 / 2048 = 1.

max_decode_batch_size = max-batch-total-tokens / max-total-tokens = 65536 / 4096 = 16, but BATCH_BUCKET_SIZE=32. You need to set max-batch-total-tokens to 131072; the sketch below walks through this arithmetic.
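A small sketch of that arithmetic, using the flag values from the two docker commands in this thread (the variable names simply mirror the launcher options):

# batch_size_math.py -- sketch of the reasoning above; values come from the launcher flags.
max_input_length = 2048            # --max-input-length
max_total_tokens = 4096            # --max-total-tokens
max_batch_prefill_tokens = 2048    # --max-batch-prefill-tokens

# Largest prefill batch the flags allow: 2048 / 2048 = 1,
# hence PREFILL_BATCH_BUCKET_SIZE should be 1.
print(max_batch_prefill_tokens // max_input_length)    # 1

# With the original --max-batch-total-tokens 65536, the decode batch tops out at 16,
# which is smaller than BATCH_BUCKET_SIZE=32.
print(65536 // max_total_tokens)                        # 16

# To fit a decode batch of 32, max-batch-total-tokens must be at least 32 * 4096.
print(32 * max_total_tokens)                            # 131072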

docker run -p 18080:80 --runtime=habana -v /data/huggingface/hub:/data -e HABANA_VISIBLE_DEVICES=all -e HUGGING_FACE_HUB_TOKEN=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 -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=32 -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.4 --model-id meta-llama/Llama-2-7b-chat-hf --max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 2048 --max-batch-total-tokens 131072 --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64

yuanwu2017 commented 3 days ago

Optimum-habana also cannot support batch_size=32 with max_input_length=2048. See https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation
Command: python run_generation.py --model_name_or_path meta-llama/Llama-2-7b-chat-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 2048 --max_input_tokens 2048 --do_sample --batch_size 32 --prompt "How are you?" --bf16

[screenshot]
yuanwu2017 commented 3 days ago

@regisss @mandy-li Please close this issue.

yuanwu2017 commented 3 days ago

@yao-matrix