huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

TGI keeps crashing with 'device-side assert triggered' #2121

stefanobranco closed this issue 1 month ago

stefanobranco commented 3 months ago

System Info

Text-generation-inference: v2.1.0+
Driver Version: 535.161.08
CUDA Version: 12.2
GPU: DGX with 8xH100 80GB

Reproduction

I'm running TGI with docker in a DGX with 8xH100.

docker run --restart=on-failure --env LOG_LEVEL=INFO --gpus all --ipc=host -p 8080:8080 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard 8 --port 8080 --max-input-length 34000 --max-total-tokens 32000 --max-batch-prefill-tokens 128000

Everything runs, but I get frequent crashes during inference. It happens with multiple models, but most frequently with WizardLM8x22B. At first I thought it had to do with CUDA graphs, but I think that was a red herring. Increasing max-batch-prefill-tokens does seem to make the error appear less often.

Might this be the same issue as https://github.com/huggingface/text-generation-inference/issues/1566?
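
Any request with a sufficiently large prefill seems able to trigger it eventually. Here is a minimal sketch of the kind of call I mean, a hypothetical probe using TGI's standard /generate endpoint against the container above (the exact prompt content doesn't matter):

import requests

# Hypothetical probe: one long-prompt request, i.e. a large prefill.
payload = {
    "inputs": "lorem " * 5000,             # long prompt
    "parameters": {"max_new_tokens": 64},
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=600)
print(resp.status_code, resp.text[:200])

When it hits, the launcher logs the following: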

2024-06-26T07:35:08.486443Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608935911/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2395, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'device-side assert triggered'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 91, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 261, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 146, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1094, in generate_token
    out, speculative_logits = self.forward(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1047, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 651, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 583, in forward
    hidden_states = self.embed_tokens(input_ids)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 233, in forward
    torch.distributed.all_reduce(out, group=self.process_group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 77, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 50, in _get_msg_dict
    "args": f"{args}, {kwargs}",
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 464, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 331, in _tensor_str
    self = self.float()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-06-26T07:35:08.486444Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608935911/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2395, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'device-side assert triggered'

During handling of the above exception, another exception occurred:

...
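
Aside on reading the trace: CUDA kernels execute asynchronously, so a device-side assert (plausibly from the embed_tokens lookup above, though that's a guess) only surfaces at some later API call; here it is first caught in all_reduce, and then again while the c10d logger tries to print the tensor. A minimal illustration of the failure mode, not TGI code, assuming any CUDA GPU:

import torch

emb = torch.nn.Embedding(10, 4).cuda()
ids = torch.tensor([3, 42], device="cuda")  # 42 is out of range for 10 rows
out = emb(ids)            # the lookup kernel launches and asserts on the device
torch.cuda.synchronize()  # the error typically only surfaces here:
                          # RuntimeError: CUDA error: device-side assert triggered

Rerunning with NCCL_DEBUG=INFO and CUDA_LAUNCH_BLOCKING=1, as the messages themselves suggest, should pin the error to the actual faulting kernel.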

Expected behavior

If the environment supports the configured batch and token limits, prefill should complete without errors.

stefanobranco commented 3 months ago

Sometimes this also just seems to hang the server indefinitely. I get a debug entry for generate, but nothing further happens:

DEBUG generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.1), repetition_penalty: Some(1.2), frequency_penalty: None, top_k: None [...]
text_generation_router::server: router/src/server.rs:185: Input: [...]

edit: From what I can tell, this is the final output before the server gets stuck:

2024-06-26T10:51:06.583336Z DEBUG next_batch{min_size=None max_size=None prefill_token_budget=96000 token_budget=177600}: text_generation_router::infer::v3::queue: router/src/infer/v3/queue.rs:318: Accepting entry
2024-06-26T10:51:06.583498Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583502Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583497Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583513Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583519Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583531Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583666Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583798Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.584074Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584080Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584079Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584087Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584107Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584111Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584120Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584127Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584135Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584140Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584148Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584162Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584165Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584176Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584206Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584223Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584231Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584265Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584273Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584277Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584319Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584416Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584429Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584473Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.654835Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [230, 203, 84, 34, 176, 210, 115, 2] }
2024-06-26T10:51:06.654847Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [230, 203, 84, 34, 176, 210, 115, 2] }
2024-06-26T10:51:08.983960Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [96, 75, 106, 55, 0, 178, 95, 167] }
2024-06-26T10:51:08.983972Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [96, 75, 106, 55, 0, 178, 95, 167] }
2024-06-26T10:51:09.583407Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [255, 94, 167, 52, 201, 71, 56, 69] }
2024-06-26T10:51:09.583418Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [255, 94, 167, 52, 201, 71, 56, 69] }
2024-06-26T10:51:10.209552Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [194, 95, 72, 29, 208, 65, 68, 93] }
2024-06-26T10:51:10.209563Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [194, 95, 72, 29, 208, 65, 68, 93] }
2024-06-26T10:51:10.464379Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [12, 206, 178, 92, 23, 251, 21, 144] }
2024-06-26T10:51:10.464390Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [12, 206, 178, 92, 23, 251, 21, 144] }
2024-06-26T10:51:10.641784Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [243, 91, 107, 187, 113, 48, 53, 194] }
2024-06-26T10:51:10.641795Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [243, 91, 107, 187, 113, 48, 53, 194] }
2024-06-26T10:51:10.903416Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [89, 55, 95, 85, 205, 74, 65, 44] }
2024-06-26T10:51:10.903427Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [89, 55, 95, 85, 205, 74, 65, 44] }
2024-06-26T10:51:11.489977Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [23, 239, 155, 130, 199, 243, 20, 8] }
2024-06-26T10:51:11.489988Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [23, 239, 155, 130, 199, 243, 20, 8] }

And then nothing, apart from the log entries showing that further requests come in, as described above.

edit 2: Right before that I get what seems like a very large block allocation:

Allocation: BlockAllocation { blocks: [9100, [...], 177598], block_allocator: BlockAllocator { block_allocator: UnboundedSender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f77f0004800, tail_position: 73 }, semaphore: Semaphore(0), rx_waker: AtomicWaker, tx_count: 2, rx_fields: "..." } } } } }

I'm sorry if this is not relevant; I'm just trying to provide every bit of information that stands out to me.

erfanium commented 2 months ago

Same here after upgrading from v2.0.1 to v2.1.0.

bwhartlove commented 2 months ago

I've seen a similar issue: in my case, multi-GPU support seemed non-functional after upgrading to v2.1.0. Once I disabled sharding, the issue subsided.

RohanSohani30 commented 2 months ago

When I load the model using Docker on a single GPU, it takes 11250 MB of GPU memory; with 2 shards it takes approximately the same amount on each GPU, which doubles the total compared to a single shard. Sharding is supposed to split the model across the two GPUs, with each holding roughly half of the single-GPU footprint.

Sharding works perfectly using the TGI CLI, but inference time is higher there, possibly because exllama, vllm, and related libraries are not installed. Do you happen to have any idea about this?
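
For anyone wanting to sanity-check the per-GPU split, a quick sketch (assumes PyTorch is importable inside the container, which it is in the TGI image):

import torch

# Print the memory actually in use on each visible GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 2**30:.1f} GiB used of {total / 2**30:.1f} GiB")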

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.