huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

TGI keeps crashing with 'device-side assert triggered' #2121

stefanobranco closed this issue 1 month ago

stefanobranco commented 3 months ago

System Info

Text-generation-inference: v2.1.0+
Driver Version: 535.161.08
CUDA Version: 12.2
GPU: DGX with 8xH100 80GB

Reproduction

I'm running TGI with docker in a DGX with 8xH100.

docker run --restart=on-failure --env LOG_LEVEL=INFO --gpus all --ipc=host -p 8080:8080 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard 8 --port 8080 --max-input-length 34000 --max-total-tokens 32000 --max-batch-prefill-tokens 128000

Everything runs, but I get frequent crashes during inference. It happens with multiple models, but most frequently with WizardLM8x22B. At first I thought it had to do with CUDA graphs, but I think that was a red herring. Increasing max-batch-prefill-tokens does seem to make the error appear less often.

Might this be the same issue as https://github.com/huggingface/text-generation-inference/issues/1566?
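
Any request with a sufficiently large prefill seems able to trigger it eventually. Here is a minimal sketch of the kind of call I mean, a hypothetical probe using TGI's standard /generate endpoint against the container above (the exact prompt content doesn't matter):

import requests

# Hypothetical probe: one long-prompt request, i.e. a large prefill.
payload = {
    "inputs": "lorem " * 5000,             # long prompt
    "parameters": {"max_new_tokens": 64},
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=600)
print(resp.status_code, resp.text[:200])

When it hits, the launcher logs the following: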

2024-06-26T07:35:08.486443Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608935911/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2395, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'device-side assert triggered'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 91, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 261, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 146, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1094, in generate_token
    out, speculative_logits = self.forward(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1047, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 651, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 583, in forward
    hidden_states = self.embed_tokens(input_ids)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 233, in forward
    torch.distributed.all_reduce(out, group=self.process_group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 77, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 50, in _get_msg_dict
    "args": f"{args}, {kwargs}",
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 464, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 697, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 617, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 331, in _tensor_str
    self = self.float()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-06-26T07:35:08.486444Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1712608935911/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2395, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'device-side assert triggered'

During handling of the above exception, another exception occurred:

...
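
Aside on reading the trace: CUDA kernels execute asynchronously, so a device-side assert (plausibly from the embed_tokens lookup above, though that's a guess) only surfaces at some later API call; here it is first caught in all_reduce, and then again while the c10d logger tries to print the tensor. A minimal illustration of the failure mode, not TGI code, assuming any CUDA GPU:

import torch

emb = torch.nn.Embedding(10, 4).cuda()
ids = torch.tensor([3, 42], device="cuda")  # 42 is out of range for 10 rows
out = emb(ids)            # the lookup kernel launches and asserts on the device
torch.cuda.synchronize()  # the error typically only surfaces here:
                          # RuntimeError: CUDA error: device-side assert triggered

Rerunning with NCCL_DEBUG=INFO and CUDA_LAUNCH_BLOCKING=1, as the messages themselves suggest, should pin the error to the actual faulting kernel.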

Expected behavior

If the environment supports the configured batch and token limits, prefill should complete without errors.

stefanobranco commented 3 months ago

Sometimes this also just seems to hang the server indefinitely. I get a debug entry for generate, but nothing further happens:

DEBUG generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.1), repetition_penalty: Some(1.2), frequency_penalty: None, top_k: None [...]
text_generation_router::server: router/src/server.rs:185: Input: [...]

edit: From what I can tell, this is the final output before the server gets stuck:

2024-06-26T10:51:06.583336Z DEBUG next_batch{min_size=None max_size=None prefill_token_budget=96000 token_budget=177600}: text_generation_router::infer::v3::queue: router/src/infer/v3/queue.rs:318: Accepting entry
2024-06-26T10:51:06.583498Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583502Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583497Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583513Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583519Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583531Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583666Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583798Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.584074Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584080Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584079Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584087Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584107Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584111Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584120Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584127Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584135Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584140Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584148Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584162Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584165Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584176Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584206Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584223Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584231Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584265Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584273Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584277Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584319Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584416Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584429Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584473Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.654835Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [230, 203, 84, 34, 176, 210, 115, 2] }
2024-06-26T10:51:06.654847Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [230, 203, 84, 34, 176, 210, 115, 2] }
2024-06-26T10:51:08.983960Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [96, 75, 106, 55, 0, 178, 95, 167] }
2024-06-26T10:51:08.983972Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [96, 75, 106, 55, 0, 178, 95, 167] }
2024-06-26T10:51:09.583407Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [255, 94, 167, 52, 201, 71, 56, 69] }
2024-06-26T10:51:09.583418Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [255, 94, 167, 52, 201, 71, 56, 69] }
2024-06-26T10:51:10.209552Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [194, 95, 72, 29, 208, 65, 68, 93] }
2024-06-26T10:51:10.209563Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [194, 95, 72, 29, 208, 65, 68, 93] }
2024-06-26T10:51:10.464379Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [12, 206, 178, 92, 23, 251, 21, 144] }
2024-06-26T10:51:10.464390Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [12, 206, 178, 92, 23, 251, 21, 144] }
2024-06-26T10:51:10.641784Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [243, 91, 107, 187, 113, 48, 53, 194] }
2024-06-26T10:51:10.641795Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [243, 91, 107, 187, 113, 48, 53, 194] }
2024-06-26T10:51:10.903416Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [89, 55, 95, 85, 205, 74, 65, 44] }
2024-06-26T10:51:10.903427Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [89, 55, 95, 85, 205, 74, 65, 44] }
2024-06-26T10:51:11.489977Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [23, 239, 155, 130, 199, 243, 20, 8] }
2024-06-26T10:51:11.489988Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [23, 239, 155, 130, 199, 243, 20, 8] }

And then nothing, apart from the log entries showing that further requests come in, as described above.

edit 2: Right before that I get what seems like a very large block allocation:

Allocation: BlockAllocation { blocks: [9100, [...], 177598], block_allocator: BlockAllocator { block_allocator: UnboundedSender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f77f0004800, tail_position: 73 }, semaphore: Semaphore(0), rx_waker: AtomicWaker, tx_count: 2, rx_fields: "..." } } } } }

I'm sorry if this is not relevant; I'm just trying to provide every bit of information that stands out to me.

erfanium commented 2 months ago

Same here after upgrading from v2.0.1 to v2.1.0.

bwhartlove commented 2 months ago

I've seen a similar issue: in my case, multi-GPU support seemed non-functional after upgrading to v2.1.0. Once I disabled sharding, the issue subsided.

RohanSohani30 commented 2 months ago

When I load the model using Docker on a single GPU, it takes 11250 MB of GPU memory; with 2 shards it takes approximately the same amount on each GPU, which doubles the total compared to a single shard. Sharding is supposed to split the model across the two GPUs, with each holding roughly half of the single-GPU footprint.

Sharding works perfectly using the TGI CLI, but inference time is higher there, possibly because exllama, vllm, and related libraries are not installed. Do you happen to have any idea about this?
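
For anyone wanting to sanity-check the per-GPU split, a quick sketch (assumes PyTorch is importable inside the container, which it is in the TGI image):

import torch

# Print the memory actually in use on each visible GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 2**30:.1f} GiB used of {total / 2**30:.1f} GiB")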

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.