Closed stefanobranco closed 1 month ago
Sometimes this also just causes the server to hang indefinitely, it seems. I get a DEBUG entry for generate, but nothing further happens:
DEBUG generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.1), repetition_penalty: Some(1.2), frequency_penalty: None, top_k: None [...]
text_generation_router::server: router/src/server.rs:185: Input: [...]
edit: From what I can tell, this is the final output before the server gets stuck:
2024-06-26T10:51:06.583336Z DEBUG next_batch{min_size=None max_size=None prefill_token_budget=96000 token_budget=177600}: text_generation_router::infer::v3::queue: router/src/infer/v3/queue.rs:318: Accepting entry
2024-06-26T10:51:06.583498Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583502Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583497Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583513Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583519Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583531Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583666Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.583798Z DEBUG batch{batch_size=1}:prefill:prefill{id=36 size=1}:prefill{id=36 size=1}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-06-26T10:51:06.584074Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584080Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584079Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584087Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584107Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584111Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584120Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584127Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584135Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584140Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584148Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584162Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584165Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584176Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584206Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584223Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584231Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584265Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584273Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584277Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584319Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.584416Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(24717), flags: (0x4: END_HEADERS) }
2024-06-26T10:51:06.584429Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717) }
2024-06-26T10:51:06.584473Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(24717), flags: (0x1: END_STREAM) }
2024-06-26T10:51:06.654835Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [230, 203, 84, 34, 176, 210, 115, 2] }
2024-06-26T10:51:06.654847Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [230, 203, 84, 34, 176, 210, 115, 2] }
2024-06-26T10:51:08.983960Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [96, 75, 106, 55, 0, 178, 95, 167] }
2024-06-26T10:51:08.983972Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [96, 75, 106, 55, 0, 178, 95, 167] }
2024-06-26T10:51:09.583407Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [255, 94, 167, 52, 201, 71, 56, 69] }
2024-06-26T10:51:09.583418Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [255, 94, 167, 52, 201, 71, 56, 69] }
2024-06-26T10:51:10.209552Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [194, 95, 72, 29, 208, 65, 68, 93] }
2024-06-26T10:51:10.209563Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [194, 95, 72, 29, 208, 65, 68, 93] }
2024-06-26T10:51:10.464379Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [12, 206, 178, 92, 23, 251, 21, 144] }
2024-06-26T10:51:10.464390Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [12, 206, 178, 92, 23, 251, 21, 144] }
2024-06-26T10:51:10.641784Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [243, 91, 107, 187, 113, 48, 53, 194] }
2024-06-26T10:51:10.641795Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [243, 91, 107, 187, 113, 48, 53, 194] }
2024-06-26T10:51:10.903416Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [89, 55, 95, 85, 205, 74, 65, 44] }
2024-06-26T10:51:10.903427Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [89, 55, 95, 85, 205, 74, 65, 44] }
2024-06-26T10:51:11.489977Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_read.rs:405: received frame=Ping { ack: false, payload: [23, 239, 155, 130, 199, 243, 20, 8] }
2024-06-26T10:51:11.489988Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.26/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [23, 239, 155, 130, 199, 243, 20, 8] }
And then nothing, other than the recognition that further requests come in, as described above.
edit 2:
Right before that I see what appears to be a very large block allocation:
Allocation: BlockAllocation { blocks: [9100, [...], 177598], block_allocator: BlockAllocator { block_allocator: UnboundedSender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f77f0004800, tail_position: 73 }, semaphore: Semaphore(0), rx_waker: AtomicWaker, tx_count: 2, rx_fields: "..." } } } } }
I'm sorry if this is not relevant; I'm just trying to provide every bit of information that stands out to me.
Same here after upgrading from v2.0.1 to v2.1.0.
I've seen a similar issue: in my case, multi-GPU support was seemingly non-functional after upgrading to v2.1.0. Once I disabled sharding, the issue subsided.
When I load the model using Docker on a single GPU, it takes 11250GB of GPU memory; with 2 shards it takes approximately the same amount on each of the two GPUs, which doubles the footprint of a single shard. Sharding is supposed to split my model across the two GPUs at approximately half the original size each.
Sharding works perfectly using the TGI CLI, but inference time is higher with the CLI; this may be because exllama, vllm, and related libraries are not installed. Do you happen to have any idea about this?
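For reference, under tensor parallelism each shard is expected to hold roughly model_size / num_shards of the weights plus some per-GPU overhead (CUDA context, KV cache, workspaces), so seeing each GPU near the full single-GPU footprint suggests the split is not happening. A rough back-of-envelope sketch (all numbers are illustrative assumptions, not measurements from this setup):

```shell
# Back-of-envelope: expected per-GPU memory under an even tensor-parallel
# split. All values are hypothetical, chosen only to illustrate the arithmetic.
MODEL_GB=22        # assumed total weight size
NUM_SHARDS=2
OVERHEAD_GB=3      # assumed per-GPU overhead (CUDA context, KV cache, ...)

PER_SHARD=$(( MODEL_GB / NUM_SHARDS + OVERHEAD_GB ))
echo "expected per-GPU: ${PER_SHARD} GB"
```

If each GPU instead reports close to the full model size, the shards are likely each loading the whole model rather than a slice of it.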
System Info
Text-generation-inference: v2.1.0+
Driver Version: 535.161.08
CUDA Version: 12.2
GPU: DGX with 8xH100 80GB
Information
Tasks
Reproduction
I'm running TGI with Docker on a DGX with 8xH100.
docker run --restart=on-failure --env LOG_LEVEL=INFO --gpus all --ipc=host -p 8080:8080 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard 8 --port 8080 --max-input-length 34000 --max-total-tokens 32000 --max-batch-prefill-tokens 128000
Everything runs, but I get frequent crashes during inference. It happens with multiple models, but most frequently with WizardLM 8x22B. At first I thought it had to do with CUDA graphs, but I think that was a red herring. Increasing max-batch-prefill-tokens does appear to make the error occur less often.
I think it might be the same issue as this one: https://github.com/huggingface/text-generation-inference/issues/1566
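Separately, one thing worth double-checking in the command above: max-input-length (34000) is larger than max-total-tokens (32000), while TGI expects the input length to stay below the total token budget (input plus generated tokens). A minimal sanity check of that relationship, using the values copied from the command:

```shell
# Sanity-check the launcher flags from the docker command above:
# TGI expects max-input-length < max-total-tokens.
MAX_INPUT_LENGTH=34000
MAX_TOTAL_TOKENS=32000

if [ "$MAX_INPUT_LENGTH" -ge "$MAX_TOTAL_TOKENS" ]; then
  echo "inconsistent: max-input-length must be below max-total-tokens"
else
  echo "flags consistent"
fi
```

It may be unrelated to the crashes, but ruling out a mis-set token budget is cheap before digging into the scheduler.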
Expected behavior
If the environment supports the configured max-batch-size, the server should be able to prefill without errors.