jamied157 opened 7 months ago
Okay, I've managed to create a reproduction independent of our setup. First, you need to define a torch model like so:
import torch
import torch.nn as nn


class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs, state):
        print("Executing with; ", inputs, state)
        inputs = inputs + state
        state = inputs + state
        return inputs, state


def main():
    scripted_model = torch.jit.script(Model())
    torch.jit.save(scripted_model, "simple_model.pt")


if __name__ == "__main__":
    main()
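As a quick sanity check (not part of the repro steps), you can load the scripted model back and step it once to see how the state accumulates; plain CPU float32 tensors are fine for this:

import torch

model = torch.jit.load("simple_model.pt")
inputs = torch.ones(1, 5)
state = torch.zeros(1, 5)
out, state = model(inputs, state)  # out = inputs + state; new state = out + state
print(out, state)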
Then create two models, simple_model and simple_model_2 (as referenced in the ensemble below), with identical model configs apart from the name:
name: "simple_model"
max_batch_size: 1
input {
name: "INPUT__0"
data_type: TYPE_FP16
dims: 5
}
output {
name: "OUTPUT__0"
data_type: TYPE_FP16
dims: 5
}
sequence_batching {
state {
input_name: "INPUT__1"
output_name: "OUTPUT__0"
data_type: TYPE_FP16
dims: 5
initial_state: {
data_type: TYPE_FP16
dims: 5
zero_data: true
name: "initial state"
}
}
max_sequence_idle_microseconds: 10000000
oldest {
max_queue_delay_microseconds: 50000000
max_candidate_sequences: 2
}
}
backend: "pytorch"
And join them together with this ensemble model
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 1
input {
name: "INPUT__0"
data_type: TYPE_FP16
dims: 5
}
output {
name: "OUTPUT__0"
data_type: TYPE_FP16
dims: 5
}
ensemble_scheduling {
step {
model_name: "simple_model"
model_version: -1
input_map {
key: "INPUT__0"
value: "INPUT__0"
}
output_map {
key: "OUTPUT__0"
value: "SHARED_INPUT"
}
}
step {
model_name: "simple_model_2"
model_version: -1
input_map {
key: "INPUT__0"
value: "SHARED_INPUT"
}
output_map {
key: "OUTPUT__0"
value: "OUTPUT__0"
}
}
}
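For reference, a model repository laid out roughly like this should work (a sketch, not taken from the issue; the ./models root and file placement are assumptions, and the config.pbtxt contents are the ones shown above, with name: changed for the second copy):

# Sketch of one way to lay out the model repository for this repro, assuming
# a repository root of ./models and the simple_model.pt produced above.
import shutil
from pathlib import Path

root = Path("models")
for name in ("simple_model", "simple_model_2"):
    (root / name / "1").mkdir(parents=True, exist_ok=True)
    shutil.copy("simple_model.pt", root / name / "1" / "model.pt")  # pytorch backend expects model.pt
(root / "ensemble_model" / "1").mkdir(parents=True, exist_ok=True)  # empty version dir for the ensemble

# Resulting layout:
# models/
#   simple_model/config.pbtxt, simple_model/1/model.pt
#   simple_model_2/config.pbtxt, simple_model_2/1/model.pt
#   ensemble_model/config.pbtxt, ensemble_model/1/
# Then start the server with: tritonserver --model-repository=$(pwd)/models --log-verbose=1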
Launch a Triton server and target it with this script: https://gist.github.com/jamied157/a047d6314d83499c9811a642a895ad2a

This launches 3 clients against the server; two of them send 0's and one sends NaN inputs.
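The gist is the script to use, but the idea is roughly the following (a sketch, not the gist itself; it assumes the gRPC endpoint on localhost:8001 and runs the sequences one after another instead of as three parallel clients):

import numpy as np
import tritonclient.grpc as grpcclient

def run_sequence(client, seq_id, value, steps=10):
    # One "client": drives a single sequence of identical FP16 inputs.
    for step in range(steps):
        inp = grpcclient.InferInput("INPUT__0", [1, 5], "FP16")
        inp.set_data_from_numpy(np.full([1, 5], value, dtype=np.float16))
        result = client.infer(
            "ensemble_model",
            [inp],
            sequence_id=seq_id,
            sequence_start=(step == 0),
            sequence_end=False,  # no END flag; the slot is reclaimed by the idle timeout
        )
        print(seq_id, result.as_numpy("OUTPUT__0"))

client = grpcclient.InferenceServerClient("localhost:8001")
run_sequence(client, 1, 0.0)           # two sequences of zeros
run_sequence(client, 2, 0.0)
run_sequence(client, 3, float("nan"))  # one sequence of NaNs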
Then:
- Run the script in one terminal
- Then run the script in another terminal
- Cancel the original script and rerun

You should see a line like

Sequence: 208590 hit output_count: 6

twice. In the original terminal you should eventually get a message like
Value 0, sequence 654216, returned a different output, got [[nan nan nan nan nan]]
In the logs you should also be able to see a line like
Executing with; 0 0 0 0 0
[ CUDAHalfType{1,5} ] nan nan nan nan nan
[ CUDAHalfType{1,5} ]
This shows that the state for one of the zero sequences has been contaminated with NaNs.
I also have logs from log_verbose=1, which I've attached: simple_triton_output.log
Quick update on this: we think we've found the issue. We're seeing multiple calls to SequenceBatchScheduler::ReleaseSequenceSlot (https://github.com/triton-inference-server/core/blob/main/src/sequence_batch_scheduler/sequence_batch_scheduler.cc#L880) for the same sequence slot before the slot can be cleaned up correctly. This results in duplicate sequence slots being added to ready_batcher_seq_slots_ here: https://github.com/triton-inference-server/core/blob/main/src/sequence_batch_scheduler/sequence_batch_scheduler.cc#L958
@Tabrizian I have found the root cause, and it isn't just affecting ensemble models but potentially all sequence models that use implicit state and are accessed with the gRPC client: https://github.com/triton-inference-server/core/pull/341. TL;DR: gRPC request cancellation triggers a slot release in the sequence batcher, but the very same slot will still be released by the sequence batcher reaper thread, potentially leading to new clients sharing the same slot.
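To illustrate the failure mode in isolation, here is a toy sketch in plain Python (not Triton's actual scheduler code):

from queue import Queue

ready_slots = Queue()      # stand-in for ready_batcher_seq_slots_
ready_slots.put("slot-0")  # one free sequence slot

def release_slot(slot, source):
    # In the bug, both the cancellation path and the reaper thread end up
    # releasing the same slot.
    print(f"{source} released {slot}")
    ready_slots.put(slot)

slot = ready_slots.get()             # a sequence is assigned slot-0
release_slot(slot, "cancellation")   # the client cancels its request
release_slot(slot, "reaper thread")  # the idle-timeout reaper releases it again

# Two different new sequences are now handed the same slot, and with implicit
# state that means they share (and corrupt) each other's state.
print(ready_slots.get(), ready_slots.get())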
Thanks for proposing a fix @HennerM and filing a detailed GitHub issue @jamied157. We'll take a look at this and get back to you.
Just to add a bit more detail:
Hi @jamied157,
Thanks for such detailed repro steps and investigation!
I had a quick follow-up question from your description so far. You mentioned that you can reproduce this without request cancellation:
We're still seeing some issues with versions of triton before 23.10 where cancellation wasn't implemented
But it looks like your repro steps involve triggering a request cancellation by closing the client connection from the original grpc client script:
- Then run the script in another terminal
- Cancel the original script and rerun
Just to clarify - is the reproducible behavior the same prior to 23.10? Or are there different symptoms and triggers?
Another question @jamied157 @HennerM - given Henner's proposed fix for the sequence batcher issue for a standalone sequence model (no ensemble), and:
My Repro above can cause a slightly different issue where cancelled requests from the ensemble can free a sequence multiple times
Do you have any kind of ensemble repro with just one model, rather than a pipeline of two, for the sake of something more minimal?
I have a repro Python script which should work if you have pytorch and tritonserver installed: https://gist.github.com/HennerM/6ae28bc0360aae96bc5feb422aa1375c

Or, if you already have the model repository setup that Jamie mentioned above, then you should be able to run a variation of this:
import asyncio
import functools
import gc
import os
from typing import Optional

import numpy as np
from tritonclient.grpc import InferenceServerClient as SyncInferenceServerClient
from tritonclient.grpc.aio import InferenceServerClient, InferInput, InferResult


def create_input(seq_id: int):
    input = [InferInput("INPUT__0", [1, FIXED_LAST_DIM], "INT64")]
    input[0].set_data_from_numpy(
        np.full([1, FIXED_LAST_DIM], fill_value=seq_id, dtype=np.int64)
    )
    return input


def check_result(seq_id: int, res: asyncio.Future[Optional[InferResult]]):
    try:
        r = res.result()
        assert r is not None
        np_res = r.as_numpy("OUTPUT__0")
        assert np_res is not None
        if not np.all(np_res % seq_id == 0):
            print(f"====CROSSOVER==== {np_res=} for {seq_id=}, triton_result", flush=True)
            sys.exit(1)
    except asyncio.CancelledError:
        pass


async def continuous_inference(
    client: InferenceServerClient,
    model,
    offset: int,
    candidates=1,
    max_requests=10,
    time_between_requests=MAX_IDLE_SECONDS * 0.5,
    end_on_cancel: bool = True,
):
    for seq_id in range(offset, offset + candidates):
        print(f"Starting stream {seq_id}")
        t = asyncio.create_task(
            client.infer(
                model,
                create_input(seq_id),
                sequence_start=True,
                sequence_id=seq_id,
                request_id=f"{seq_id}_init",
            )
        )
        t.add_done_callback(functools.partial(check_result, seq_id))
        await asyncio.sleep(time_between_requests)

    for req_id in range(max_requests):
        for seq_id in range(offset, offset + candidates):
            print(f"Sending request {req_id} for {seq_id}")
            t = asyncio.create_task(
                client.infer(
                    model,
                    create_input(seq_id),
                    sequence_id=seq_id,
                    request_id=f"{seq_id}_{req_id}",
                    sequence_end=req_id == max_requests - 1,
                )
            )
            t.add_done_callback(functools.partial(check_result, seq_id))
            await asyncio.sleep(time_between_requests)


TARGET_MODEL = "simple_model"


async def simple_cancellation():
    print("Wait for Triton to be ready")
    await asyncio.sleep(2)
    client = InferenceServerClient(TRITON_URL)
    streams_1 = asyncio.create_task(
        continuous_inference(client, TARGET_MODEL, 100, candidates=1, max_requests=5, time_between_requests=0.1)
    )
    await asyncio.sleep(0.3)
    print("Cancelling stream")
    streams_1.cancel()
    await client.close()
    del client
    gc.collect()
    print("Wait for timeout")
    await asyncio.sleep(delay=MAX_IDLE_SECONDS + 2)

    bob = InferenceServerClient(TRITON_URL)
    print("Starting new streams")
    await asyncio.create_task(
        continuous_inference(bob, TARGET_MODEL, offset=200, candidates=4, max_requests=5, time_between_requests=1)
    )
Just to clarify - is the reproducible behavior the same prior to 23.10? Or are there different symptoms and triggers?
We have only reproduced similar behaviour (multiple sequences getting assigned the same slot and state) in our internal production Triton setup, and haven't yet managed to get this into a minimal repro. We will get back to you if we manage to reproduce it without cancellation.
Thanks @HennerM, appreciate the diligence here!
So the current status is:
1. Your PR (https://github.com/triton-inference-server/core/pull/341) fixes the double slot release triggered by gRPC request cancellation and needs review and testing.
2. There may be a separate issue that reproduces without cancellation (seen in your internal setup, including on versions before 23.10), but there's no minimal reproducer for it yet.
Is that right?
In the meantime, I can help take a look into (2) to see if I can find anything that looks logically related, but a reproducer would greatly help expedite that.
Thanks a lot. Definitely appreciate the help here. On (1), the review and test help will be good, as I struggled in the past to get the test setup working on my machine.
On (2), I should clarify that we actually haven't noticed the wrong behaviour with the patch above applied yet; it's just that the patch only deals with a problem triggered by cancellation, so we suspect it may not fix the whole problem. For us the sequence batcher code, potentially coupled with ensembles, is quite complex and there are a lot of edge cases that can happen in that combination. We just want to be doubly sure that no edge cases are missed.
Hi @HennerM, we're looking into this single-model reproducer you shared above, but have a few questions.
The code doesn't run as-is; it required a few best-guess modifications to try it out. Can you upload a complete script runnable as-is, just to make sure we're running the same thing? My local edits to get something running are below for reference (it also required modifying the config.pbtxt from TYPE_BF16 to TYPE_INT64, or changing the script to np.float16, vice versa), and I otherwise used the simple_model.pt as-is:
< import sys
12,15d10
< MAX_IDLE_SECONDS=10
< TRITON_URL="localhost:8001"
< FIXED_LAST_DIM=5
<
95,97d89
<
< if __name__ == "__main__":
< asyncio.run(simple_cancellation())
I'm assuming this check is supposed to get triggered for the reproduction:

print(f"====CROSSOVER==== {np_res=} for {seq_id=}, triton_result", flush=True)

but I ran the script quite a few times on the 24.03 release and never hit this case. Am I missing some steps to reproduce?

Having this simpler single-model reproducer that does cancellation in-line would be really helpful for further debugging, so any extra details you can share or step-by-step instructions with complete scripts to get it working would be very helpful!
Sorry, I am on UK hours. Here is an updated script: https://gist.github.com/HennerM/cde5037442a2efa4238246cc5ccb7b51
I think I didn't leave enough time for the Triton server to start. Let me know if you have more questions.
Thanks @HennerM! I was able to reproduce the crossover with your updated gist. Will reach out if we have more questions.
Description
We're still in the process of diagnosing this issue, but we're seeing some strange behaviour and would like to get some feedback/ideas from others.
We're using a setup that relies on two separate models using the oldest sequence scheduler combined with the ensemble model. Essentially
Input -> Model_1 -> Model_2 -> Output
Model 1 and model 2 use implicit state management with the PyTorch backend; they have an initial state but then use the output from the model to update the state.
Our issue tends to arise when Triton is under load: if we use up all the candidate sequence slots, don't send END flags on any of them, and instead let Triton time them out, we then see that Model_2 is in a bad state. In this state, if we run requests against Model_2, it's possible to leak information across requests and corrupt the outputs of the model. It tends to look something like this:

Here we're displaying the outputs of the model: we've corrupted one batch with NaN values and used random inputs for the others, yet the NaN values leak into the next request sent to the server (the batches are run sequentially here, so the leaking happens across executions rather than within one).

Given all this, it feels like something isn't quite working correctly in the interaction between the ensemble, the sequence timeout and implicit state management.
Triton Information
What version of Triton are you using?
We've reproduced this on NGC versions 22.12, 23.10 and 24.02.
Are you using the Triton container or did you build it yourself?
We've reproduced it both in containers based on the Triton container and in versions built by us.
To Reproduce
We're working on this; currently we can only reproduce it with our own models and configuration, which we can't share. Hopefully we can find a repro with some simpler models. The basic setup is two sequential models using implicit state management with the PyTorch backend, connected together with the ensemble scheduler. Then you need to time out some/all of the candidate sequences to get Model_2 into a bad state. Once Model_2 is in this state you can fairly reliably see the state corruption reported above. We still can't reliably reproduce this state, however, so we can't provide much more info.
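To make the trigger concrete, the pattern is roughly the following (a sketch rather than our actual load test; it assumes configs like the ones shared earlier in this thread, i.e. max_candidate_sequences: 2 and a 10 s idle timeout, and the gRPC endpoint on localhost:8001):

import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
for seq_id in (10, 11):  # one sequence per candidate slot
    inp = grpcclient.InferInput("INPUT__0", [1, 5], "FP16")
    inp.set_data_from_numpy(np.zeros([1, 5], dtype=np.float16))
    # Start the sequence but never send an END flag...
    client.infer("ensemble_model", [inp], sequence_id=seq_id, sequence_start=True)
# ...so the idle timeout has to reclaim the slots.
time.sleep(12)  # longer than max_sequence_idle_microseconds (10 s)
# Sequences started after this point can land in a corrupted slot and leak state.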
Expected behavior
We shouldn't be able to affect the outputs of the model with the inputs from a different client.