triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Feature Request: Server side callbacks to evict state #5453

Open jamied157 opened 1 year ago

jamied157 commented 1 year ago

Is your feature request related to a problem? Please describe. I want to use the PyTorch backend for my models, which use sequence batching. These models keep state for each client internally, so they need to maintain an internal clock that evicts a client's state when it looks like that client has timed out, similar to how it is done in the sequence batcher example (I previously discussed this here: https://github.com/triton-inference-server/server/issues/4629).

However, this can cause problems: the connection might close before the internal timer is triggered. In that case we may get requests from new clients that the server can't fulfil, because it is still holding the state from the older connection.

One solution to this (mentioned in the issue above) is to use implicit state management, but that isn't supported in the PyTorch backend. We also lose a bit of flexibility that way, and I'd like to write the state-resetting code myself.

Describe the solution you'd like Because Triton should know when a client loses its connection, it should be fairly simple to evict the state when the connection is lost. Ideally I could provide a callback function as part of the model that does this work, and Triton would call it when a connection is dropped.

Describe alternatives you've considered As mentioned above, implicit state management would also solve this issue, but we'd like the extra flexibility a callback provides.
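
For illustration, a minimal python_backend-style sketch of the pattern described above: per-sequence state keyed by correlation ID, plus an internal clock that evicts state once a client looks idle. The timeout value, state layout, and the python_backend framing are placeholders, not taken from the thread; the models discussed here run under the PyTorch backend, where any such bookkeeping would have to live inside the model itself.

```python
import time

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Per-sequence state with a model-internal idle-eviction clock (sketch)."""

    def initialize(self, args):
        self.state = {}           # correlation_id -> (last_seen_seconds, client_state)
        self.idle_seconds = 30.0  # placeholder idle timeout (seconds)

    def execute(self, requests):
        now = time.time()

        # Internal clock: drop state for clients that have been silent too long.
        for corr_id in [c for c, (seen, _) in self.state.items()
                        if now - seen > self.idle_seconds]:
            del self.state[corr_id]

        responses = []
        for request in requests:
            corr_id = request.correlation_id()
            _, client_state = self.state.get(corr_id, (now, None))
            # ... run the actual model with client_state and build output tensors ...
            self.state[corr_id] = (now, client_state)
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))
        return responses
```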

kthui commented 1 year ago

Thanks for submitting the feature request. I have filed a ticket for us to investigate further.

GuanLuo commented 1 year ago

From Triton's perspective, the sequence batcher groups requests based on the sequence ID (/ correlation ID) that is set as part of the request. In combination with the "start" control flag, it detects that a new sequence is being started even if it is already tracking a sequence with the same ID, which matches your case where the old client disconnected and a new client has started a new sequence. Do you think your model's checking can be extended to clean up internal state when it sees the start flag set (i.e. Triton is starting a new sequence at that batch slot)?
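
As a rough sketch of that suggestion, assuming the sequence batcher is configured with a CONTROL_SEQUENCE_START control input that reaches the model as a tensor named "START" (the name and the python_backend framing are illustrative, not from the thread):

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        self.state = {}  # correlation_id -> client state

    def execute(self, requests):
        responses = []
        for request in requests:
            corr_id = request.correlation_id()
            start = pb_utils.get_input_tensor_by_name(request, "START")
            if start is not None and bool(start.as_numpy().flat[0]):
                # Triton marked this request as the start of a new sequence,
                # so discard any stale state left over under this ID/slot.
                self.state.pop(corr_id, None)
            # ... run the model using self.state.get(corr_id) ...
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))
        return responses
```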

jamied157 commented 1 year ago

I'm not sure that's quite the case we're in here. This issue tends to arise when we have lots of sequences attached to a server: one disconnects (maybe due to a timeout or similar) and another is taken from the candidate sequence pool and made active. There shouldn't be any clashes of correlation IDs in this case, as Triton would evict a sequence we want to keep.

GuanLuo commented 1 year ago

Are you using "oldest sequence batching"? I think the case is easier for "direct sequence batching", where a batch slot is reserved for a sequence. So even if the correlation ID is different, when the "start" flag is set for a request in a given batch slot (now with a different correlation ID), the model should take that as a hint to start a new sequence at that batch slot.

For "oldest sequence batching", the coupling between sequence and batch slot is loosen, which, I think, is why the stateful backend reference implementation has this internal timeout to evict pending state that hasn't been active for a long time.

With this context, I think your ask for a callback is to avoid having to implement the timeout in the backend? Unfortunately, some kind of sequence timeout is still needed because Triton actually doesn't know when a client loses its connection (CC @tanmayv25); the sequence batcher will check for and evict stale sequences according to the timeout (max_sequence_idle_microseconds in the model config). Given that, informing the model to evict the sequence via a callback can be worked around by setting a model-internal timeout equal to max_sequence_idle_microseconds.
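
A sketch of that workaround: read the scheduler timeout out of the model config at load time so the model-internal eviction clock matches it. The snake_case field names in the parsed config and the 1000000-microsecond default are assumptions here, not confirmed in the thread.

```python
import json


class TritonPythonModel:
    def initialize(self, args):
        model_config = json.loads(args["model_config"])
        seq_cfg = model_config.get("sequence_batching", {})
        # int() handles the value whether it is serialized as a number or a string.
        idle_us = int(seq_cfg.get("max_sequence_idle_microseconds", 1000000))
        # Use the same idle threshold internally as the sequence scheduler, so the
        # model drops state at roughly the point the scheduler gives up on the sequence.
        self.idle_seconds = idle_us / 1_000_000
        self.state = {}

    # execute() would evict entries older than self.idle_seconds,
    # as in the earlier sketch.
```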

jamied157 commented 1 year ago

Yes, we're using the oldest strategy. We could potentially use the direct strategy; I haven't thought about it for a while. We'd need to make sure it doesn't affect the latency of the responses.

Given that, informing the model to evict the sequence via a callback can be worked around by setting a model-internal timeout equal to max_sequence_idle_microseconds.

I think that's pretty close to what we're doing at the moment, but it's seemed a bit more finicky than I'd like, and if we somehow queue up a sequence without having state slots available internally then we just have to reject that whole sequence.

If I call CloseStream() on the client side, maybe due to some error (so I don't send an END flag), will Triton queue up a new sequence? In that case, this internal timeout fix won't correctly evict state.

GuanLuo commented 1 year ago

will Triton queue up a new sequence?

Can you elaborate? IIRC, max_candidate_sequences caps the number of different sequences sent to the model at any given time. If more new sequences are sent to Triton once max_candidate_sequences is reached, they will be put in the backlog and wait for one of the candidate sequences to finish (properly, with END) or to be released by the Triton sequence scheduler due to the scheduler timeout.

I do agree that we are missing a way for Triton to inform the model to reset any state associated with a sequence. I think it could be an additional control flag, TERMINATED, where Triton would send a null request with that flag set, similar to how it sends a "not ready" request.
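
Purely hypothetical, since no such control exists in Triton today: if a TERMINATED flag were added and surfaced to the model as a boolean input tensor (the tensor name and delivery mechanism are invented here for illustration), model-side handling might look roughly like this:

```python
import triton_python_backend_utils as pb_utils


def maybe_release_state(request, state):
    # Hypothetical "TERMINATED" control tensor; not part of Triton today.
    terminated = pb_utils.get_input_tensor_by_name(request, "TERMINATED")
    if terminated is not None and bool(terminated.as_numpy().flat[0]):
        # Triton is telling us the sequence was dropped, so release its state
        # immediately instead of waiting for an internal idle timeout.
        state.pop(request.correlation_id(), None)
        return True   # nothing to infer for this null request
    return False
```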

jamied157 commented 1 year ago

Can you elaborate?

In the case where the client terminates a stream before sending an END flag, how does Triton deal with the requests it has queued for that client? Does it just wait for max_sequence_idle_microseconds, or does it mark that candidate slot as available?

The TERMINATED flag sounds like it would work well for us!

GuanLuo commented 1 year ago

How does Triton deal with the requests it has queued for that client? Does it just wait for max_sequence_idle_microseconds, or does it mark that candidate slot as available?

It will have to wait for max_sequence_idle_microseconds. There is no active health check from Triton to the client, so there is no way for Triton to know if something happens on the client side.

jamied157 commented 1 year ago

Okay, that makes sense. I think a TERMINATED flag would really help us, so this is probably a feature request for that.

rizwanishaq commented 8 months ago

I have a similar problem: when the max_sequence_idle_microseconds timeout happens, I don't know how to catch it inside a python_backend model.