triton-inference-server/server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Avoid duplicate state allocations #6215

Closed: david-macleod closed this issue 10 months ago

david-macleod commented 1 year ago

SequenceStates objects have separate allocations for input_states_ and output_states_: output_states_ is written to and input_states_ is read from. After a batch is executed they are swapped in SetStateUpdateCallback, so the updated state is correctly read in the next iteration and the next output is written to the allocation previously used for input_states_.
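A minimal sketch of that double-buffering scheme, for illustration only (the class shape, accessor names, and buffer type are assumptions; only input_states_, output_states_, and the swap-on-update behavior come from the issue):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative sketch of the double-buffered state described above.
// The real SequenceStates lives in the Triton core; everything here
// except the two-buffer-plus-swap idea is a placeholder.
class SequenceStates {
 public:
  explicit SequenceStates(std::size_t state_byte_size)
      : input_states_(state_byte_size), output_states_(state_byte_size) {}

  // The model reads the previous iteration's state from here...
  std::vector<char>& InputStates() { return input_states_; }
  // ...and writes the updated state here.
  std::vector<char>& OutputStates() { return output_states_; }

  // Called after a batch executes: the freshly written output becomes the
  // next iteration's input, and the old input buffer is reused for the next
  // output. Note that two full-size allocations stay live the whole time.
  void StateUpdateCallback() { std::swap(input_states_, output_states_); }

 private:
  std::vector<char> input_states_;
  std::vector<char> output_states_;
};
```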

My question is: why do we need both? When the states amount to multiple GB it is quite costly to maintain two copies. Naively, I would expect to be able to write the output to the same memory and then pass it back as the input, but is there some reason this should be avoided (perhaps an edge case)?

Put another way, if I were to make SetStateUpdateCallback a no-op and have input_states_ and output_states_ always point to the same memory, should I expect issues?
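In terms of the sketch above, the proposal amounts to this single-buffer variant (again only a sketch; whether it is safe depends on whether anything still needs the old state after the new one has been written):

```cpp
#include <cstddef>
#include <vector>

// Single-buffer variant of the earlier sketch: input and output alias the
// same allocation and the update callback becomes a no-op. This halves the
// memory footprint, but any reader that expects the previous state to
// survive while the new one is being written would observe corruption.
class SharedSequenceStates {
 public:
  explicit SharedSequenceStates(std::size_t state_byte_size)
      : states_(state_byte_size) {}

  std::vector<char>& InputStates() { return states_; }
  std::vector<char>& OutputStates() { return states_; }  // aliases the input

  void StateUpdateCallback() {}  // no-op: the output already is the next input

 private:
  std::vector<char> states_;
};
```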

krishung5 commented 1 year ago

Hi @david-macleod, thanks for bringing this up. I don't see a reason Triton needs both the input and output states to be available at the same time, so we may be able to use the same memory allocation for both. I've filed a ticket (DLIS-5335) for this optimization. Meanwhile, feel free to make updates to the code yourself; we encourage external contributions to this project!

Tabrizian commented 10 months ago

This has been added in the 23.11 release. Please see the following model configuration option for more details:

https://github.com/triton-inference-server/common/blob/c8ce7c7dba7903d8d17c5d80b0cc9781d1d1626d/protobuf/model_config.proto#L1403C10-L1403C42
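At the pinned commit, that line is the use_same_buffer_for_input_output field of the sequence-batching State message. A minimal config.pbtxt excerpt enabling it could look like the following; the tensor names, data type, and dims are placeholders for a model's actual state tensors, not part of the option itself:

```
sequence_batching {
  state [
    {
      input_name: "INPUT_STATE"    # placeholder state tensor names
      output_name: "OUTPUT_STATE"
      data_type: TYPE_FP32
      dims: [ -1 ]
      # Reuse one allocation for both the input and output state instead of
      # double-buffering, avoiding the duplicate allocation discussed above.
      use_same_buffer_for_input_output: true
    }
  ]
}
```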