Previously reported in https://github.com/speechmatics/ctranslate2_triton_backend/issues/2#issuecomment-1546889761 by @aamir-s18
Streaming mode could be useful for very large models. It would help in real-time use cases, where we can improve the user experience by returning one token at a time.
Triton Server supports streaming with decoupled models.
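A minimal sketch of what a decoupled Python-backend model could look like, sending one response per decoded token. This assumes the model config enables `model_transaction_policy { decoupled: true }`; the output name `OUTPUT_TOKEN` and the `stream_tokens` helper are placeholders, not part of the existing backend:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            # stream_tokens() is a hypothetical helper wrapping the
            # CTranslate2 decoding loop (see the sketch further below).
            for token in stream_tokens(request):
                tensor = pb_utils.Tensor(
                    "OUTPUT_TOKEN",
                    np.array([token.encode("utf-8")], dtype=np.object_),
                )
                # Send one token back to the client immediately.
                sender.send(pb_utils.InferenceResponse(output_tensors=[tensor]))
            # Signal that no further responses follow for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None from execute(); responses go
        # through the response sender instead.
        return None
```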
It needs to be investigated how CTranslate2 can be used to get decoded tokens one by one. Additionally, this might be trickier in a beam decode setting, unless we are willing to always return the current best hypothesis, which could flip previously emitted words.
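For the CTranslate2 side, the Python API exposes a per-step `callback` on `translate_batch` that is invoked once per generated token, but only for greedy search (`beam_size=1`), which already hints at the beam-search limitation above. A minimal sketch, with the model directory and source tokens as placeholders:

```python
import ctranslate2

translator = ctranslate2.Translator("model_dir")  # placeholder path


def on_token(step_result):
    # step_result is a ctranslate2.GenerationStepResult; .token is the
    # newly decoded target token for this step.
    print(step_result.token, end=" ", flush=True)
    return False  # returning True would stop decoding for this entry


translator.translate_batch(
    [["▁Hello", "▁world"]],  # pre-tokenized source (placeholder)
    beam_size=1,             # the step callback only fires with greedy search
    callback=on_token,
)
```

With beam search the callback is not available, so streaming there would mean emitting the best hypothesis so far at each step and accepting that earlier tokens may be revised.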