speechmatics / ctranslate2_triton_backend

Triton backend for https://github.com/OpenNMT/CTranslate2
MIT License

Investigate support for streaming mode #6

Open HennerM opened 1 year ago

HennerM commented 1 year ago

Previously reported in https://github.com/speechmatics/ctranslate2_triton_backend/issues/2#issuecomment-1546889761 by @aamir-s18

Streaming mode could be useful for very large models. In real-time use cases it can improve the user experience by returning tokens one at a time as they are generated, instead of waiting for the full output.

Triton Inference Server supports streaming through decoupled models, which can send multiple responses for a single request.

We need to investigate how CTranslate2 can be used to retrieve decoded tokens one by one. This may be trickier with beam search decoding, unless we are willing to always return the current best guess, which could change previously emitted words.
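
To make the beam search concern concrete, here is a minimal sketch of one possible way to avoid flipping words: only emit a token once every live beam agrees on it. This is not CTranslate2's API or anything this backend implements. `beam_steps`, `common_prefix`, and `stream_stable_tokens` are hypothetical names, and the decoder is replaced by a plain list of per-step hypotheses. It also assumes the first hypothesis of the last step is the best-scoring one and ignores beams that finish early.

```python
from typing import Iterable, Iterator, List, Sequence


def common_prefix(hypotheses: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest token prefix shared by every hypothesis."""
    if not hypotheses:
        return []
    prefix: List[str] = []
    # zip stops at the shortest hypothesis, so the prefix never overruns one.
    for column in zip(*hypotheses):
        if any(token != column[0] for token in column):
            break
        prefix.append(column[0])
    return prefix


def stream_stable_tokens(
    beam_steps: Iterable[Sequence[Sequence[str]]],
) -> Iterator[str]:
    """Yield tokens as soon as all live beams agree on them.

    `beam_steps` is a hypothetical stand-in for a decoder: each item is the
    list of live beam hypotheses after one decoding step, best one first.
    """
    emitted = 0
    last_beams: Sequence[Sequence[str]] = []
    for beams in beam_steps:
        last_beams = beams
        stable = common_prefix(beams)
        # Emit only the part of the agreed prefix not sent yet.
        for token in stable[emitted:]:
            yield token
        emitted = max(emitted, len(stable))
    # Decoding is over: flush the rest of the (assumed) best hypothesis.
    if last_beams:
        for token in last_beams[0][emitted:]:
            yield token


# Toy example with two beams per step.
steps = [
    [["the"], ["a"]],
    [["the", "cat"], ["the", "dog"]],
    [["the", "cat", "sat"], ["the", "cat", "sits"]],
]
streamed = list(stream_stable_tokens(steps))
print(streamed)
```

The trade-off is latency: nothing is sent while the beams disagree, so in the worst case most tokens arrive only at the final flush. With greedy decoding (a single beam) every token is stable immediately and this reduces to plain token-by-token streaming.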

aamir-s18 commented 1 year ago

I think this here is relevant for this issue.