Currently, for streaming, we let the user pass in a starting LM state. This may work in some cases, but there are some issues:
We start from an empty beam every time. This is fine if there is little correlation from chunk to chunk, but for streaming we ideally want a mode where the logits can be split into chunks without affecting the results. So we need an endpoint that takes the full beam information and preserves the scoring caches between calls until the user clears them.
The code that uses this will be a bit different, since the user has to manage more state objects. An alternative approach would be to save the state within the decoder object, so that the user only has to call initialize and clear functions or pass kwargs to the partial decode function. I'm not sure which is better here.
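To make the two options concrete, here is a minimal sketch using a toy CTC prefix beam search with no language model. All names in it (`initial_state`, `decode_chunk`, `best_prefix`, `StreamingDecoder`, `decode_partial`, `reset`) are hypothetical and not part of the existing decoder; in a real implementation the LM scoring caches would presumably travel with the beam state in the same way.

```python
from collections import defaultdict

BLANK = 0  # index of the CTC blank symbol (an assumption of this sketch)


def initial_state():
    """Beam state: prefix -> (prob ending in blank, prob ending in non-blank)."""
    return {(): (1.0, 0.0)}


def decode_chunk(probs, state, beam_width=8):
    """Advance the beams over one chunk; returns a new state, input untouched."""
    beams = state
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for sym, p in enumerate(frame):
                if sym == BLANK:
                    nxt[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == sym:
                    nxt[prefix][1] += p_nb * p          # repeated symbol collapses
                    nxt[prefix + (sym,)][1] += p_b * p  # blank separated the repeat
                else:
                    nxt[prefix + (sym,)][1] += (p_b + p_nb) * p
        ranked = sorted(nxt.items(), key=lambda kv: -(kv[1][0] + kv[1][1]))
        beams = {k: (v[0], v[1]) for k, v in ranked[:beam_width]}
    return beams


def best_prefix(state):
    """Most probable prefix in the current beam."""
    return max(state, key=lambda k: sum(state[k]))


class StreamingDecoder:
    """Option 2: the decoder owns the state; the user only calls reset()."""

    def __init__(self, beam_width=8):
        self.beam_width = beam_width
        self.reset()

    def reset(self):
        self._state = initial_state()

    def decode_partial(self, probs):
        self._state = decode_chunk(probs, self._state, self.beam_width)
        return best_prefix(self._state)


# Four frames that all favour symbol 1, split into two chunks of two frames.
probs = [[0.1, 0.8, 0.1]] * 4
chunks = (probs[:2], probs[2:])

# Reference: decode everything in one call.
full = decode_chunk(probs, initial_state())

# Option 1: the user threads the full beam state through each call.
state = initial_state()
for chunk in chunks:
    state = decode_chunk(chunk, state)

# Current behaviour (roughly): each chunk starts from an empty beam.
restarted = ()
for chunk in chunks:
    restarted += best_prefix(decode_chunk(chunk, initial_state()))

# Option 2: same result, with the state hidden inside the object.
decoder = StreamingDecoder()
for chunk in chunks:
    streamed = decoder.decode_partial(chunk)
```

In this toy case, carrying the beams across the boundary reproduces the single-call result exactly, while restarting from an empty beam emits the symbol twice because the repeat can no longer be collapsed across chunks. The explicit-state version keeps the decoder stateless and easy to run on several streams at once; the object-held version is simpler to call but needs one decoder instance (or one `reset()`) per stream.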
This problem was reported in an issue a while ago but hasn't been addressed yet.