Open kamilakesbi opened 2 months ago
Cool, looking forward to the PR! The last time a contributor worked on this, there were issues with latency due to excessive padding (https://github.com/huggingface/transformers/issues/29769); I hope your PR solves them :)
Feature request
Speculative decoding isn't currently enabled for batch sizes >1. PR #26875 was previously opened to add this feature to main, but was never merged. Since that PR is quite old and has been closed, I'm opening this issue to motivate adding the feature to Transformers.
Two approaches could be implemented to enable speculative decoding with batch sizes >1:
The first approach is simple but rather naive, since some valid tokens would have to be regenerated during successive iterations of the assistant model.
The second approach keeps all the valid tokens at each step, so none of them would need to be regenerated during future iterations of the assistant model.
The second approach is the better one, and PR #26875 already implements most of it, so IMO we should focus on that one.
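To make the difference concrete, here is a minimal sketch (plain Python, with hypothetical helper names — not the actual PR #26875 implementation) of the greedy verification step in speculative decoding, extended to a batch. Each sequence keeps the longest prefix of the assistant's candidate tokens confirmed by the main model, plus one "bonus" token from the main model:

```python
def accepted_length(candidate, target):
    """Number of leading candidate tokens confirmed by the target model."""
    n = 0
    for c, t in zip(candidate, target):
        if c != t:
            break
        n += 1
    return n


def verify_batch(candidates, targets):
    """Second approach: each sequence keeps *all* of its validated tokens.

    candidates[i]: tokens proposed by the assistant model for sequence i
    targets[i]:    tokens the main model emits at the same positions
                   (one longer than candidates[i], so a bonus token is
                   always available)
    Returns the per-sequence tokens appended in this step.
    """
    kept = []
    for cand, tgt in zip(candidates, targets):
        n = accepted_length(cand, tgt)
        # accepted prefix + one bonus token from the main model
        kept.append(tgt[: n + 1])
    return kept


candidates = [
    [5, 8, 3, 9],   # sequence 0: all 4 candidate tokens match
    [5, 7, 3, 9],   # sequence 1: mismatch at position 1
]
targets = [
    [5, 8, 3, 9, 2],
    [5, 6, 0, 0, 0],
]
print(verify_batch(candidates, targets))
# → [[5, 8, 3, 9, 2], [5, 6]]
```

Sequence 0 advances by 5 tokens while sequence 1 advances by 2. Under the first (naive) approach, both sequences would be cut back to the shortest accepted length, and sequence 0's extra valid tokens would have to be regenerated; the second approach keeps them, at the cost of tracking per-sequence lengths (i.e. padding bookkeeping) in the batch.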
How to reproduce:
Current output:
Expected output:
Your contribution
I've started iterating on a solution and will open a PR soon :)
cc @sanchit-gandhi @ylacombe @gante