Chlience opened 1 month ago
Different sentences within a batch may have varying acceptance lengths in speculative sampling [1,2], which requires careful padding, scheduling, or a custom kernel implementation. We don't support batch processing in our implementation yet; a small sketch of the problem follows the references below.
[1] Liu Xiaoxuan, et al. Optimizing Speculative Decoding for Serving Large Language Models Using Goodput. 2024.
[2] Qian Haifeng, et al. BASS: Batched Attention-optimized Speculative Sampling. 2024.
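For a concrete picture of the difficulty, here is a minimal sketch (plain PyTorch, not this repo's code; all numbers are made up) of what one batched verification step leaves you with: the accepted prefixes are ragged, so the batch has to be re-masked and re-padded before the next forward pass, and per-row position/KV-cache bookkeeping diverges.

```python
import torch

# Toy verification step in batched speculative decoding. Each of the
# 4 sequences was given k = 5 draft tokens, but verification accepts a
# different prefix length per sequence (illustrative values only).
batch_size, k = 4, 5
draft_tokens = torch.randint(0, 32000, (batch_size, k))  # proposed draft tokens
accept_len = torch.tensor([5, 2, 0, 3])                   # accepted prefix per sequence

# Keep the batch rectangular: truncate to the longest accepted prefix,
# mask out rejected positions, and right-pad with a pad id.
max_len = int(accept_len.max())
pad_id = 0                                                # placeholder pad token id
keep = torch.arange(max_len) < accept_len.unsqueeze(1)    # (batch, max_len) validity mask
accepted = draft_tokens[:, :max_len].masked_fill(~keep, pad_id)

# Each row now advances by a different number of positions, so position
# ids and KV-cache lengths diverge per sequence -- this is exactly what
# the padding/scheduling/custom-kernel work has to reconcile.
print(accepted)
print(accept_len)  # per-row offset for the next decoding step
```

The two references above tackle exactly this divergence, [1] through scheduling and [2] through batched attention kernels.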
I'm currently using this model for inference, and I would like to know how to generate inference results in batch mode. Specifically, I'm trying to avoid processing inputs one by one and instead process multiple inputs in a single forward pass for efficiency.
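For reference, this is the kind of batched call I'm aiming for; a sketch using a generic Hugging Face-style API with a placeholder checkpoint name, not necessarily how this repo exposes it:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["The capital of France is", "Speculative decoding works by"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# One generate call over the whole batch instead of a Python loop.
out = model.generate(**batch, max_new_tokens=32, do_sample=False)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```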
Could you please provide guidance or examples on how to batch multiple inputs and run them in a single forward pass?
Any advice, sample code, or references to the documentation would be greatly appreciated.
Thanks for your help!