thunlp / Ouroboros

Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
Apache License 2.0

How to Perform Inference with Batch Processing. #5

Open Chlience opened 1 month ago

Chlience commented 1 month ago

I'm currently using this model for inference, and I would like to know how to generate inference results in batch mode. Specifically, I'm trying to avoid processing inputs one by one and instead process multiple inputs in a single forward pass for efficiency.

Could you please provide guidance or examples on how to:

  1. Structure inputs for batch processing.
  2. Modify the inference pipeline to handle batches.
  3. Optimize batch size for performance without running into memory issues.

Any advice, sample code, or references to the documentation would be greatly appreciated.

Thanks for your help!

Achazwl commented 1 month ago

Different sentences within a batch may have different acceptance lengths in speculative sampling [1,2], which requires careful padding, scheduling, or a custom kernel implementation. We haven't supported batch processing in our implementation yet.

[1] Liu Xiaoxuan, et al. Optimizing Speculative Decoding for Serving Large Language Models Using Goodput. 2024.
[2] Qian Haifeng, et al. BASS: Batched Attention-optimized Speculative Sampling. 2024.
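
To make the padding problem concrete, here is a minimal sketch (not part of the Ouroboros codebase; `accept_and_repad`, `draft_tokens`, `accept_len`, and `pad_id` are hypothetical names, and PyTorch is used only for illustration). After one batched verification step, each sequence accepts a different number of draft tokens, so the batch becomes ragged and has to be re-padded (or re-scheduled) before the next forward pass:

```python
import torch

def accept_and_repad(draft_tokens: torch.Tensor,  # (batch, k) proposed draft tokens
                     accept_len: torch.Tensor,    # (batch,) accepted prefix length per sequence
                     pad_id: int) -> torch.Tensor:
    """Keep only the accepted prefix of each sequence's draft and right-pad to a
    common length so the batch stays rectangular for the next forward pass."""
    batch, k = draft_tokens.shape
    max_len = int(accept_len.max())
    out = torch.full((batch, max_len), pad_id, dtype=draft_tokens.dtype)
    for i in range(batch):
        n = int(accept_len[i])
        out[i, :n] = draft_tokens[i, :n]
    return out

# Example: with k = 4 draft tokens per sequence, one sequence may accept all 4
# while another accepts only 1, leaving 3 padded slots that waste compute and
# complicate KV-cache bookkeeping.
drafts = torch.tensor([[11, 12, 13, 14],
                       [21, 22, 23, 24]])
accepted = torch.tensor([4, 1])
print(accept_and_repad(drafts, accepted, pad_id=0))
# tensor([[11, 12, 13, 14],
#         [21,  0,  0,  0]])
```

The padding slots above are exactly what a scheduling strategy or custom kernel would try to avoid, which is why the papers cited above treat batched speculative decoding as its own problem.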