thunlp / Ouroboros

Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
Apache License 2.0

How to Perform Inference with Batch Processing. #5

Open Chlience opened 1 month ago

Chlience commented 1 month ago

I'm currently using this model for inference, and I would like to know how to generate inference results in batch mode. Specifically, I'm trying to avoid processing inputs one by one and instead process multiple inputs in a single forward pass for efficiency.

Could you please provide guidance or examples on how to:

  1. Structure inputs for batch processing.
  2. Modify the inference pipeline to handle batches.
  3. Optimize batch size for performance without running into memory issues.

Any advice, sample code, or references to the documentation would be greatly appreciated.

Thanks for your help!

Achazwl commented 1 month ago

Different sentences within a batch may have different acceptance lengths in speculative sampling [1,2], which requires careful padding, scheduling, or a custom kernel implementation. We haven't supported batch processing in our implementation yet.

[1] Liu Xiaoxuan, et al. Optimizing Speculative Decoding for Serving Large Language Models Using Goodput. 2024.
[2] Qian Haifeng, et al. BASS: Batched Attention-optimized Speculative Sampling. 2024.
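
To make the padding problem concrete, here is a minimal sketch (not part of the Ouroboros codebase; `accept_and_repad`, `draft_tokens`, `accept_len`, and `pad_id` are hypothetical names, and PyTorch is used only for illustration). After one batched verification step, each sequence accepts a different number of draft tokens, so the batch becomes ragged and has to be re-padded (or re-scheduled) before the next forward pass:

```python
import torch

def accept_and_repad(draft_tokens: torch.Tensor,  # (batch, k) proposed draft tokens
                     accept_len: torch.Tensor,    # (batch,) accepted prefix length per sequence
                     pad_id: int) -> torch.Tensor:
    """Keep only the accepted prefix of each sequence's draft and right-pad to a
    common length so the batch stays rectangular for the next forward pass."""
    batch, k = draft_tokens.shape
    max_len = int(accept_len.max())
    out = torch.full((batch, max_len), pad_id, dtype=draft_tokens.dtype)
    for i in range(batch):
        n = int(accept_len[i])
        out[i, :n] = draft_tokens[i, :n]
    return out

# Example: with k = 4 draft tokens per sequence, one sequence may accept all 4
# while another accepts only 1, leaving 3 padded slots that waste compute and
# complicate KV-cache bookkeeping.
drafts = torch.tensor([[11, 12, 13, 14],
                       [21, 22, 23, 24]])
accepted = torch.tensor([4, 1])
print(accept_and_repad(drafts, accepted, pad_id=0))
# tensor([[11, 12, 13, 14],
#         [21,  0,  0,  0]])
```

The padding slots above are exactly what a scheduling strategy or custom kernel would try to avoid, which is why the papers cited above treat batched speculative decoding as its own problem.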