Open JoeNan1 opened 2 months ago
Can you print out aggregate_metrics['accept_counts'] and check whether it makes sense? accept_counts records how many token predictions from the draft model were accepted by the verifier model. If it's too low, you won't get much of a performance boost from speculative sampling.
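One way to sanity-check those counts is to turn them into an acceptance distribution and a mean number of accepted draft tokens per verification step. The sketch below is only illustrative: it assumes accept_counts is a list of per-run histograms in which entry k counts the verification steps where exactly k draft tokens were accepted. The helper name summarize_accept_counts and the sample numbers are hypothetical, not taken from the repository or from any real run.

```python
def summarize_accept_counts(accept_counts):
    """Summarize acceptance histograms from speculative sampling.

    Assumed layout (not confirmed by the source): one histogram per
    generation run, where entry k is the number of verification steps
    in which exactly k draft tokens were accepted.
    """
    # Add up the histograms across runs, position by position.
    totals = [sum(column) for column in zip(*accept_counts)]
    steps = sum(totals)
    if steps == 0:
        raise ValueError("accept_counts contains no verification steps")
    # Fraction of steps that accepted exactly k draft tokens.
    probs = [count / steps for count in totals]
    # Average number of draft tokens accepted per verification step.
    mean_accepted = sum(k * count for k, count in enumerate(totals)) / steps
    return probs, mean_accepted


# Made-up histograms for two runs, for illustration only.
example_counts = [[6, 2, 1, 1], [4, 3, 2, 1]]
probs, mean_accepted = summarize_accept_counts(example_counts)
print("acceptance distribution:", probs)
print("mean accepted per step:", mean_accepted)
```

If most of the probability mass sits at k = 0, the draft model rarely agrees with the verifier, and the extra draft passes are mostly wasted work.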
I tested the Speculative Sampling method with llama2-7b and llama2-70b on the A800, but the speedup was almost zero, and negative in most cases.
llama2-7b base: 103.25 tokens/s
llama2-7b Speculative Sampling: 104.52 tokens/s
llama2-70b base: 14.55 tokens/s
llama2-70b Speculative Sampling: 13.41 tokens/s
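To make the size of the effect explicit, the throughput numbers above can be converted into speedup ratios (speculative throughput divided by base throughput). This is a small sketch; the dictionary layout and variable names are just for illustration.

```python
# Throughput in tokens/s, copied from the measurements above.
throughput = {
    "llama2-7b": {"base": 103.25, "speculative": 104.52},
    "llama2-70b": {"base": 14.55, "speculative": 13.41},
}

# Speedup > 1.0 means speculative sampling helped; < 1.0 means it hurt.
speedups = {
    model: rates["speculative"] / rates["base"]
    for model, rates in throughput.items()
}

for model, ratio in speedups.items():
    print(f"{model}: {ratio:.3f}x ({(ratio - 1.0) * 100:+.1f}%)")
```

That works out to a gain of roughly 1% for llama2-7b and a loss of roughly 8% for llama2-70b.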