pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

Reasons for the poor effect of Speculative Sampling #198

Open JoeNan1 opened 2 months ago

JoeNan1 commented 2 months ago

I tested the speculative sampling method with llama2-7b and llama2-70b on an A800, but the speedup was almost zero, and in most cases negative:

| model | mode | tokens/s |
| --- | --- | --- |
| llama2-7b | base | 103.25 |
| llama2-7b | speculative sampling | 104.52 |
| llama2-70b | base | 14.55 |
| llama2-70b | speculative sampling | 13.41 |

yanboliang commented 2 months ago

Can you print out `aggregate_metrics['accept_counts']` and check whether it makes sense? `accept_counts` records how many of the draft model's token predictions were accepted by the verifier model. If the acceptance rate is too low, you can't get much of a performance boost from speculative sampling.
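
For context: in the standard speculative decoding analysis (Leviathan et al.), if each draft token is accepted with probability α and you draft γ tokens per step, the expected number of tokens produced per verifier call is (1 − α^(γ+1)) / (1 − α), so a low acceptance rate quickly erases the gain while the draft-model overhead remains. Below is a minimal sketch of how one might summarize the counts, assuming `accept_counts` is a histogram where index `i` counts the decoding steps in which exactly `i` draft tokens were accepted (the numbers are placeholders, not measurements from this issue):

```python
# Placeholder histogram for speculate_k = 5:
# counts[i] = number of decoding steps where exactly i draft tokens were accepted.
counts = [12, 30, 41, 25, 9, 3]

total_steps = sum(counts)
acceptance_probs = [c / total_steps for c in counts]
mean_accepted = sum(i * c for i, c in enumerate(counts)) / total_steps

print(f"Acceptance probs: {acceptance_probs}")
print(f"Mean accepted draft tokens per step: {mean_accepted:.2f}")
```

If the mean accepted count comes out near zero, the draft model is contributing almost nothing and the extra draft forward passes will make end-to-end throughput worse than the baseline, which would be consistent with the numbers reported above.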