I cloned the gpt-fast repo and tried it out with Llama-3.
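To set it up and then run generate with speculative decoding, I followed the standard gpt-fast workflow, roughly as below (the checkpoint paths, the draft model, and the `--speculate_k` value here are illustrative, not the exact values from my run):

```bash
# Download and convert the target and draft checkpoints
# (gpt-fast's prepare step; both model repos are illustrative).
export MODEL_REPO=meta-llama/Meta-Llama-3-8B
export DRAFT_MODEL_REPO=meta-llama/Llama-3.2-1B  # illustrative smaller draft model
./scripts/prepare.sh $MODEL_REPO
./scripts/prepare.sh $DRAFT_MODEL_REPO

# Run generate with speculative decoding and torch.compile.
python generate.py \
  --compile \
  --checkpoint_path checkpoints/$MODEL_REPO/model.pth \
  --draft_checkpoint_path checkpoints/$DRAFT_MODEL_REPO/model.pth \
  --speculate_k 5
```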
I get:

Acceptance probs: [0.06866952789699571, 0.03862660944206009, 0.055793991416309016, 0.06866952789699571, 0.030042918454935622, 0.7381974248927039]
Mean Accepted: 4.167381974248927
Average tokens/sec: 76.72
Memory used: 22.38 GB
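(Here "Mean Accepted" is just the expectation over those acceptance probabilities: 0·0.0687 + 1·0.0386 + 2·0.0558 + 3·0.0687 + 4·0.0300 + 5·0.7382 ≈ 4.17, so on most steps all 5 drafted tokens are accepted.)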
Which makes sense. I was playing around with the model and added just one line in the Transformer's forward method:

```python
def forward(self, idx: Tensor, input_pos: Optional[Tensor] = None) -> Tensor:
    assert self.freqs_cis is not None, "Caches must be initialized first"
    mask = self.causal_mask[None, None, input_pos]
    freqs_cis = self.freqs_cis[input_pos]
    x = self.tok_embeddings(idx)
    for i, layer in enumerate(self.layers):
        x = layer(x, input_pos, freqs_cis, mask)
    x = self.norm(x)
    self.inner_state = x  # NEW LINE
    logits = self.output(x)
    return logits
```
Now when I run generate with the same command, I get a really low acceptance rate:
Acceptance probs: [0.5620253164556962, 0.3632911392405063, 0.07088607594936709, 0.0037974683544303796, 0.0, 0.0]
Mean Accepted: 0.5164556962025316
Average tokens/sec: 24.87
Memory used: 22.12 GB
But if I don't pass `--compile`, I get the same acceptance rate as before:
Acceptance probs: [0.07142857142857142, 0.05102040816326531, 0.04591836734693878, 0.05102040816326531, 0.07142857142857142, 0.7091836734693877]
Mean Accepted: 4.127551020408164
Average tokens/sec: 24.03
Memory used: 22.10 GB
My question is: why does this one line cause such a drastic decline in quality when using compile? Here is the commit with the change in my fork: https://github.com/kalradivyanshu/gpt-fast/commit/20bd67360daf1e778f4ca1289cfbf12225c42be7
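A variant that might be worth testing (just a sketch, assuming the problem is tied to the stored tensor aliasing memory that torch.compile / CUDA graphs reuse between decoding steps) would be to store a copy instead of the live activation:

```python
x = self.norm(x)
# Hypothetical variant of the new line: keep a detached copy rather than
# the live intermediate, so self.inner_state cannot alias a buffer that
# the compiled region may overwrite on a later step.
self.inner_state = x.detach().clone()
logits = self.output(x)
```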
Any insights would really be appreciated. Thank you!