For our eval gauntlet, modify the prompt to place "thinking placeholder tokens" (. or something) at the beginning and end of the prompt.
Control
prompt: <prompt>, response: <response>
Eval Treatment A (for various numbers of . characters)
.prompt: <prompt>, response: <response>
Eval Treatment B (for various numbers of . characters)
prompt: <prompt>, .response: <response>
Measure: eval gauntlet performance
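A minimal sketch of how the treated prompts could be constructed; the template strings, helper name, and sweep values are assumptions for illustration, not the actual eval-gauntlet harness code:

```python
def make_prompts(prompt, k):
    """Build control and treatment prompts with k '.' filler tokens.

    Hypothetical helper; the real harness templates may differ.
    """
    pad = "." * k
    control = f"prompt: {prompt}, response:"
    # Treatment A: filler tokens before the prompt (decompression tokens)
    treatment_a = f"{pad}prompt: {prompt}, response:"
    # Treatment B: filler tokens between prompt and response (planning tokens)
    treatment_b = f"prompt: {prompt}, {pad}response:"
    return control, treatment_a, treatment_b

# e.g. sweep k over several orders of magnitude
for k in (1, 4, 16, 64):
    control, treat_a, treat_b = make_prompts("What is 2+2?", k)
```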
What I expect could happen (pre-registering)
planning+decompression world (my prediction): treatment B > treatment A > control
The thinking tokens at the beginning of the sequence are used to "decompress" the model into the KV cache, which normally takes place in the background over a large number of tokens, and wouldn't take place at all if the prompt is too small. Since a bigger context means more FLOPs, adding thinking tokens to the beginning of the context will increase performance. However, adding thinking tokens to the middle of the context, where they benefit from seeing the prompt, allows the model to inflate its KV metamodel with more relevant data. I'll call the before-prompt thinking tokens decompression tokens and the after-prompt thinking tokens planning tokens.
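To make the "bigger context = more FLOPs" point concrete, here is a rough per-output-token FLOP count for the attention terms that grow with context length; the shapes and constants are simplified assumptions, and MLP/projection costs (constant in context length) are ignored:

```python
def attn_flops_per_output_token(n_ctx, d_model, n_layers):
    """Rough estimate of attention FLOPs spent per output token.

    Per layer: QK^T scores (~2 * n_ctx * d_model multiply-adds) plus
    the attention-weighted sum over V (~2 * n_ctx * d_model more).
    Both terms scale linearly in context length, so every filler token
    buys the model extra compute per output token.
    """
    return n_layers * 4 * n_ctx * d_model

# padding a 128-token context with 64 filler tokens adds
# n_layers * 4 * 64 * d_model FLOPs to every output token
base = attn_flops_per_output_token(128, 1024, 24)
padded = attn_flops_per_output_token(128 + 64, 1024, 24)
```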
decompression-only world (would be the most surprising and cool): Treatment B == treatment A > control
The effect often attributed to planning tokens is actually due to decompression tokens. FLOPs per output token is the dominant factor determining output token quality.
planning-only world (conventional wisdom): Treatment B > treatment A == control
indifferent world (also conventional wisdom): (treatment B == treatment A == control)
distracted world (softmax sensitivity too high): control > treatments (putting a bunch of unrelated tokens into the context window could make the model less accurate)
disconnected world (wrong . token): Treatment A > control > treatment B
In particular, separating the question and the answer with many dividing tokens will cause the positional embedding or ALiBi or whatever to mess up.
The consequence of the first two worlds would be that you can "grow" a small model (with no new data at all) to make it more capable.
Experiment YAML
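A hypothetical sketch of what the experiment config could look like; all keys, templates, and values below are placeholder assumptions, not an actual harness schema:

```yaml
# Hypothetical config sketch; keys and values are illustrative assumptions
experiment: thinking_placeholder_tokens
pad_token: "."
pad_counts: [1, 4, 16, 64, 256]
treatments:
  - name: control
    template: "prompt: {prompt}, response:"
  - name: treatment_a          # decompression tokens before the prompt
    template: "{pad}prompt: {prompt}, response:"
  - name: treatment_b          # planning tokens between prompt and response
    template: "prompt: {prompt}, {pad}response:"
measure: eval_gauntlet_performance
```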
Results