openai / sparse_autoencoder


question about ablation sparsity #11

Closed lukaemon closed 4 months ago

lukaemon commented 4 months ago

In section 4.5:

This process leads to V logit differences per ablation and affected token, where V is the size of the token vocabulary. Because a constant difference at every logit does not affect the post-softmax probabilities, we subtract at each token the median logit difference value. Finally, we concatenate these vectors together across some set of T future tokens (at the ablated index or later) to obtain a vector of V · T total numbers.

Since there are V logit diffs per ablation and affected token, shouldn't the aggregated tensor have shape (V, n_ablation, T) rather than (V, T)? Or is there an implicit aggregation over ablations, e.g. mean/max? Thx!
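
To make sure I'm reading the paragraph right, here is a minimal sketch of my understanding of the per-ablation aggregation (PyTorch; the function name and shapes are my own for illustration, not from the repo):

```python
import torch

def ablation_effect_vector(logit_diffs: torch.Tensor) -> torch.Tensor:
    """Sketch of the section 4.5 aggregation for a SINGLE ablation.

    logit_diffs: (T, V) logit differences at T future tokens
    (the ablated index or later), V = vocabulary size.
    """
    # Subtract each token's median logit difference: a constant shift
    # across all V logits leaves post-softmax probabilities unchanged.
    centered = logit_diffs - logit_diffs.median(dim=-1, keepdim=True).values
    # Concatenate across the T future tokens -> V * T numbers.
    return centered.flatten()

T, V = 4, 50257  # hypothetical: 4 future tokens, GPT-2 vocab size
effect = ablation_effect_vector(torch.randn(T, V))
assert effect.shape == (T * V,)
```

Under this reading, each ablation yields its own V·T vector, so stacking across ablations would give (n_ablation, V·T), which is what prompts my question about the (V, T) description.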