jbloomAus / DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks
https://jbloomaus-decisiontransformerinterpretability-app-4edcnc.streamlit.app/

Implement AVEC in the interpretability app #72

Closed. jbloomAus closed this issue 1 year ago.

jbloomAus commented 1 year ago

https://github.com/montemac/algebraic_value_editing/blob/main/scripts/basic_functionality.py

Implement it as part of DTI

jbloomAus commented 1 year ago

Questions:

They run a two-token forward pass. I think I should run a one-token forward pass since I don't have an EOS token. I do have padding tokens, though, so I could pad the inputs with those to do the forward pass. I would then need to keep the activations aligned position-wise as we step forward, which seems doable.
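A rough sketch of that padding idea, assuming a tokens-based variant of the get_resid_pre helper quoted below (PAD_TOKEN and the tensor names here are hypothetical, not DTI's real API):

import torch

PAD_TOKEN = 0  # assumption: stand-in for whatever padding token id DTI actually uses

def pad_to_length(tokens: torch.Tensor, length: int, pad_token: int = PAD_TOKEN) -> torch.Tensor:
    # tokens: (batch, pos); right-pad with the padding token up to `length`
    n_pad = length - tokens.shape[1]
    if n_pad <= 0:
        return tokens
    padding = torch.full((tokens.shape[0], n_pad), pad_token, dtype=tokens.dtype, device=tokens.device)
    return torch.cat([tokens, padding], dim=1)

# Pad both contrast inputs to a common length so the cached residual
# streams subtract position-by-position.
max_len = max(tokens_add.shape[1], tokens_sub.shape[1])
act_diff = (get_resid_pre(pad_to_length(tokens_add, max_len), layer)
            - get_resid_pre(pad_to_length(tokens_sub, max_len), layer))

For reference, the relevant pieces from the linked script: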

def get_resid_pre(prompt: str, layer: int):
    # Cache and return the residual stream just before the given block.
    name = f"blocks.{layer}.hook_resid_pre"
    cache, caching_hooks, _ = model.get_caching_hooks(lambda n: n == name)
    with model.hooks(fwd_hooks=caching_hooks):
        _ = model(prompt)
    return cache[name]

def ave_hook(resid_pre, hook):
    if resid_pre.shape[1] == 1:
        return  # caching in model.generate for new tokens

    # We only add to the prompt (first call), not the generated tokens.
    ppos, apos = resid_pre.shape[1], act_diff.shape[1]
    assert apos <= ppos, f"More mod tokens ({apos}) than prompt tokens ({ppos})!"

    # add to the beginning (position-wise) of the activations
    resid_pre[:, :apos, :] += coeff * act_diff
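
For context, a minimal usage sketch in the style of the linked script (the prompts, layer, and coefficient are illustrative only; the two contrast prompts need to tokenize to the same length for the subtraction to line up):

coeff = 5.0
layer = 6

# Steering vector: position-wise difference of the contrast prompts'
# residual streams at the chosen layer.
act_diff = get_resid_pre("Happy", layer) - get_resid_pre("Sad", layer)

# Generate with the addition hook attached at the same hook point.
editing_hooks = [(f"blocks.{layer}.hook_resid_pre", ave_hook)]
with model.hooks(fwd_hooks=editing_hooks):
    text = model.generate("I feel", max_new_tokens=20)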
jbloomAus commented 1 year ago

How do I apply this procedure in DTI?

I think the easiest way for me to do this is to do it with the AVEC code. I could parameterise it so we can get more info on the outcomes (e.g. layer, head, etc.).
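
A minimal sketch of what that parameterisation could look like (hypothetical: for DTI the inputs would be state/RTG/action tokens rather than a prompt string, and act_diff would come from contrasting trajectories rather than contrasting prompts):

def make_ave_hook(act_diff, coeff: float):
    # Bake the steering vector and coefficient into the hook rather than
    # reading globals, so they can be swept over.
    def ave_hook(resid_pre, hook):
        apos = act_diff.shape[1]
        if resid_pre.shape[1] < apos:
            return  # e.g. single-token generation steps
        resid_pre[:, :apos, :] += coeff * act_diff
    return ave_hook

def run_with_avec(model, inputs, layer: int, act_diff, coeff: float):
    # Parameterised entry point: sweep over layer/coeff to collect the
    # per-layer outcome information mentioned above.
    hook_name = f"blocks.{layer}.hook_resid_pre"
    with model.hooks(fwd_hooks=[(hook_name, make_ave_hook(act_diff, coeff))]):
        return model(inputs)

Swapping the hook point name for a per-head one (e.g. blocks.{layer}.attn.hook_z) could then cover the head-level variant.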