openai / automated-interpretability


Problem about activation calculation #9

Open Daftstone opened 1 year ago

Daftstone commented 1 year ago

I would like to know how neuron activations are calculated and how they are mapped to each input token. Alternatively, could you point me to related work on calculating neuron activations? I would be very grateful.
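For concreteness, here is a minimal sketch (not from this repo) of one common way to read per-token MLP neuron activations out of GPT-2 with HuggingFace `transformers`. The hook point, the output of the GELU between `c_fc` and `c_proj`, is an assumption about what "neuron activation" means here, and the layer/neuron indices are arbitrary examples:

```python
# Minimal sketch: per-token MLP neuron activations in GPT-2 via a forward hook.
# Hook point (post-GELU, before c_proj) and indices are assumptions, not repo code.
import torch
from transformers import GPT2Tokenizer, GPT2Model

layer, neuron = 5, 131  # arbitrary example (layer, neuron) indices

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

captured = {}
def hook(module, inputs, output):
    # output: [batch, seq_len, 4 * d_model] = post-GELU MLP hidden states
    captured["acts"] = output.detach()

# model.h[layer].mlp.act is the GELU between c_fc and c_proj in HF's GPT2MLP
handle = model.h[layer].mlp.act.register_forward_hook(hook)

tokens = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
handle.remove()

# One activation value per input token for the chosen neuron.
per_token_acts = captured["acts"][0, :, neuron]
for tok_id, act in zip(tokens["input_ids"][0], per_token_acts):
    print(f"{tokenizer.decode(int(tok_id))!r}: {act.item():.3f}")
```

This gives one scalar per input token for the chosen neuron, which is the token-level activation being asked about.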

JacksonWuxs commented 1 year ago

Yes, I have the same question regarding the calculation of token-level activations. It is not clear in either the paper or the code. If anyone could give some hints, I would also be very grateful.

JacksonWuxs commented 1 year ago

Dear authors,

I found that this section provides the definition of neuron-token-level connection weights. First, I want to confirm whether the word-neuron activation is extracted based on this section. I am confused because it seems that this activation does not take context information into account. Specifically, according to the expression `h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :]`, the output weight of a neuron (l, n) for a token t appears to be independent of all other tokens in the sequence.
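For reference, a rough sketch of what this connection weight looks like in HuggingFace GPT-2 terms. The mapping of `h{l}.mlp.c_proj.w[:, n, :]` to `c_proj.weight[n, :]` and of `ln_f.g` to `ln_f.weight` is my assumption, not taken from the repo:

```python
# Rough sketch of the weight-based neuron->token connection described above.
# Module-name mapping to HF GPT-2 is an assumption; indices are arbitrary examples.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

layer, neuron = 5, 131

with torch.no_grad():
    w_out = model.h[layer].mlp.c_proj.weight[neuron]  # [d_model] output direction of the neuron
    gain = model.ln_f.weight                          # [d_model] final layer-norm gain (ln_f.g)
    wte = model.wte.weight                            # [vocab, d_model] token embedding matrix

    # Connection weight of this neuron to every vocabulary token;
    # context-independent, since nothing here depends on the input sequence.
    logit_weights = wte @ (w_out * gain)              # [vocab]

top = torch.topk(logit_weights, 5)
print([tokenizer.decode(int(i)) for i in top.indices])
```

As the snippet makes explicit, the score for token t never looks at any other token, which is the source of the confusion above.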

I would greatly appreciate it if someone could address my confusion and provide clarification on this matter.

Best, Xuansheng

WuTheFWasThat commented 1 year ago

yes, that's right - it doesn't take context information into account. it would probably be better to use something activation-based instead of weight-based
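If it helps, one possible activation-based variant (my reading of that suggestion, not the authors' method), continuing from the two sketches above and reusing `per_token_acts`, `w_out`, `gain`, `wte`, and `tokenizer` defined there:

```python
# Scale the neuron's output direction by its activation on each token, so the
# resulting score depends on the actual input context rather than weights alone.
# Assumes the variables from the previous sketches are in scope.
with torch.no_grad():
    contrib = per_token_acts[:, None] * (w_out * gain)  # [seq_len, d_model]
    token_scores = contrib @ wte.T                      # [seq_len, vocab] logit contribution per position
    top_ids = token_scores.argmax(dim=-1)               # most-boosted vocab token at each position
print([tokenizer.decode(int(i)) for i in top_ids])
```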