openai / automated-interpretability

Problem about activation calculation #9

Open Daftstone opened 1 year ago

Daftstone commented 1 year ago

I would like to know how neuron activation is calculated and how it is mapped to each input token. Alternatively, if you could point me to related work on calculating neuron activations, I would be very grateful.

JacksonWuxs commented 1 year ago

Yes, I have the same question regarding the calculation of token-level activations. It is not clear from either the paper or the code. If anyone could give some hints, I would also be very grateful.

JacksonWuxs commented 1 year ago

Dear authors,

I found that this section defines the neuron-token connection weights. First, I want to confirm whether the word-level neuron activation is extracted as described in this section. I am confused because this quantity does not seem to take context into account. Specifically, according to the expression h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :], the output weight of neuron (l, n) for token t appears to be independent of the other tokens in the input.

I would greatly appreciate it if someone could address my confusion and provide clarification on this matter.

Best, Xuansheng
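To make the concern concrete, here is a minimal sketch of the quoted expression on toy data. The arrays `c_proj`, `ln_f_g` and `wte` are random stand-ins with assumed shapes, not real model weights, and `weight_based_score` is a hypothetical helper name. It only illustrates that the score depends on the pair (neuron, token id) and nothing else.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_neurons, vocab_size = 8, 16, 50

# Toy stand-ins for model weights (random values, NOT real parameters).
c_proj = rng.normal(size=(n_neurons, d_model))  # MLP output weights, one row per neuron
ln_f_g = rng.normal(size=d_model)               # final layer-norm gain
wte = rng.normal(size=(vocab_size, d_model))    # token embedding matrix


def weight_based_score(n, t):
    """Score of neuron n for token id t: c_proj[n] @ diag(ln_f_g) @ wte[t]."""
    # Multiplying by diag(g) is the same as an elementwise product with g.
    return float(c_proj[n] @ (ln_f_g * wte[t]))


# Token id 7 appears at position 1 in two different toy "sentences".
sentence_a = [3, 7, 12]
sentence_b = [40, 7, 5, 9]

score_in_a = [weight_based_score(2, t) for t in sentence_a][1]
score_in_b = [weight_based_score(2, t) for t in sentence_b][1]

# The surrounding tokens never enter the computation.
print(score_in_a == score_in_b)  # prints True
```

Because no hidden state is involved, the same token receives the same score in every sentence.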

WuTheFWasThat commented 1 year ago

yes, that's right - it doesn't take context information into account. it would probably be better to use something activation-based instead of weight-based
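As a rough illustration of what "activation-based" could mean, the sketch below computes per-token activations of one MLP neuron from context-dependent hidden states. Everything here is an assumption for illustration: the weights are random, `toy_hidden_states` is a hypothetical stand-in for the real attention layers (it just mixes in a causal running mean of the embeddings), and GELU is assumed as the nonlinearity. With a real model one would instead record the MLP activations during a forward pass.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


rng = np.random.default_rng(0)
d_model, n_neurons, vocab_size = 8, 16, 50

# Toy stand-ins for model weights (random values, NOT real parameters).
wte = rng.normal(size=(vocab_size, d_model))  # token embedding matrix
c_fc = rng.normal(size=(d_model, n_neurons))  # MLP input weights
b_fc = np.zeros(n_neurons)                    # MLP input bias


def toy_hidden_states(token_ids):
    """Hypothetical context mixing: embedding plus causal mean of the prefix."""
    emb = wte[token_ids]                                 # (seq_len, d_model)
    counts = np.arange(1, len(token_ids) + 1)[:, None]   # 1, 2, ..., seq_len
    prefix_mean = np.cumsum(emb, axis=0) / counts
    return emb + prefix_mean


def neuron_activations(token_ids, n):
    """One activation of neuron n per input position."""
    h = toy_hidden_states(token_ids)                     # depends on the context
    return gelu(h @ c_fc[:, n] + b_fc[n])                # (seq_len,)


acts_a = neuron_activations([3, 7, 12], n=2)
acts_b = neuron_activations([40, 7, 5, 9], n=2)

# Token id 7 sits at position 1 in both sequences but follows different tokens.
print(acts_a[1], acts_b[1])
```

Each position gets its own value, and the value for a given token changes with the tokens that precede it, which is the property the weight-based score lacks.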

msra-jqxu commented 2 months ago

Hi @JacksonWuxs, do you know how to calculate the activation? The formula 'h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :]' seems to be independent of the sample, so it would give the same value for the same word in different sentences. Is that right? Thanks!

JacksonWuxs commented 2 months ago

Well, per our earlier discussion, the initial implementation based on the expression h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :] is independent of the sample context. This is the basic idea of the logit lens. One simple but effective way to improve it, I guess, is to average the static word embeddings over a sequence, i.e. h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[T, :], where T = input_ids("a sequence of text").
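A minimal sketch of this averaging idea on toy data follows. The arrays are random stand-ins with assumed shapes, `sequence_score` and `token_score` are hypothetical helper names, and taking the mean over the rows of wte[T, :] is my reading of "average static word embeddings".

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_neurons, vocab_size = 8, 16, 50

# Toy stand-ins for model weights (random values, NOT real parameters).
c_proj = rng.normal(size=(n_neurons, d_model))  # MLP output weights, one row per neuron
ln_f_g = rng.normal(size=d_model)               # final layer-norm gain
wte = rng.normal(size=(vocab_size, d_model))    # token embedding matrix


def token_score(n, t):
    """Per-token score: c_proj[n] @ diag(ln_f_g) @ wte[t]."""
    return float(c_proj[n] @ (ln_f_g * wte[t]))


def sequence_score(n, token_ids):
    """Score of neuron n against the mean static embedding of a sequence."""
    mean_emb = wte[token_ids].mean(axis=0)      # average over wte[T, :]
    return float(c_proj[n] @ (ln_f_g * mean_emb))


T = [3, 7, 12]  # stands in for input_ids("a sequence of text")
seq = sequence_score(2, T)
per_token_mean = float(np.mean([token_score(2, t) for t in T]))

print(np.isclose(seq, per_token_mean))                    # prints True
print(np.isclose(seq, sequence_score(2, [12, 3, 7])))     # prints True
```

Since the expression is linear in the embedding, the sequence-level score equals the mean of the per-token scores and does not change when the tokens are reordered. It summarizes which tokens are present, but it still does not give a per-position value that reacts to the surrounding tokens.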

msra-jqxu commented 2 months ago

Hi @JacksonWuxs, thanks for your answer! I agree with your idea. Actually, I am curious how the activation value itself is obtained. Is it the h here? It seems that obtaining the activation value does not require the logit lens.

JacksonWuxs commented 2 months ago

Yes, in my opinion, you are right. The dot product between the neuron weights and the word embeddings is effectively the "activation value".