Closed ChangWenhan closed 6 months ago
Hey,
Thank you for your interest! Unfortunately, I was sick, then on holiday, and am now preparing the camera-ready version of the paper (: Next month (hopefully), the whole library will finally be published alongside the paper!
To extract latent relevances, you can simply use a backward hook in PyTorch. For instance, in a Phi model the FFN consists of three layers: 1. linear, 2. activation function (the knowledge neurons), 3. linear.
If you want the relevance at the knowledge neurons, you must save the gradients at the output of (2).
```python
def append_backward_hooks_to_mlp(model):
    rel_layer = {}

    def generate_hook(layer_name):
        def backward_hook(module, input_grad, output_grad):
            # clone the relevance so it is not modified by LXT's
            # memory-optimized in-place operations, if those are used;
            # output_grad is a tuple, so take its first element
            rel_layer[layer_name] = output_grad[0].clone()
        return backward_hook

    # attach a hook to the last activation of every mlp layer
    for name, layer in model.model.named_modules():
        if name.endswith("mlp"):
            layer.activation_fn.register_full_backward_hook(generate_hook(name))

    return rel_layer
```
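To see the hook in action without downloading a full checkpoint, here is a minimal, self-contained sketch against a toy model that mimics the `model.model` → `...mlp.activation_fn` naming the hook code expects. The toy classes are hypothetical stand-ins, not the real Phi architecture; note that `register_full_backward_hook` passes gradients as tuples, so the hook clones the first element.

```python
import torch
import torch.nn as nn

# Hypothetical toy stand-in for a Phi-style decoder block: each block has
# an `mlp` submodule with fc1 -> activation_fn -> fc2.
class ToyMLP(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.activation_fn = nn.GELU()
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(self.activation_fn(self.fc1(x)))

class ToyBlock(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.mlp = ToyMLP(d_model, d_ff)

    def forward(self, x):
        return x + self.mlp(x)

class ToyModel(nn.Module):
    def __init__(self, d_model=8, d_ff=32, n_layers=2):
        super().__init__()
        # named `model` so that `model.model.named_modules()` works as above
        self.model = nn.Sequential(*[ToyBlock(d_model, d_ff) for _ in range(n_layers)])

    def forward(self, x):
        return self.model(x)

def append_backward_hooks_to_mlp(model):
    rel_layer = {}

    def generate_hook(layer_name):
        def backward_hook(module, input_grad, output_grad):
            # output_grad is a tuple; clone so LXT in-place ops cannot modify it
            rel_layer[layer_name] = output_grad[0].clone()
        return backward_hook

    for name, layer in model.model.named_modules():
        if name.endswith("mlp"):
            layer.activation_fn.register_full_backward_hook(generate_hook(name))

    return rel_layer

model = ToyModel()
rel_layer = append_backward_hooks_to_mlp(model)

x = torch.randn(1, 4, 8, requires_grad=True)  # (batch, seq, d_model)
out = model(x)
out[0, -1].sum().backward()  # backpropagate from the chosen output position

for name, rel in rel_layer.items():
    print(name, tuple(rel.shape))  # relevance at the knowledge neurons, (1, 4, 32)
```

With a real model you would instead backpropagate from the logit of interest (e.g. the top predicted token) and read the captured tensors out of `rel_layer` afterwards.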
Hope it helps, Reduan
Thank you for your response, and wishing you a speedy recovery! We will continue to follow your work and the LXT library, as they have been of great assistance to us.
Wishing you all the best, Wenhan
Hello! The code you've published runs smoothly, but how can we extract the relevance of neurons in different FFN layers of a large language model using the code in this repository?
Also, we are looking forward to the Latent Feature Visualization and Perturbation Evaluation parts.
Thank you!