panuthept / IRIS

Improving Robustness of LLMs on Input Variations by Mitigating Spurious Intermediate States
Apache License 2.0

[Analysis] Use TransformerLensGenerativeLLM to find important components in Qwen/Qwen2-0.5B-Instruct #12

Open panuthept opened 2 months ago

panuthept commented 2 months ago

Example script:

python scripts/benchmark_jailbreak_bench.py \
    --model_name Qwen/Qwen2-0.5B-Instruct \
    --intervention \
    --intervention_layers 19 20 21 22 \
    --max_tokens 512 

Vary intervention_layers over the range 0-23 to find the best combination.
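One way to sweep the layer combinations is to enumerate contiguous windows of layers and emit the benchmark command for each. A minimal sketch (the window sizes tried here are an assumption, not something prescribed by this issue):

```python
# Enumerate contiguous layer windows over layers 0-23 and print the
# corresponding benchmark command for each. Window sizes (1, 2, 4) are
# an illustrative choice, not from the repo.
N_LAYERS = 24  # Qwen2-0.5B-Instruct has 24 transformer blocks

def layer_windows(n_layers, sizes=(1, 2, 4)):
    """Yield contiguous layer ranges, e.g. [19, 20, 21, 22]."""
    for size in sizes:
        for start in range(n_layers - size + 1):
            yield list(range(start, start + size))

def command_for(layers):
    """Build the benchmark command line for one layer combination."""
    return (
        "python scripts/benchmark_jailbreak_bench.py "
        "--model_name Qwen/Qwen2-0.5B-Instruct "
        "--intervention "
        f"--intervention_layers {' '.join(map(str, layers))} "
        "--max_tokens 512"
    )

if __name__ == "__main__":
    for layers in layer_windows(N_LAYERS):
        print(command_for(layers))
```

Piping the printed commands into a shell (or a job scheduler) then runs the full sweep.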

Feel free to modify TransformerLensGenerativeLLM._generate() to speed things up. The current implementation generates the whole sequence, but we probably only need to predict the first token to compute the indirect effect.
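The speed-up idea above can be sketched with first-token logits only: run clean, corrupted, and patched forward passes for a single position and compare the target token's logit. The normalized formulation below is one common definition of the indirect effect, not necessarily the one used in this repo, and the toy logit lists stand in for real model outputs:

```python
# Sketch: indirect effect from first-token logits only, instead of
# generating a full 512-token sequence. `*_logits` are the first-token
# logit vectors from the clean run, the corrupted run, and the run with
# one component's activation patched back in (hypothetical inputs here).

def indirect_effect(clean_logits, corrupted_logits, patched_logits, target_id):
    """How much patching one component restores the clean behaviour,
    normalized by the clean-corrupted gap on the target token's logit:
    0 = no recovery, 1 = full recovery."""
    gap = clean_logits[target_id] - corrupted_logits[target_id]
    if gap == 0:
        return 0.0
    return (patched_logits[target_id] - corrupted_logits[target_id]) / gap
```

A component whose patch moves the corrupted logit most of the way back toward the clean logit scores close to 1 and is a candidate "important component".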

panuthept commented 2 months ago

Additionally, the current implementation of TransformerLensGenerativeLLM only supports patching the "mlp_post" and "mlp_pre" activations. Feel free to modify TransformerLensGenerativeLLM to also support patching the attention module.
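One low-effort way to add attention support is to extend the component-to-hook-name mapping. The attention names below follow TransformerLens conventions ("attn.hook_z" is the per-head output before W_O, "hook_attn_out" the block's attention output); the mapping function itself is a hypothetical sketch, not code from this repo:

```python
# Sketch: map a component name and layer index to a TransformerLens
# hook-point name, covering the existing MLP components plus two
# attention ones. The component keys are illustrative.
HOOK_TEMPLATES = {
    "mlp_pre": "blocks.{layer}.mlp.hook_pre",
    "mlp_post": "blocks.{layer}.mlp.hook_post",
    "attn_z": "blocks.{layer}.attn.hook_z",       # per-head output, before W_O
    "attn_out": "blocks.{layer}.hook_attn_out",   # attention output of the block
}

def hook_name(component: str, layer: int) -> str:
    """Resolve a component name to a TransformerLens hook-point string."""
    if component not in HOOK_TEMPLATES:
        raise ValueError(f"Unsupported component: {component!r}")
    return HOOK_TEMPLATES[component].format(layer=layer)
```

The resolved names can then be passed to TransformerLens's run_with_hooks as the hook targets, the same way the existing MLP patching presumably works.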