Roughly how much GPU memory is needed to compute saliency scores for llama-2-7b in half precision?

lancopku / label-words-are-anchors

Repository for Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

MIT License

Roughly how much GPU memory is needed to compute saliency scores for llama-2-7b in half precision? #11

Open zhiyunjiang opened 8 months ago

zhiyunjiang commented 8 months ago

As in the title.

leanwang326 commented 8 months ago

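Probably a bit over 14 GB. However, you may need to update the code, or add the following statement to attention_attr.py yourself:

for p in model.parameters():
     p.requires_grad = False

I forgot to add it earlier, so gradients were being computed for all of the model's parameters. With this statement added, memory usage is dominated by the model parameters, which is roughly 14 GB.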
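The 14 GB figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below is only an estimate: the parameter count of about 6.74e9 is an assumption not stated in the thread, and activations, the CUDA context, and allocator overhead are ignored.

```python
# Back-of-the-envelope estimate of weight memory in half precision.
# ASSUMPTION: ~6.74e9 parameters for llama-2-7b (approximate, not from the thread).
N_PARAMS = 6.74e9
BYTES_PER_PARAM_FP16 = 2  # float16 and bfloat16 both use 2 bytes per value


def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(N_PARAMS, BYTES_PER_PARAM_FP16)
# If parameter gradients were also kept (the situation before freezing),
# a same-sized fp16 gradient buffer would roughly double this figure.
weights_plus_grads_gb = 2 * weights_gb

print(f"weights only:         {weights_gb:.1f} GB")
print(f"weights + fp16 grads: {weights_plus_grads_gb:.1f} GB")
```

Under these assumptions the weights alone come close to the quoted figure, which is consistent with the claim that freezing the parameters leaves the weights as the main cost.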

zhiyunjiang commented 8 months ago

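Thanks. One more thing I'm unsure about: after adding this statement, does the subsequent loss.backward() skip the gradients of the model parameters entirely, or are they computed and then freed dynamically?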

leanwang326 commented 8 months ago

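I believe the parameter gradients are not computed, while the gradients of the intermediate activations are computed and then freed. The reason is that my implementation multiplies attention_weight by z = torch.ones_like(attention_weight), where z.requires_grad is True, and the saliency values are obtained from the gradient of z.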
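The behaviour described above can be reproduced on a toy example. This is a minimal sketch, not the repository's code: the single linear projection, the input shape, and the squared-mean loss are hypothetical placeholders chosen only to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (hypothetical): one projection layer and a random input.
proj = nn.Linear(8, 8)
for p in proj.parameters():
    p.requires_grad = False  # freeze parameters, as suggested in the thread

x = torch.randn(1, 5, 8)                       # (batch, seq, hidden)
q = k = v = proj(x)
scores = q @ k.transpose(-1, -2) / 8 ** 0.5
attention_weight = scores.softmax(dim=-1)      # (batch, seq, seq)

# The trick from the thread: an all-ones mask that requires grad.
z = torch.ones_like(attention_weight, requires_grad=True)
out = (attention_weight * z) @ v

loss = out.pow(2).mean()                       # placeholder loss
loss.backward()

# z is the only leaf that requires grad, so only z receives a gradient.
saliency = z.grad.abs()                        # elementwise |A * dL/dA|
print(saliency.shape)
print([p.grad for p in proj.parameters()])     # frozen parameters get no grad
```

Because the mask is all ones, the forward result is unchanged, and the gradient of the loss with respect to z equals the attention weights multiplied elementwise by their own gradient.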

noobimp commented 8 months ago

May I ask why the attention matrix A in Eq. (1) has a transpose? 🤔

leanwang326 commented 8 months ago

The transpose is a typo, sorry about that. See https://github.com/lancopku/label-words-are-anchors/issues/7 for details. I have also corrected it on arXiv.

noobimp commented 8 months ago

Thanks a lot!

ganchengguang commented 7 months ago

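How should attentioner_for_attribution.py be adapted to compute this for the llama2 model? Specifically, how should gpt2_attn and GPT2AttentionerManager be changed? Looking at the HF source, modeling_llama has no def _attn function the way gpt2 does, so I'm not sure how to apply the same approach. Could you give me some pointers?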

leanwang326 commented 7 months ago

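Thinking about it afterwards, my original implementation was more complicated than it needed to be. You can add a hook at attention_prob to capture its value and its gradient, and compute the saliency from those. (In our code we multiply attention_prob by z = torch.ones_like(attention_prob) and take the gradient of z, which is equivalent to attention_prob * attention_prob.grad.) In addition, during the forward pass you need to set attention_prob.requires_grad = True so that grad_fn gets registered.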
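The hook-based suggestion can be sketched as follows. This is an illustration under assumptions, not a patch for modeling_llama: toy_attention and the captured dictionary are hypothetical names, and in a real model the same lines would have to go wherever the attention probabilities are computed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

captured = {}  # filled in during forward/backward (hypothetical container)


def toy_attention(x: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
    """Hypothetical stand-in for an attention layer's forward pass."""
    q = k = v = proj(x)
    attention_prob = (q @ k.transpose(-1, -2) / x.shape[-1] ** 0.5).softmax(dim=-1)
    # With all parameters frozen this tensor is not part of any graph yet,
    # so mark it as requiring grad before it is used downstream.
    if not attention_prob.requires_grad:
        attention_prob.requires_grad_(True)
    captured["value"] = attention_prob.detach()
    # Tensor hook: called with dL/d(attention_prob) during backward.
    attention_prob.register_hook(lambda g: captured.__setitem__("grad", g.detach()))
    return attention_prob @ v


proj = nn.Linear(8, 8)
for p in proj.parameters():
    p.requires_grad = False

out = toy_attention(torch.randn(1, 5, 8), proj)
loss = out.pow(2).mean()  # placeholder loss
loss.backward()

# Same quantity as the ones-mask trick: attention_prob * attention_prob.grad.
saliency = (captured["value"] * captured["grad"]).abs()
print(saliency.shape)
```

The requires_grad_ call is guarded because it is only valid on a tensor that is not already part of a graph. When some upstream tensor already requires grad, the hook alone is enough.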

MidiyaZhu commented 7 months ago

Another issue: after adding

for p in model.parameters():
     p.requires_grad = False

the following loop fails with llama2-7b-chat-hf:

for idx, data in tqdm(enumerate(analysis_dataloader)):
     data = dict_to(data, model.device)
     print(data['input_ids'].shape)
     attentionermanger.zero_grad()
     output = model(**data, requires_grad=True)
     label = data['labels']
     loss = F.cross_entropy(output['logits'], label)
     loss.backward()

grad_fn is None for both output and loss, so loss.backward() cannot proceed. With gpt2-xl it works. The reason is probably that gpt2-xl, when executing the AttentionAdapter, enters _forward, which sets self.params = torch.ones_like(attn_weights, requires_grad=True) again, whereas llama2 has no corresponding function and therefore never enters gpt2_attn to activate it. Do you have any suggestions for how to modify this? Thanks.
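The grad_fn symptom can be reproduced without llama2. The sketch below uses a hypothetical two-layer toy model (an embedding followed by a linear head). It only illustrates the general autograd behaviour and does not diagnose the actual llama2 code path: once every parameter is frozen and the inputs are integer token ids, nothing in the forward pass requires grad, so no graph is built until some tensor that requires grad is injected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy model: token ids -> embedding -> logits.
emb = nn.Embedding(10, 4)
head = nn.Linear(4, 10)
for p in list(emb.parameters()) + list(head.parameters()):
    p.requires_grad = False

input_ids = torch.tensor([[1, 2, 3]])
label = torch.tensor([4])


def compute_loss(mask=None):
    h = emb(input_ids)
    if mask is not None:
        h = h * mask          # plays the role of the ones-mask on attn_weights
    logits = head(h)[:, -1]   # last-position logits, shape (batch, vocab)
    return F.cross_entropy(logits, label)


# 1) Everything frozen, integer inputs: no tensor requires grad, so no graph.
frozen_loss = compute_loss()
print(frozen_loss.requires_grad, frozen_loss.grad_fn)

# 2) Inject one tensor that requires grad: the graph is built again.
z = torch.ones(1, 3, 4, requires_grad=True)
loss = compute_loss(z)
print(loss.requires_grad, loss.grad_fn is not None)
loss.backward()
```

Checking loss.requires_grad before calling loss.backward() is a cheap way to tell whether the ones-mask (or an equivalent tensor that requires grad) was actually inserted during the forward pass.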

ganchengguang commented 7 months ago

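Same question here. I'm also stuck at the loss computation: F.cross_entropy raises an error. Moreover, some data samples trigger the error and others don't, and I don't know why.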

leanwang326 commented 7 months ago

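I've added two files, activation_analysis.py and numpy_writer.py, which support saving the intermediate results and intermediate gradients produced during computation (results from different steps are appended to a list). See demo_grad.py for a usage example.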

leanwang326 commented 7 months ago

grad_demo.py