Hi @acDante,
Thanks for reporting this. Working with `device_map` for big models when resources are tight might require a bit of fiddling, and you might want to experiment with manually designing one to minimize occupancy on the first device, e.g.:
```python
from collections import OrderedDict

# Keep the embeddings and the first three layers on GPU 0,
# and place all remaining modules on GPU 1.
device_map = OrderedDict(
    [("model.embed_tokens", 0)]
    + [(f"model.layers.{i}", 0) for i in range(3)]
    + [(f"model.layers.{i}", 1) for i in range(3, 32)]
    + [("model.norm", 1), ("lm_head", 1)]
)
```
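As a rough sketch of how such a manual map could be used (the checkpoint name is only an example, and forwarding `model_kwargs` to `inseq.load_model` is an assumption about how recent versions expose model-loading options):

```python
from transformers import AutoModelForCausalLM

# Example checkpoint; substitute the model you are attributing.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map=device_map,  # the manual map defined above
    torch_dtype="auto",
)

# With inseq, the same map should be forwardable at load time, assuming
# load_model accepts a model_kwargs dict (hypothetical usage):
# import inseq
# inseq_model = inseq.load_model(
#     "mistralai/Mistral-7B-v0.1", "attention",
#     model_kwargs={"device_map": device_map},
# )
```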
When you say that `generate` runs without problems, are you specifying `output_attentions=True` as a kwarg in that case? I suspect that storing all attention scores across all layers would take up a large amount of GPU memory in that case, too.
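For reference, a minimal sketch of what that comparison would look like (the checkpoint and prompt are placeholders): returning attentions during generation keeps one score tensor per layer and per generated step in memory, and the prefill step alone stores a `[batch, heads, seq, seq]` tensor per layer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, device_map="auto", torch_dtype=torch.float16
)

inputs = tokenizer("A very long prompt ...", return_tensors="pt").to(model.device)

# Asking generate() to return attentions keeps one score tensor per layer
# and per generated step; the first (prefill) step is quadratic in prompt length.
out = model.generate(
    **inputs,
    max_new_tokens=20,
    output_attentions=True,
    return_dict_in_generate=True,
)
# out.attentions: a tuple over generation steps, each a tuple over layers.
```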
Hi @gsarti, any tips like the above for using `sequential_integrated_gradients`? I'm using an NVIDIA L40S (46 GB) on which I can run inference for Mistral successfully, but I can't use inseq for my case. Could you please suggest any modifications? Thanks a lot!
Question
Hi @gsarti, I find that the `attribute()` function causes a CUDA out-of-memory error when the input length exceeds about 2500 tokens. I used Llama2-7b / Mistral-7b models to get attributions and chose `attention` as the attribution method. Other gradient-based or perturbation-based attribution methods may consume even more GPU memory and also run out of memory.

I tested the following code snippet on two A100-80GB GPUs. At some point during the attribution process, the GPU memory consumption becomes extremely high (much higher than just running `generate()`). Increasing the number of GPUs cannot resolve this issue, since the GPU memory consumption on the first GPU will always exceed 80 GB.

Is it possible to get attributions from long contexts with inseq?
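The exact snippet from the report isn't reproduced here; as a hypothetical sketch of the kind of call described (the checkpoint, texts, and generation settings are placeholders, assuming the standard `inseq.load_model` / `attribute` pattern):

```python
import inseq

# Placeholder checkpoint; the report used Llama2-7b / Mistral-7b
# with inputs longer than ~2500 tokens.
model = inseq.load_model("mistralai/Mistral-7B-v0.1", "attention")

out = model.attribute(
    input_texts="<a prompt of a few thousand tokens>",
    generation_args={"max_new_tokens": 50},
)
out.show()
```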
Additional context
Stack trace