From what I can see in the Llama2 code on Hugging Face, the `attention_mask` and `position_ids` variables are never set by the model. As a result, `cache['attention_mask']` and `cache['position_ids']` are `None`, and the script fails on `lib/prune.py` line 144:
if f"model.layers.{i}" in model.hf_device_map: ## handle the case for llama-30B and llama-65B, when the device map has multiple GPUs;
dev = model.hf_device_map[f"model.layers.{i}"]
inps, outs, attention_mask, position_ids = inps.to(dev), outs.to(dev), attention_mask.to(dev), position_ids.to(dev)
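To make the failure mode concrete, here is a standalone repro of what happens at that line when a cache entry was never populated (illustrative code, not from the repo):

```python
# Illustrative repro, not repo code: calling .to() on a cache entry
# that the model never populated.
attention_mask = None   # what cache['attention_mask'] ends up as
dev = "cuda:0"          # any device works; the error is the same
attention_mask.to(dev)  # AttributeError: 'NoneType' object has no attribute 'to'
```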
Please note that I do not have access to GPUs with more than 40 GB of VRAM, and the 7B model does not fit in 40 GB for me, so I have to use a device map even for the 7B model, which is how I run into the failure described above.
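As a possible workaround, the unconditional `.to(dev)` calls could be made None-safe. A minimal sketch (the helper name `to_device` is mine, and I have not tested this against the repo):

```python
def to_device(t, dev):
    # None-safe version of tensor.to(dev): Llama2 leaves attention_mask and
    # position_ids unset, so these cache entries can legitimately be None.
    return t.to(dev) if t is not None else None

# The failing line in lib/prune.py could then become (sketch):
# inps, outs = inps.to(dev), outs.to(dev)
# attention_mask = to_device(attention_mask, dev)
# position_ids = to_device(position_ids, dev)
```

Of course, a proper fix might instead make sure the cache is populated correctly for Llama2 in the first place.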