locuslab / wanda

A simple and effective LLM pruning approach.
https://arxiv.org/abs/2306.11695
MIT License

GPU memory size recommended for pruning the llama2-7b-chat-hf model #44

Open rsong0606 opened 5 months ago

rsong0606 commented 5 months ago

Great work team!

Currently, I am pruning the llama2-7b-chat-hf model from Hugging Face with the following command:

python main.py \
    --model NousResearch/Llama-2-7b-chat-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type 2:4 \
    --save out/llama_7b-chat-hf/structured/wanda/

and got this error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 11.69 MiB is free. Including non-PyTorch memory, this process has 21.98 GiB memory in use. Of the allocated memory 20.84 GiB is allocated by PyTorch, and 61.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

My GPU specs are below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L4            On  | 00000000:00:03.0 Off |                    0 |
| N/A   52C    P8    17W /  72W |      0MiB / 23034MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
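As a side note, the error text itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce allocator fragmentation. A minimal sketch of doing that from Python, assuming the variable is set before the first CUDA allocation (it does not lower the total memory the run needs, it only helps with fragmentation):

```python
import os

# Must be set before PyTorch initializes the CUDA allocator.
# expandable_segments is a documented PyTorch allocator option for reducing fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var so the allocator picks it up
```

Equivalently, the variable can be exported in the shell before launching main.py.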

Eric-mingjie commented 5 months ago

I think you need at least 14GB GPU memory to load the 7b model in fp16.
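A rough back-of-the-envelope check of that figure, counting weights only (fp16, ignoring activations and the CUDA context itself):

```python
# ~7e9 parameters at 2 bytes each in fp16, weights only.
params = 7e9
bytes_per_param = 2
print(f"{params * bytes_per_param / 1024**3:.1f} GiB")  # ~13.0 GiB, i.e. roughly 14 GB
```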

rsong0606 commented 5 months ago

@Eric-mingjie Thanks Eric, mine has 24 GB of GPU memory. Given that at least 14 GB is used to load the model, I should still have ~10 GB left on the NVIDIA L4. Are there any extra activities taking more memory, and can we avoid them through the arguments?
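One way to see where the remaining ~10 GB goes is to instrument main.py with PyTorch's memory counters around model loading and pruning. This is only a sketch (the report helper below is hypothetical and not part of this repo; memory_allocated / max_memory_allocated / memory_summary are standard torch.cuda APIs):

```python
import torch

def report(tag, device=0):
    # Current and peak GPU memory held by PyTorch tensors, in GiB.
    cur = torch.cuda.memory_allocated(device) / 1024**3
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"[{tag}] allocated: {cur:.2f} GiB, peak: {peak:.2f} GiB")

# e.g. call report("after model load") and report("during pruning"),
# or print(torch.cuda.memory_summary()) for a full allocator breakdown.
```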

kast424 commented 5 months ago

Mine has 80 GB of GPU RAM (the NVIDIA A100 (and H100) GPUs in Stanage have 80 GB of GPU RAM), and I still got this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU

complete error for reference:

torch 2.3.0
transformers 4.41.0.dev0
accelerate 0.31.0.dev0

# of gpus: 1

loading llm model mistralai/Mistral-7B-Instruct-v0.2
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:12<00:24, 12.09s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:29<00:15, ...]
use device cuda:0
pruning starts
loading calibdation data
dataset loading complete
Traceback (most recent call last):
  File "/mnt/parscratch/users/acq22stk/teamproject/wanda/main.py", line 110, in <module>
    main()
  File "/mnt/parscratch/users/acq22stk/teamproject/wanda/main.py", line 69, in main
    prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
  File "/mnt/parscratch/users/acq22stk/teamproject/wanda/lib/prune.py", line 160, in prune_wanda
    outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
  File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/users/acq22stk/.conda/envs/prune_llm/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 754, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/users/acq22stk/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/users/acq22stk/.conda/envs/prune_llm/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 85, in forward
    hidden_states = hidden_states.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU
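Reading that traceback, the OOM happens inside prune_wanda while the calibration samples are pushed through a layer, not while loading the weights. If I read lib/prune.py correctly, the calibration inputs and outputs are held as fp16 buffers of shape (nsamples, seqlen, hidden) on the GPU alongside the fp16 weights, and RMSNorm temporarily upcasts activations to fp32. A rough estimate of those buffers, assuming 128 calibration samples, a 2048-token sequence length, and hidden size 4096 (these numbers are assumptions, not taken from the run above):

```python
# Hypothetical calibration-buffer estimate for a 7B-class model.
nsamples, seqlen, hidden, bytes_per_elem = 128, 2048, 4096, 2  # fp16
one_buffer = nsamples * seqlen * hidden * bytes_per_elem / 1024**3
print(f"inps: {one_buffer:.1f} GiB, inps + outs: {2 * one_buffer:.1f} GiB")  # ~2.0 / ~4.0 GiB
```

Weights plus these buffers plus per-layer attention scratch can push a 24 GB card over the edge; on an 80 GB card they should fit comfortably, which suggests something else in that job (another process, a longer sequence length, or a larger number of calibration samples) is holding memory.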

nehaprakriya commented 2 months ago

I have the same error with the Mixtral 8x7B model using 4 A6000 GPUs (48GiB memory per device).
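For Mixtral-8x7B the weights alone are the main constraint: roughly 47B parameters, so on the order of 90 GiB in fp16, which has to be sharded across the four A6000s before any calibration activations are added. A minimal sketch of loading it sharded with a per-device cap (standard transformers/accelerate arguments; the repo's own model-loading code may differ, and the max_memory values here are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: shard fp16 weights across 4 visible GPUs, leaving headroom on each
# 48 GiB A6000 for calibration activations during pruning.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={i: "40GiB" for i in range(4)},
)
```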