locuslab / wanda

A simple and effective LLM pruning approach.
https://arxiv.org/abs/2306.11695
MIT License
676 stars 91 forks

Memory and time requirements for Mistral-7B #68

Open NamburiSrinath opened 2 months ago

NamburiSrinath commented 2 months ago

Hi,

I am trying to prune Mistral-7B (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). While I was able to run magnitude pruning successfully, I am running into issues with SparseGPT and Wanda.

Commands used: python main.py --model 'mistralai/Mistral-7B-Instruct-v0.2' --prune_method sparsegpt --sparsity_ratio 0.1 --sparsity_type unstructured --save out/mistral_7b/unstructured/sparsegpt/0.1/ --save_model out/mistral_7b/unstructured/sparsegpt/0.1/

python main.py --model 'mistralai/Mistral-7B-Instruct-v0.2' --prune_method wanda --sparsity_ratio 0.1 --sparsity_type unstructured --save out/mistral_7b/unstructured/wanda/0.1/ --save_model out/mistral_7b/unstructured/wanda/0.1/

Any help here would be greatly appreciated :), tagging authors - @liuzhuang13 , @Eric-mingjie and @eltociear

NamburiSrinath commented 2 months ago

Update --- The error comes from initializing torch.zeros(); the traceback is below.

Traceback (most recent call last):                                                                                                                                                        
  File "/home/ubuntu/Compress_Align/wanda/main.py", line 113, in <module>                                                                                                                 
    main()                                                                                                                                                                                
  File "/home/ubuntu/Compress_Align/wanda/main.py", line 73, in main                                                                                                                      
    prune_sparsegpt(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)                                                                                                     
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context                                                 
    return func(*args, **kwargs)                                                                                                                                                          
  File "/home/ubuntu/Compress_Align/wanda/lib/prune.py", line 230, in prune_sparsegpt                                                                                                     
    inps = torch.zeros(                                                                                                                                                                   
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 GiB. GPU 1 has a total capacity of 21.99 GiB of which 16.77 GiB is free. Including non-PyTorch memory, this process has 5.21 GiB memory in use. Of the allocated memory 4.88 GiB is allocated by PyTorch, and 89.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Upon debugging further, here are the values of print(args.nsamples, model.seqlen, model.config.hidden_size):

Mistral-7B:  128, 32768, 4096
Llama-2-7B: 128, 4096, 4096

So the root cause is Mistral's very large sequence length (32768), which makes the calibration tensor too big to allocate on the GPU.
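The numbers line up with the error message. A minimal sketch of the arithmetic, assuming the buffer has shape (nsamples, seqlen, hidden_size) as the printed values suggest, and fp16 weights (2 bytes per element):

```python
# Rough size of the calibration buffer that prune_sparsegpt allocates with
# torch.zeros(nsamples, seqlen, hidden_size). The dtype (fp16, 2 bytes per
# element) is an assumption, but it reproduces the 32.00 GiB in the error.

def calib_buffer_gib(nsamples, seqlen, hidden_size, bytes_per_elem=2):
    """Return the buffer size in GiB for the given calibration shape."""
    return nsamples * seqlen * hidden_size * bytes_per_elem / 2**30

# Values from the debug output above:
print(calib_buffer_gib(128, 32768, 4096))  # Mistral-7B  -> 32.0 GiB
print(calib_buffer_gib(128, 4096, 4096))   # Llama-2-7B  ->  4.0 GiB
```

So Mistral's 8x longer sequence length inflates the buffer from 4 GiB to 32 GiB, well past the 21.99 GiB card capacity, while Llama-2-7B fits comfortably.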

Are there any suggestions to overcome this error?

P.S.: I think this issue is similar to #51, i.e. support for Mistral models.