locuslab / wanda

A simple and effective LLM pruning approach.
https://arxiv.org/abs/2306.11695
MIT License
677 stars · 91 forks

Support for LLaMA-2 #23

Open junzhang-zj opened 1 year ago

junzhang-zj commented 1 year ago

I couldn't reach 'allenai/c4' on the Hub.

junzhang-zj commented 1 year ago

I have solved the data problem, but I ran into a new one. I used wanda to prune LLaMA-2-13B and got a zero ROUGE-2 score on CNN/DM; my C4 perplexity with unstructured pruning is as high as 56050.3008.

Eric-mingjie commented 1 year ago

Hi, we just updated the repo supporting pruning LLaMA-2 model, see here for the corresponding command. We also provide the results from our own run.

junzhang-zj commented 1 year ago

@Eric-mingjie Thanks!

junzhang-zj commented 1 year ago

@Eric-mingjie Could the perplexity results depend on the environment? I still get poor results on LLaMA-2.

Eric-mingjie commented 1 year ago

For LLaMA-2, I used the transformers library at version 4.34.0.dev0 to load the models, specifically commit 0a55d9f7376f72ad3ff296d4249840021b03bcc4 on the main branch. What ppl number do you get?

junzhang-zj commented 1 year ago

My environment is transformers 4.34.0.dev0 and accelerate 0.24.0.dev0. I get ppl 146760.7188, and now a lot of CUDA errors.

Eric-mingjie commented 1 year ago

Hmm, can you load the dense llama-2-7b model and test its perplexity? In that case you can simply pass --sparsity_ratio 0 to skip pruning entirely.
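A minimal sketch of that sanity check, assuming the argument names from wanda's README (--model, --prune_method, --sparsity_type); the model name here is just an example:

```shell
# Hypothetical dense-model sanity check: with --sparsity_ratio 0 no weights
# are pruned, so this just measures the unpruned model's perplexity.
python main.py \
    --model meta-llama/Llama-2-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0 \
    --sparsity_type unstructured
```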

junzhang-zj commented 1 year ago

OK, I will try that and check.

Eric-mingjie commented 1 year ago

This is the output of `conda env export` from the conda environment I am running; hope it helps. https://gist.github.com/Eric-mingjie/4ca851c64144d53800d60e4c74ebfbaf

junzhang-zj commented 1 year ago

@Eric-mingjie I get ppl wikitext_train 5.171178340911865 and wikitext_test 4.883730888366699 on Llama-2-13b with no pruning.

junzhang-zj commented 1 year ago

I think the key question is why `wrapped_layers[name].scaler_row` stays an all-zero tensor, which makes the pruning metric fail. Have you run into this? It looks like something is wrong with the hook.

junzhang-zj commented 1 year ago

😭, I finally found the bug: we need to set `pretraining_tp` to 1; otherwise the forward pass bypasses the hooked modules and the callback never fires. ppl of llama-2-13b (4:8): wikitext_train 7.27443265914917, wikitext_test 7.004149913787842.
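The fix can be sketched like this (the helper name is hypothetical; the gist is that when `pretraining_tp > 1`, transformers' LLaMA layers slice their weights manually instead of calling the plain `nn.Linear` forward, so forward hooks used for calibration never run):

```python
from types import SimpleNamespace

def force_single_tp(config):
    """Force pretraining_tp to 1 so that LLaMA-2's attention/MLP modules
    take the ordinary nn.Linear forward path and calibration hooks fire."""
    if getattr(config, "pretraining_tp", 1) != 1:
        config.pretraining_tp = 1
    return config

# Hypothetical usage with transformers (not executed here):
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
# force_single_tp(model.config)

# Quick check with a stand-in config object:
cfg = force_single_tp(SimpleNamespace(pretraining_tp=2))
print(cfg.pretraining_tp)  # 1
```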

Eric-mingjie commented 1 year ago

That's good to know. I was starting to rerun the code on my end.

simlaharma commented 10 months ago

> I couldn't reach 'allenai/c4' on the Hub.

Hello @junzhang-zj, how did you solve the data problem? I get the following message:

ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']

I changed the code for the c4 data to the following:

traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

Then, I started getting the following error:

File "/simla/wanda/lib/data.py", line 48, in get_c4
  traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
  builder_instance.download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
  self._download_and_prepare(
File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1118, in _download_and_prepare
  verify_splits(self.info.splits, split_dict)
File "/home/.local/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 92, in verify_splits
  raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}

I tried downloading with:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"

After downloading the whole dataset, I needed to change the load_dataset call to point at the local files, so I did the following:

traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
valdata = load_dataset('/simla/wanda/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation', trust_remote_code=True)

Now I am getting the following error:

Failed to read file '/simla/wanda/c4/en/c4-train.00000-of-01024.json.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Generating train split: 0%| | 0/364868892 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 144, in _generate_tables
    dataset = json.load(f)
  File "/usr/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 147, in _generate_tables
    raise e
  File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 121, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/simla/wanda/main.py", line 110, in <module>
    main()
  File "/simla/wanda/main.py", line 69, in main
    prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
  File "/simla/wanda/lib/prune.py", line 132, in prune_wanda
    dataloader, _ = get_loaders("c4", nsamples=args.nsamples, seed=args.seed, seqlen=model.seqlen, tokenizer=tokenizer)
  File "/simla/wanda/lib/data.py", line 80, in get_loaders
    return get_c4(nsamples, seed, seqlen, tokenizer)
  File "/simla/wanda/lib/data.py", line 50, in get_c4
    traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
  File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

junzhang-zj commented 10 months ago

@simlaharma Have you tried downloading directly from the Hugging Face website and then loading it locally?

rsong0606 commented 7 months ago

@simlaharma

I had a similar issue. Check this post; it worked for me:

https://github.com/huggingface/datasets/issues/6746
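One workaround reported for the ExpectedMoreSplits error is to drop the named 'en' config and pass only data_files, so datasets does not verify the loaded split against the full config metadata. A sketch (the load_dataset calls need network access, so they are shown commented out):

```python
# Shard paths as used in wanda's get_c4; only one train and one validation
# shard are needed for calibration/evaluation.
train_files = {"train": "en/c4-train.00000-of-01024.json.gz"}
val_files = {"validation": "en/c4-validation.00000-of-00008.json.gz"}

# from datasets import load_dataset
# traindata = load_dataset("allenai/c4", data_files=train_files, split="train")
# valdata = load_dataset("allenai/c4", data_files=val_files, split="validation")

print(sorted(train_files) + sorted(val_files))  # ['train', 'validation']
```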

rakeshsai22 commented 3 months ago

Can we use wanda to prune the last linear layer in LLaMA-2?