junzhang-zj opened 1 year ago
I have solved the data problem, but I ran into a new one. I used wanda to prune LLaMA-2-13B and got a zero ROUGE-2 score on CNN/DM, and my C4 perplexity with unstructured pruning is as high as 56050.3008.
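For reference, I measure perplexity with the usual fixed-window evaluation, roughly like this minimal sketch (not the repo's exact eval code):

import torch

@torch.no_grad()
def eval_ppl(model, input_ids, seqlen=2048):
    # Non-overlapping windows over the tokenized corpus; average the NLL, then exp.
    nlls = []
    for i in range(input_ids.shape[1] // seqlen):
        batch = input_ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
        loss = model(batch, labels=batch).loss  # mean NLL per token in the window
        nlls.append(loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))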
Hi, we just updated the repo to support pruning LLaMA-2 models; see here for the corresponding command. We also provide the results from our own run.
@Eric-mingjie Thanks!
@Eric-mingjie Is the ppl result related to the environment? I still get poor results on LLaMA-2.
I think for LLaMA-2, I used the transformers library at version 4.34.0.dev0 to load the models, specifically commit 0a55d9f7376f72ad3ff296d4249840021b03bcc4 on the main branch. What ppl number do you get?
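If you want to pin that exact commit, you can install it with something like:

pip install git+https://github.com/huggingface/transformers@0a55d9f7376f72ad3ff296d4249840021b03bcc4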
My environment is transformers 4.34.0.dev0 and accelerate 0.24.0.dev0; I get ppl 146760.7188, and now a lot of CUDA errors.
Hmm, can you load the llama-2-7b dense model and test the perplexity? In this case, you can simply pass --sparsity_ratio 0 to avoid doing any pruning.
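That would be roughly the README command with the ratio zeroed out (flags as in the repo README; model path illustrative):

python main.py \
    --model meta-llama/Llama-2-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0 \
    --sparsity_type unstructured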
OK, I will try it and check.
This is the output of conda env export from the conda environment I am running; hope this may be helpful. https://gist.github.com/Eric-mingjie/4ca851c64144d53800d60e4c74ebfbaf
@Eric-mingjie I get ppl 5.171178340911865 on wikitext_train and 4.883730888366699 on wikitext_test with Llama-2-13b and no pruning.
I think it might help to ask why 'wrapped_layers[name].scaler_row' ends up as an all-zero tensor, which is what causes the metric to fail. Have you run into this? It looks like something is wrong with the hook.
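For context, the wrapped layer accumulates per-input-channel activation norms from a forward hook, roughly like this sketch (simplified from the repo's WrappedGPT; if the hooked forward never runs, scaler_row stays all zeros):

import torch

class WrappedLayer:
    # Simplified sketch of wanda-style activation statistics.
    def __init__(self, layer):
        self.scaler_row = torch.zeros(layer.weight.shape[1], device=layer.weight.device)
        self.nsamples = 0

    def add_batch(self, inp):
        # inp: (batch, seq, in_features) activations seen by the hooked layer.
        inp = inp.reshape(-1, inp.shape[-1]).t().float()  # (in_features, tokens)
        tmp = inp.shape[1]
        # Running mean of squared per-channel L2 norms over all calibration tokens.
        self.scaler_row *= self.nsamples / (self.nsamples + tmp)
        self.nsamples += tmp
        self.scaler_row += torch.norm(inp, p=2, dim=1) ** 2 / self.nsamples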
😭, I finally found the bug: we need to set pretraining_tp to 1, otherwise the wrapped forward is never executed and the hook callback never fires. ppl of llama-2-13b (4:8): wikitext_train 7.27443265914917, wikitext_test 7.004149913787842.
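For anyone else who hits this, one way to force it at load time (model path illustrative):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf", torch_dtype="auto")
# With pretraining_tp > 1, the Llama forward splits its linear projections
# instead of calling the hooked nn.Linear modules, so calibration hooks never fire.
model.config.pretraining_tp = 1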
That's good to know. I was starting to rerun the code on my end.
I couldn't reach 'allenai/c4' on the Hub.
Hello @junzhang-zj, how did you solve the data problem? I get the following message:
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']
I changed the code for the c4 data to the following:
traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
Then, I started getting the following error:
File "/simla/wanda/lib/data.py", line 48, in get_c4 traindata = load_dataset('allenai/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train') File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset builder_instance.download_and_prepare( File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare self._download_and_prepare( File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1118, in _download_and_prepare verify_splits(self.info.splits, split_dict) File "/home/.local/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 92, in verify_splits raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits))) datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}
I tried downloading with:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
After downloading the whole dataset, I needed to change the load_dataset call to point at the local files, so I did the following:
traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
valdata = load_dataset('/simla/wanda/c4', 'en', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation', trust_remote_code=True)
Now I am getting the following error:
Failed to read file '/simla/wanda/c4/en/c4-train.00000-of-01024.json.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Generating train split: 0%| | 0/364868892 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 144, in _generate_tables
    dataset = json.load(f)
  File "/usr/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 147, in _generate_tables
    raise e
  File "/home/.local/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 121, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/simla/wanda/main.py", line 110, in <module>
    main()
  File "/simla/wanda/main.py", line 69, in main
    prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
  File "/simla/wanda/lib/prune.py", line 132, in prune_wanda
    dataloader, _ = get_loaders("c4", nsamples=args.nsamples, seed=args.seed, seqlen=model.seqlen, tokenizer=tokenizer)
  File "/simla/wanda/lib/data.py", line 80, in get_loaders
    return get_c4(nsamples, seed, seqlen, tokenizer)
  File "/simla/wanda/lib/data.py", line 50, in get_c4
    traindata = load_dataset('/simla/wanda/c4', 'en', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train', trust_remote_code=True)
  File "/home/.local/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/.local/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
@simlaharma Have you tried downloading directly from the huggingface website and then loading it locally?
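With locally downloaded shards you can also sidestep the dataset script entirely via the plain json builder, e.g. (paths illustrative):

from datasets import load_dataset

# The json builder exposes a single 'train' split per data_files argument,
# so the validation shard is also loaded with split='train'.
traindata = load_dataset('json', data_files='/path/to/c4/en/c4-train.00000-of-01024.json.gz', split='train')
valdata = load_dataset('json', data_files='/path/to/c4/en/c4-validation.00000-of-00008.json.gz', split='train')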
@simlaharma I had a similar issue; check this post, it worked for me.
Can we use Wanda to prune the last linear layer in LLaMA-2?