locuslab / wanda

A simple and effective LLM pruning approach.
https://arxiv.org/abs/2306.11695
MIT License
652 stars · 87 forks

Llama2 `ExpectedMoreSplits` Exception #57

Open time-less-ness opened 4 months ago

time-less-ness commented 4 months ago

Command:

python main.py --model /data/Llama-2-7b-chat-hf/ --prune_method wanda --sparsity_ratio 0.5 --sparsity_type unstructured --save out/llama_2_7b/unstructured/wanda/

Result:

  File "Dev/wanda/main.py", line 110, in <module>
    main()
  File "Dev/wanda/main.py", line 69, in main
    prune_wanda(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
  File "Dev/wanda/lib/prune.py", line 132, in prune_wanda
    dataloader, _ = get_loaders("c4",nsamples=args.nsamples,seed=args.seed,seqlen=model.seqlen,tokenizer=tokenizer)
  File "Dev/wanda/lib/data.py", line 73, in get_loaders
    return get_c4(nsamples, seed, seqlen, tokenizer)
  File "Dev/wanda/lib/data.py", line 43, in get_c4
    traindata = load_dataset('allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
  File ".conda/envs/prune_llm/lib/python3.9/site-packages/datasets/load.py", line 1791, in load_dataset
    builder_instance.download_and_prepare(
  File ".conda/envs/prune_llm/lib/python3.9/site-packages/datasets/builder.py", line 891, in download_and_prepare
    self._download_and_prepare(
  File ".conda/envs/prune_llm/lib/python3.9/site-packages/datasets/builder.py", line 1004, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File ".conda/envs/prune_llm/lib/python3.9/site-packages/datasets/utils/info_utils.py", line 91, in verify_splits
    raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}

In case it matters, here are the contents of the Llama-2-7b-chat-hf folder:

Llama-2-7b-chat-hf/
total 39509433
-rw-r--r-- 1 usr grp         21 Jun 17 14:06 added_tokens.json
-rw-r--r-- 1 usr grp        583 Jun 17 14:06 config.json
-rw-r--r-- 1 usr grp        200 Jun 17 14:06 generation_config.json
-rw-r--r-- 1 usr grp       7020 Jun 17 14:06 LICENSE.txt
-rw-r--r-- 1 usr grp 9976576152 Jun 17 14:15 model-00001-of-00002.safetensors
-rw-r--r-- 1 usr grp 3500296424 Jun 17 14:09 model-00002-of-00002.safetensors
-rw-r--r-- 1 usr grp      26788 Jun 17 14:06 model.safetensors.index.json
-rw-r--r-- 1 usr grp 9877989586 Jun 17 14:13 pytorch_model-00001-of-00003.bin
-rw-r--r-- 1 usr grp 9894801014 Jun 17 14:14 pytorch_model-00002-of-00003.bin
-rw-r--r-- 1 usr grp 7180990649 Jun 17 14:12 pytorch_model-00003-of-00003.bin
-rw-r--r-- 1 usr grp      26788 Jun 17 14:06 pytorch_model.bin.index.json
-rw-r--r-- 1 usr grp      10148 Jun 17 14:06 README.md
-rw-r--r-- 1 usr grp        435 Jun 17 14:06 special_tokens_map.json
-rw-r--r-- 1 usr grp        746 Jun 17 14:06 tokenizer_config.json
-rw-r--r-- 1 usr grp    1842764 Jun 17 14:06 tokenizer.json
-rw-r--r-- 1 usr grp     499723 Jun 17 14:06 tokenizer.model
-rw-r--r-- 1 usr grp       4766 Jun 17 14:06 USE_POLICY.md
piotr25691 commented 2 months ago

the solution from #36 might work here

Apply the change from https://huggingface.co/datasets/allenai/c4/discussions/7 in https://github.com/locuslab/wanda/blob/main/lib/data.py#L43-44
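For reference, a minimal sketch of what that fix might look like, assuming the shard filenames already used in wanda's `lib/data.py`: drop the legacy `'allenai--c4'` config-name argument and pass `data_files` directly, so `datasets` no longer compares the loaded splits against stale recorded metadata (the comparison that raises `ExpectedMoreSplits`). The function name `load_c4_shards` is made up for illustration:

```python
# Hypothetical patch for the load_dataset calls in wanda's lib/data.py
# get_c4(), following the allenai/c4 discussion linked above: removing
# the legacy 'allenai--c4' config name stops `datasets` from verifying
# the recorded splits, which is what raises ExpectedMoreSplits.
def load_c4_shards():
    # Imported lazily so the sketch can be read without `datasets` installed.
    from datasets import load_dataset

    # One train shard and one validation shard of English C4, as in the
    # original code, but without the second config-name argument.
    traindata = load_dataset(
        'allenai/c4',
        data_files={'train': 'en/c4-train.00000-of-01024.json.gz'},
        split='train',
    )
    valdata = load_dataset(
        'allenai/c4',
        data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'},
        split='validation',
    )
    return traindata, valdata
```

Calling this downloads the same two shards the original code requests, just without triggering the split verification.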