princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Shape mismatch #52

Closed: coderchem closed this issue 7 months ago

coderchem commented 7 months ago

Below is the content of my config.pt file:

    {
        'data_local': '/workspace/LLM-shearing/LLM-Shearing/llmshearing/data/mds_sample_redpajama/for_prune',
        'data_remote': None,
        'tokenizer_name': '/workspace/LLM-shearing/models/Llama-2-7b-hf',
        'max_seq_len': 4096,
        'global_seed': 17,
        'run_name': 'llama2_7b_pruning_scaling_constant_to2.7b_sl4096',
        'model': {
            'name': 'mosaic_llama2_7b',
            'path': '/workspace/LLM-shearing/models/Llama-2-7b-composer/state_dict.pt',
            'init_device': 'cpu',
            'tokenizer_name': '${tokenizer_name}',
            'd_model': 4096,
            'n_heads': 32,
            'n_layers': 32,
            'intermediate_size': 11008,
            'max_seq_len': '${max_seq_len}',
            'vocab_size': 32000,
            'init_std': 0.02,
            'attn_pdrop': 0.0,
            'resid_pdrop': 0.0,
            'emb_pdrop': 0.0,
            'attn_impl': 'flash',
            'rms_norm_eps': 1e-05,
            'l0_module': {
                'start_sparsity': 0.0,
                'target_sparsity': 0.5,
                'pruning_modules': ['head', 'intermediate', 'layer', 'hidden'],
                'lagrangian_warmup_steps': '640ba',
                'target_model': {'d_model': 2560, 'n_layers': 32, 'n_heads': 20, 'intermediate_size': 6912, 'vocab_size': 32000},
                'eval_target_model': False,
            },
        },
        'tokenizer': {
            'type': 'hftokenizer',
            'args': {'tokenizer_name': '${tokenizer_name}', 'max_seq_len': '${max_seq_len}'},
        },
        'train_loader': {
            'name': 'text',
            'dataset': {
                'local': '${data_local}',
                'remote': '${data_remote}',
                'split': 'wikipedia',
                'shuffle': True,
                'tokenizer_name': '${tokenizer_name}',
                'max_seq_len': '${max_seq_len}',
                'shuffle_seed': '${global_seed}',
                'is_uint16': True,
            },
            'drop_last': True,
            'num_workers': 0,
            'prefetch_factor': None,
            'persistent_workers': False,
        },
        'eval_loader': {
            'name': 'text',
            'dataset': {
                'local': '/workspace/LLM-shearing/LLM-Shearing/llmshearing/data/mds_sample_redpajama/eval',
                'remote': '${data_remote}',
                'split': 'eval_merge',
                'shuffle': False,
                'tokenizer_name': '${tokenizer_name}',
                'max_seq_len': '${max_seq_len}',
                'shuffle_seed': '${global_seed}',
                'is_uint16': True,
            },
            'drop_last': False,
            'num_workers': 8,
        },
        'scheduler': {'name': 'cosine_with_warmup', 't_warmup': '320ba', 'alpha_f': 0.1},
        'optimizer': {'name': 'decoupled_adamw', 'lr': 0.0001, 'betas': [0.9, 0.95], 'eps': 1e-08, 'weight_decay': 0.0, 'lag_lr': 1.0},
        'algorithms': {'gradient_clipping': {'clipping_type': 'norm', 'clipping_threshold': 1.0}},
        'max_duration': '3200ba',
        'eval_interval': '50ba',
        'eval_subset_num_batches': 1000,
        'global_train_batch_size': 8,
        'seed': '${global_seed}',
        'device_eval_batch_size': 2,
        'device_train_microbatch_size': 4,
        'precision': 'amp_bf16',
        'fsdp_config': {'sharding_strategy': 'FULL_SHARD', 'mixed_precision': 'DEFAULT', 'activation_checkpointing': True, 'activation_cpu_offload': False, 'verbose': False},
        'progress_bar': False,
        'log_to_console': True,
        'console_log_interval': '1ba',
        'callbacks': {
            'speed_monitor': {'window_size': 10},
            'memory_monitor': {},
            'lr_monitor': {},
            'data_loading': {
                'dynamic': True,
                'update_type': 'constant',
                'proportion': [0.67, 0.045, 0.045, 0.02, 0.045, 0.025, 0.15],
                'set_names': ['cc', 'github', 'book', 'stackexchange', 'wiki', 'arxiv', 'c4-rp'],
                'target_loss': None,
            },
        },
        'loggers': {
            'wandb': {
                'project': 'pruning',
                'name': '${run_name}',
                'entity': 'pruning',
                'init_kwargs': {'mode': 'offline', 'dir': '/workspace/LLM-shearing/models/llama2_7b_pruning_scaling_constant_to2.7b_sl4096'},
            },
        },
        'save_interval': '3200ba',
        'save_folder': '/workspace/LLM-shearing/models/llama2_7b_pruning_scaling_constant_to2.7b_sl4096',
        'eval_first': False,
        'autoresume': False,
    }
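The model.l0_module.target_model block above is what defines the pruned target shape (d_model 2560, 20 heads, intermediate_size 6912, 32 layers). For reference, here is a minimal sketch for listing which entries of the saved composer checkpoint belong to the l0_module; the checkpoint filename under save_folder is a placeholder, use whatever file Composer actually wrote there:

```python
import torch

# Placeholder path: whatever checkpoint Composer saved under save_folder above.
ckpt_path = "/workspace/LLM-shearing/models/llama2_7b_pruning_scaling_constant_to2.7b_sl4096/latest-rank0.pt"
ckpt = torch.load(ckpt_path, map_location="cpu")

# Composer checkpoints typically nest the weights under state -> model;
# otherwise treat the file as a plain state dict.
weights = ckpt.get("state", {}).get("model", ckpt) if isinstance(ckpt, dict) else ckpt

for key, value in weights.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    tag = "  [l0_module]" if "l0_module" in key else ""
    print(f"{key}: {shape}{tag}")
```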

I ran into two problems when executing composer_to_hf. The first is that converting the model produced by training fails with a KeyError because the l0_module keys cannot be found. The relevant code is:

    num_layers = get_layer_num_from_weights(weights)
    keymap = get_key_map_from_composer_to_hf(num_layers)
    hf_weights = {keymap[key]: weights[key] for key in weights if "rotary" not in key}

This comprehension only filters out the weights whose names contain "rotary", but weights also contains l0_module-related parameters that have no entry in keymap, so the conversion to HF fails with the KeyError. I tried dropping the l0-related weights (roughly the change sketched below), but the model saved that way then reports a shape mismatch.

1. Is removing the l0-related weights a valid approach?
2. Is there a problem with my training config?

Thanks for your support.
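The change I tried is roughly the sketch below (my own local modification, not the repo's composer_to_hf code). It only drops the l0_module entries and does not slice the remaining tensors, which is presumably why the converted model still ends up with the original 7B shapes (d_model 4096, intermediate_size 11008) instead of the target_model sizes from the config above:

```python
# Local workaround sketch (not the original composer_to_hf code):
# skip both the rotary buffers and the l0_module parameters when
# building the HF state dict. The surviving tensors are copied as-is,
# i.e. still at their unpruned shapes.
hf_weights = {
    keymap[key]: weights[key]
    for key in weights
    if "rotary" not in key and "l0_module" not in key
}
```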