princeton-nlp / LESS

[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning

Step 2: KeyError when loading optimizer.pt while running get_train_lora_grads.sh #4

Open victorjiax opened 5 months ago

victorjiax commented 5 months ago

When loading optimizer.pt, the keys do not match what the code expects and I get KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'.

The keys in the optimizer.pt state are just the indices 0–255.

whi497 commented 5 months ago

Same error, have you solved it?

xiamengzhou commented 5 months ago

Hi, what transformers version are you using? I updated the requirement file to specify transformers==4.36.2.

JPegah commented 5 months ago

I am getting the same error despite using the same transformers version!

leopoldwhite commented 4 months ago

> Hi, what transformers version are you using? I updated the requirement file to specify transformers==4.36.2.

Same error using transformers==4.36.2.

xiamengzhou commented 4 months ago

Hi, I realized that you would have to use fsdp to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without fsdp, you will get optimizer.bin, which provides index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.

I am sure there is a workaround to get key-value based optimization states from index-value based optimization states, and one can probably reuse functions from optimizer.state_dict() in huggingface.
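For anyone unsure which format they have, here is a minimal sketch (not from the LESS codebase; the file names and the `state` sub-key are assumptions about how a Trainer/PyTorch optimizer checkpoint is laid out) for inspecting the keys of a saved optimizer state:

```python
# Minimal sketch, assuming a Trainer/PyTorch-style optimizer checkpoint: check whether
# the Adam states are keyed by parameter name (what the grad scripts expect) or by
# integer index (what a non-fsdp run produces).
import torch

ckpt = torch.load("optimizer.pt", map_location="cpu")  # or "optimizer.bin"
# Depending on how it was saved, the per-parameter states may sit under a "state" key.
adam_state = ckpt["state"] if isinstance(ckpt, dict) and "state" in ckpt else ckpt
print(list(adam_state.keys())[:3])
# Name keys like 'base_model.model...lora_A.default.weight' can be used directly;
# integer keys 0, 1, 2, ... are index-based and need to be remapped to names.
```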

GCYZSL commented 4 months ago

Hi, thank you for your solution. I added the arguments, and there is a new error:

RuntimeError: Cannot writeback when the parameter shape changes
Expects torch.Size([131076096]) but got torch.Size([32001, 4096])

xiamengzhou commented 4 months ago

It seems to be a flatten issue, could you provide the script and code you ran?

GCYZSL commented 4 months ago

Thank you for your response! I ran warmup_lora_train.sh, and it runs fine before adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune. I added the arguments in warmup_lora_train.sh as follows:

training_args="$base_training_args \
--model_name_or_path $model_path \
--output_dir $output_dir \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"

RrankPyramid commented 4 months ago

> Hi, I realized that you would have to use fsdp to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without fsdp, you will get optimizer.bin, which provides index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.
>
> I am sure there is a workaround to get key-value based optimization states from index-value based optimization states, and one can probably reuse functions from optimizer.state_dict() in huggingface.

@xiamengzhou Hi, I got the same error (KeyError: base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight), while I am loading optimizer.pt instead of optimizer.bin. Is there a way to solve this?

Tantor-D commented 4 months ago

> Hi, I realized that you would have to use fsdp to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without fsdp, you will get optimizer.bin, which provides index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here. I am sure there is a workaround to get key-value based optimization states from index-value based optimization states, and one can probably reuse functions from optimizer.state_dict() in huggingface.

> @xiamengzhou Hi, I got the same error (KeyError: base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight), while I am loading optimizer.pt instead of optimizer.bin. Is there a way to solve this?

I encountered the same error. When I tried running it without the --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune settings, I received optimizer.pt. Then, after modifying the code in get_info.py from optimizer.bin to optimizer.pt, I encountered a "KeyError" related to 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'. Has anyone found a solution to this issue?

xiamengzhou commented 4 months ago

@Tantor-D @RrankPyramid Could you check what the keys are like in your optimizer.pt file?

Tantor-D commented 4 months ago

@xiamengzhou Thank you for your reply! It seems I've identified an issue: The keys in the adam_optimizer_state dictionary appear as

dict_keys([0, 1, 2, 3, ..., 253, 254, 255])

However, the names list retrieved in the prepare_optimizer_state function of collect_grad_reps.py shows different information, indicating that the saved optimizer.pt may not be correctly storing key-value-based optimization states.

The names list appears as:

['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 
'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 
'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 
'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 
'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 
'base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight',

I will add --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to warmup_lora_train.sh and run again. Thanks again for your reply.
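For reference, a rough sketch of the remapping hinted at above, under the assumption that all trainable (LoRA) parameters ended up in a single optimizer param group, so index i is simply the i-th trainable parameter in model.named_parameters() order (the HF Trainer may split parameters into decay/no-decay groups, in which case this does not hold; Yupei-Du's workaround further down handles that case):

```python
# Rough sketch, assuming index i corresponds to the i-th trainable parameter in
# model.named_parameters() order (single param group). Not from the LESS codebase.
import torch

def remap_index_state_to_names(model, optimizer_state_path):
    names = [n for n, p in model.named_parameters() if p.requires_grad]
    ckpt = torch.load(optimizer_state_path, map_location="cpu")
    adam_state = ckpt["state"] if isinstance(ckpt, dict) and "state" in ckpt else ckpt
    assert len(names) == len(adam_state), "param-group assumption does not hold"
    return {name: adam_state[i] for i, name in enumerate(names)}
```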

tengerye commented 3 months ago

Hi @Tantor-D, have you found a solution yet?

After adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune, I get a new error:

05/04/2024 06:43:26 - WARNING - accelerate.accelerator - FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
  warnings.warn(
Traceback (most recent call last):
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/tye/workspace/less_influence/LESS/less/train/train.py", line 181, in <module>
    main()
  File "/data/tye/workspace/less_influence/LESS/less/train/train.py", line 161, in main
    train_result = trainer.train()
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1270, in prepare
    result = tuple(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1271, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1083, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/accelerate/accelerator.py", line 1429, in prepare_model
    model = FSDP(model, **kwargs)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 391, in __init__
    _auto_wrap(auto_wrap_kwargs, fsdp_kwargs, FullyShardedDataParallel)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 73, in _auto_wrap
    _recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 2 more times]
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
    _init_param_handle_from_module(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 429, in _init_param_handle_from_module
    _init_param_handle_from_params(state, managed_params, fully_sharded_module)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 525, in _init_param_handle_from_params
    handle = FlatParamHandle(
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 366, in __init__
    self._init_flat_param(params, fully_sharded_module, use_orig_params)
  File "/data/tye/anaconda3/envs/instruct/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 440, in _init_flat_param
    raise ValueError(
ValueError: `FlatParameter` requires uniform `requires_grad`

Tantor-D commented 3 months ago

@tengerye I solved the error by adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to less/scripts/train/warmup_lora_train.sh. The code now works well.

Here is the changed version.

training_args="$base_training_args \
--fsdp 'full_shard auto_wrap' \
--fsdp_config llama_finetune \
--model_name_or_path $model_path \
--output_dir $output_dir \
--percentage $percentage \
--data_seed $data_seed \
--train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"

tengerye commented 3 months ago

@Tantor-D Thank you so much for your kind reply. My problem came from wrong package versions in my environment, and it has been solved.

shangqing-liu commented 3 months ago

Hi @xiamengzhou, I have another question about the code. After testing it, I found that I need two rounds of warmup training: first I disable --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to finish one round of training and get optimizer1.bin, and then I use --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune for another round of training to get optimizer2.bin. After that, I have to move optimizer2.bin over optimizer1.bin because of the key problem (KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight').

Hence, may I ask how to merge both and get the warmup model with a single round of training?

Thanks.

shangqing-liu commented 2 months ago

The problem has been solved. Thanks

mihara-bot commented 2 months ago

> @tengerye I solved the error by adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to less/scripts/train/warmup_lora_train.sh. The code works well.
>
> Here is the changed version.
>
> training_args="$base_training_args \
> --fsdp 'full_shard auto_wrap' \
> --fsdp_config llama_finetune \
> --model_name_or_path $model_path \
> --output_dir $output_dir \
> --percentage $percentage \
> --data_seed $data_seed \
> --train_files ${train_files[@]} 2>&1 | tee $output_dir/train.log"

Hi, I used this code and it ran smoothly for Step 1, but at Step 2 I encountered the optimizer.bin not found problem: https://github.com/princeton-nlp/LESS/issues/18. Would you please kindly help me with it?
Best regards

Yupei-Du commented 1 month ago

> Hi, I realized that you would have to use fsdp to get the optimizer.pt file, which contains key (parameter-name)-value based optimization states. If you run without fsdp, you will get optimizer.bin, which provides index-value based optimization states. Could you try adding --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune to your training script? Also, you can add more fsdp configurations here.
>
> I am sure there is a workaround to get key-value based optimization states from index-value based optimization states, and one can probably reuse functions from optimizer.state_dict() in huggingface.

I have a very basic workaround for this index-value-based file; there are probably bugs, but so far it seems to work:

import torch
from transformers.optimization import AdamW
from transformers.trainer_pt_utils import get_parameter_names
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS

def load_adam_state(model, optimizer_state_path):
    # Rebuild the same two parameter groups (decay / no-decay split) that the HF Trainer
    # creates, so the index-based optimizer state can be matched back to parameter names.
    opt_grouped_parameters = [{'weight_decay': 0.0}, {'weight_decay': 0.0}]
    opt_grouped_parameter_names = [None, None]

    decay_parameters = [name for name in get_parameter_names(model, ALL_LAYERNORM_LAYERS) if 'bias' not in name]
    opt_grouped_parameters[0]['params'], opt_grouped_parameter_names[0] = zip(*[
        (p, n) for n, p in model.named_parameters() if n in decay_parameters and p.requires_grad])
    param_name_to_size_dict = {n: p.size() for n, p in model.named_parameters() if p.requires_grad}
    if len(param_name_to_size_dict) != len(opt_grouped_parameter_names[0]):
        opt_grouped_parameters[1]['params'], opt_grouped_parameter_names[1] = zip(*[
            (p, n) for n, p in model.named_parameters() if n not in decay_parameters and p.requires_grad])
    else:
        opt_grouped_parameters[1]['params'], opt_grouped_parameter_names[1] = [], []

    optimizer = AdamW(opt_grouped_parameters)
    optimizer.load_state_dict(torch.load(optimizer_state_path, map_location='cpu'))
    saved_state_dict = optimizer.state_dict()

    # Map each saved per-group parameter index back to its name and collect the Adam
    # moments (exp_avg, exp_avg_sq) under that name.
    param_name_to_saved_state_dict = {}
    for group_idx in range(len(saved_state_dict['param_groups'])):
        group_param_indices = saved_state_dict['param_groups'][group_idx]['params']
        group_param_names = opt_grouped_parameter_names[group_idx]
        for param_idx, param_name in zip(group_param_indices, group_param_names):
            param_size = param_name_to_size_dict[param_name]
            exp_avg = saved_state_dict['state'][param_idx]['exp_avg']
            exp_avg_sq = saved_state_dict['state'][param_idx]['exp_avg_sq']
            assert exp_avg.size() == param_size
            param_name_to_saved_state_dict[param_name] = {'exp_avg': exp_avg, 'exp_avg_sq': exp_avg_sq}

    return param_name_to_saved_state_dict
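A hypothetical usage sketch of the workaround above (the base model name and checkpoint paths are placeholders, and loading the adapter with peft's PeftModel with is_trainable=True is an assumption about how the warmup checkpoint was saved):

```python
# Hypothetical usage of load_adam_state(); model name and paths are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
# is_trainable=True keeps the LoRA weights' requires_grad=True, matching the warmup run.
model = PeftModel.from_pretrained(base, "path/to/warmup-checkpoint", is_trainable=True)

adam_state = load_adam_state(model, "path/to/warmup-checkpoint/optimizer.bin")
print(next(iter(adam_state.keys())))  # expected: a parameter name, not an integer index
```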