I ran into this same exact issue as well.
Any solution?
Some code for ZeRO3 assumes that all parameters in a model have the same dtype. This model has uint8 and float32 parameters, and that is what throws the error.
Let us consider how we can fix this.
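For context, a minimal diagnostic sketch (not from the thread): it lists the distinct parameter dtypes in a model. With bitsandbytes 4-bit loading, the quantized weights show up as uint8 while layer norms and other small parameters stay float32, which is the mix described above. The helper name is made up for illustration.

from collections import Counter
import torch

def summarize_param_dtypes(model: torch.nn.Module) -> Counter:
    # Count parameter tensors per dtype so a mixed uint8/float32 model is easy to spot.
    counts = Counter(p.dtype for p in model.parameters())
    for dtype, n in counts.items():
        print(f"{dtype}: {n} parameter tensors")
    return counts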
Have you fixed this problem yet?
I have the same issue. I've attached my deepspeed config file. I'm running my training off the Axolotl library.
I submitted #4647 to address this issue. It is working on my environment. I would appreciate it if anyone could try.
Thank you for your https://github.com/microsoft/DeepSpeed/pull/4647 !! It works well in my environment, too!
Hi @tohtana, I found an issue.
I switched to your code and training went fine, but the LoRA sizes didn't match when I ran inference.
model.save_pretrained(my_model) -> adapter_model.bin size -> 163KB.
I think the LoRA weights were not saved.
How can I solve this problem?
size mismatch for base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.k_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.k_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.o_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.o_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.mlp.gate_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([13824, 64]).
size mismatch for base_model.model.model.layers.10.mlp.up_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([13824, 64]).
size mismatch for base_model.model.model.layers.10.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 13824]).
size mismatch for base_model.model.model.layers.10.mlp.down_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
Hi @momozzing, can you share the code to reproduce this?
OK, my baseline model is LLaMA.
ZeRO stage 2 works well with this code; however, ZeRO stage 3 does not.
import torch
import bitsandbytes as bnb
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, LlamaConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `config`, `print_rank_0`, and `get_trainable_parameters` come from the surrounding training script.
tokenizer = AutoTokenizer.from_pretrained(config["model"]["tokenizer_path"], eos_token='<|endoftext|>', add_bos_token=False)

model_config = LlamaConfig.from_pretrained(config["model"]["model_path"])
model_config.eos_token_id = tokenizer.eos_token_id
model_config.use_cache = False

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    config["model"]["model_path"],
    config=model_config,
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=config["lora"]["r"],
    lora_alpha=config["lora"]["lora_alpha"],
    target_modules=config["lora"]["target_modules"],
    lora_dropout=config["lora"]["lora_dropout"],
    bias=config["lora"]["bias"],
    task_type=config["lora"]["task_type"],
)

for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()
model.enable_input_require_grads()
model = prepare_model_for_kbit_training(model)

## load lora
model = get_peft_model(model, lora_config)

optimizer = bnb.optim.PagedAdam32bit(model.parameters(), lr=2e-4, betas=(0.9, 0.999))  # equivalent

print_rank_0(config, f"Trainable_parameters: {get_trainable_parameters(model)}", config["global_rank"])

model, _, _, _ = deepspeed.initialize(
    model=model,
    args={"local_rank": config["local_rank"], "global_rank": config["global_rank"]},
    config=config["ds_config"],
    optimizer=optimizer,
)
"ds_config":{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 4,
"bf16": {
"enabled": true
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 2e-4,
"warmup_num_steps": 1000,
"total_num_steps": 10000
}
},
"zero_optimization": {
"stage": 3,
"allgather_partitions": true,
"allgather_bucket_size":2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true,
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 2e9,
"stage3_max_reuse_distance": 2e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"zero_allow_untested_optimizer": true,
"wall_clock_breakdown": false,
"steps_per_print": 100000
}
}
"ds_config":{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 4,
"bf16": {
"enabled": true
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 2e-4,
"warmup_num_steps": 1000,
"total_num_steps": 10000
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size":2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true,
},
"zero_allow_untested_optimizer": true,
"wall_clock_breakdown": false,
"steps_per_print": 100000
}
}
Hi @momozzing, it appears that the checkpoint for ZeRO3 is partitioned, so we'll need to use DeepSpeed's loading function for it. You can find more information in the documentation.
Also, the error you mentioned seems to be distinct from the initial problem. If it persists, I suggest creating a new issue to address it.
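For reference, a minimal sketch (not from the thread) of reloading a partitioned ZeRO-3 checkpoint through DeepSpeed's engine API; load_dir and tag are placeholders that must match whatever save_checkpoint wrote, and model, optimizer, and ds_config are the objects from the training script above.

import deepspeed

# Hedged sketch: reload a ZeRO-3 partitioned checkpoint with the engine's own API.
engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config)
load_path, client_state = engine.load_checkpoint(load_dir, tag=tag)
if load_path is None:
    raise RuntimeError(f"No DeepSpeed checkpoint found under {load_dir}")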
Hi @tohtana, Thank you for your answer.
I'm using this code:
deepspeed.DeepSpeedEngine.save_checkpoint(save_dir=save_dir, exclude_frozen_parameters=True)
but save_checkpoint only seems to save the optimizer state; the model state is not saved.
-rw-rw-r-- 1 519K 09:32 zero_pp_rank_0_mp_rank_00_model_states.pt
-rw-rw-r-- 1 478M 09:32 zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 519K 09:32 zero_pp_rank_1_mp_rank_00_model_states.pt
-rw-rw-r-- 1 478M 09:32 zero_pp_rank_1_mp_rank_00_optim_states.pt
When I save the trained model, the LoRA parameters seem to be saved with size torch.Size([0]).
Is there any way to save LoRA's trained weights?
Hi @momozzing,
I haven't run the code, but isn't zero_pp_rank_0_mp_rank_00_model_states.pt the model state? Since you specified exclude_frozen_parameters=True, it only contains the parameters that are trained for LoRA.
You can find an example of the combination of ZeRO3 and LoRA in DeepSpeed-Chat. In the following example, it saves all the parameters, including the ones for LoRA. https://github.com/microsoft/DeepSpeedExamples/blob/ccb2a3400a05ea075b643bb3aeabb02f9883c5da/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L385
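Separately from the DeepSpeed-Chat helper linked above, DeepSpeed's engine also has a built-in consolidated save. A minimal sketch, assuming the engine was created with "stage3_gather_16bit_weights_on_model_save": true (as in the ZeRO-3 config posted earlier) and with save_dir as a placeholder:

# Hedged sketch: write one consolidated 16-bit checkpoint from a ZeRO-3 engine.
# Requires "stage3_gather_16bit_weights_on_model_save": true in the ZeRO-3 config;
# "save_dir" and the file name are placeholders.
engine.save_16bit_model(save_dir, save_filename="pytorch_model.bin")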
Hi @tohtana,
LLaMA + QLoRA without DeepSpeed stores adapter_model.bin at 477MB.
LLaMA + QLoRA with DeepSpeed ZeRO-2 stores adapter_model.bin at 477MB.
But LLaMA + QLoRA with DeepSpeed ZeRO-3 stores adapter_model.bin at 519K.
So there seems to be an issue where the LoRA parameters are saved with size torch.Size([0]).
Is there any way to save LoRA's trained weights with DeepSpeed ZeRO-3?
Does DeepSpeed ZeRO-3 support bitsandbytes?
Hi @momozzing,
ZeRO3 sets an empty size (Size([0])) on the parameter object and keeps the real tensor data in a different attribute, so seeing torch.Size([0]) in the error message does not mean the parameters were not saved. ZeRO3 also saves partitioned parameters, which are in a different format from a normal PyTorch checkpoint, so we need to use DeepSpeed's API to load the checkpoint.
In your code, you use AutoModelForCausalLM.from_pretrained(). This cannot properly load a checkpoint that ZeRO3 saved.
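If the goal is to get the partitioned checkpoint back into a plain, non-DeepSpeed model object, DeepSpeed's zero_to_fp32 utilities are one option. A minimal sketch, assuming checkpoint_dir is the directory save_checkpoint wrote and that frozen parameters were excluded at save time (hence strict=False):

# Hedged sketch: consolidate a ZeRO-3 checkpoint into a regular fp32 state dict.
# "checkpoint_dir" is a placeholder; strict=False because only the trainable
# (LoRA) tensors are present when exclude_frozen_parameters=True was used.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
model.load_state_dict(state_dict, strict=False)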
Here is another example using HF trainer and LoRA. This script seems to save parameters properly. Can you check this as well? https://github.com/tohtana/ds_repro_4295/blob/main/finetune_llama_v2.py
Hi @tohtana, as you said, using DeepSpeed's API solved the problem.
Here's how I solved it:
state_dict = self.engine._zero3_consolidated_16bit_state_dict()  # gather the full 16-bit weights
lora_state_dict = get_peft_model_state_dict(self.model, state_dict)  # keep only the LoRA entries
self.model.save_pretrained(save_dir)
torch.save(lora_state_dict, os.path.join(save_dir, "adapter_model.bin"))
Thank you very much for your reply.
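For completeness, a hedged sketch (not from the thread) of loading the saved adapter back for inference with PEFT; base_model_path and save_dir are placeholders, and save_dir is assumed to contain both adapter_config.json and the adapter_model.bin written above.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the (quantized or full-precision) base model first, then attach the adapter.
base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, save_dir)  # reads adapter_config.json + adapter_model.bin
model.eval()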
This is a workaround, not a proper solution, as this can be really expensive:
state_dict = self.engine._zero3_consolidated_16bit_state_dict()
get_peft_model_state_dict ideally needs to be fixed to become ZeRO-aware; it'll need to do that for DeepSpeed ZeRO and FSDP as well. In the case of DeepSpeed, it needs to gather the weights like it's done here:
This is the efficient way of doing it, as it gathers one layer at a time and incurs little memory overhead.
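A minimal sketch of that per-parameter gathering, not the actual fix: it uses deepspeed.zero.GatheredParameters so the full model is never materialized at once; the "lora_" name filter and the helper name are illustrative assumptions.

import deepspeed
import torch

def gather_lora_state_dict(model: torch.nn.Module) -> dict:
    # Under ZeRO-3 each parameter object is empty (Size([0])) until gathered.
    lora_state = {}
    for name, param in model.named_parameters():
        if "lora_" not in name:
            continue
        # Gather just this one parameter across ranks, copy it, then release it.
        with deepspeed.zero.GatheredParameters([param], modifier_rank=None):
            lora_state[name] = param.detach().cpu().clone()
    return lora_state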
With zero.init enabled, I get the following with the latest branches of Accelerate and Transformers and the latest release of DeepSpeed:
    model = AutoModelForCausalLM.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
2. Below is the memory usage with `zero_init=False` and QLoRA + DeepSpeed stage 3 for Llama 70B. GPU memory usage per GPU: 20% of 80GB = 16GB per GPU. However, the initial memory per GPU during model loading would be 35GB (0.5 bytes/param * 70B), since each GPU loads the pretrained model in 4 bits. If `zero_init` were enabled with QLoRA, one could finetune the 70B model on 8x 24GB GPUs, which would be great.
Code: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/sft/training
Command:
accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml" train.py \
--seed 100 \
--model_name_or_path "meta-llama/Llama-2-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "mistral-sft-lora-ds" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing True \
--use_reentrant True \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization True \
--use_nested_quant True \
--bnb_4bit_compute_dtype "bfloat16"
[Screenshot: per-GPU memory usage for the Llama 70B QLoRA + ZeRO-3 run described above]
Describe the bug
DeepSpeed runs into a bug while training a CodeLlama-34B model with QLoRA using this script.
To Reproduce
Run the script with the DeepSpeed config file passed into the params. The DeepSpeed config I used is given below.
Expected behavior
DeepSpeed training completes without any errors. Instead, the following error pops up, with the traceback as given below:
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7ff729d61cb0>
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.10.3+542dc0d5, 542dc0d5, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 188.00 GB
Launcher context
Used the deepspeed launcher with the Hugging Face integration.