microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Error while training with Deepspeed #4295

Closed · karths8 closed this 9 months ago

karths8 commented 1 year ago

Describe the bug
DeepSpeed runs into an error while training a CodeLlama-34B model with QLoRA using this script.

To Reproduce
Run the script with the DeepSpeed config file passed in as a parameter. The DeepSpeed config I used is given below:

{
  "bf16": {
    "enabled": "auto"
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 16777216,
    "stage3_prefetch_bucket_size": 15099494.4,
    "stage3_param_persistence_threshold": 40960,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
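
For context, a minimal sketch of how a config like this is typically wired into the Hugging Face Trainer (assumption: the finetuning script forwards its --deepspeed_path value to TrainingArguments; the values below are illustrative and not taken from the script in this issue):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    bf16=True,
    per_device_train_batch_size=1,
    # the "auto" entries in the JSON above are resolved from these arguments by the Trainer
    deepspeed="deepspeed_config_stage3.json",
)
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()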

Expected behavior

The expected behavior is for DeepSpeed training to run without errors. Instead, the following error (RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7ff729d61cb0>) pops up, with the full traceback given below:

[2023-09-08 19:26:04,877] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:07,007] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-08 19:26:07,007] [INFO] [runner.py:570:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None finetune_llama2_codegen.py --bf16 --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --model_name /workspace/CodeLlama-34b-Python-hf --dataset_name llama_data --save_steps 100 --num_train_epochs 2 --learning_rate 2e-5 --weight_decay 0.01 --lora_alpha 256 --lora_r 32 --use_qlora True --max_seq_length 8192 --run_name CodeGen-34B-combined-train --deepspeed_path /workspace/deepspeed_config_stage3.json
[2023-09-08 19:26:08,340] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:10,452] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-09-08 19:26:10,452] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-09-08 19:26:10,452] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-09-08 19:26:10,452] [INFO] [launch.py:163:main] dist_world_size=2
[2023-09-08 19:26:10,452] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-09-08 19:26:12,700] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:12,745] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:19,891] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-08 19:26:19,891] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-08 19:26:20,266] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-08 19:27:56,714] [INFO] [partition_parameters.py:342:__exit__] finished initializing model - num_params = 435, num_elems = 33.74B
Loading checkpoint shards: 100%|█████████████| 7/7 [03:06<00:00, 26.65s/it]
Loading checkpoint shards: 100%|█████████████| 7/7 [03:06<00:00, 26.61s/it]
trainable params: 39,321,600 || all params: 33,783,291,904 || trainable%: 0.11639363064954678
Map:   0%|                                 | 0/2787 [00:00<?, ? examples/s]trainable params: 39,321,600 || all params: 33,783,291,904 || trainable%: 0.11639363064954678
Map: 100%|█████████████████████| 2787/2787 [00:04<00:00, 630.71 examples/s]
Map: 100%|█████████████████████| 2787/2787 [00:04<00:00, 656.34 examples/s]
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu118/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
[2/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
[3/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 20.74338674545288 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 12.17668604850769 seconds
Parameter Offload: Total persistent parameters: 2367488 in 145 params
You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/workspace/finetune_llama2_codegen.py", line 545, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 141, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in custom_forward
    return module(*inputs, past_key_value, output_attentions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
    dtype=get_only_unique_item(p.ds_tensor.dtype
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f74ee477ed0>
Traceback (most recent call last):
  File "/workspace/finetune_llama2_codegen.py", line 545, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 141, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in custom_forward
    return module(*inputs, past_key_value, output_attentions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
    dtype=get_only_unique_item(p.ds_tensor.dtype
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7ff729d61cb0>

ds_report output

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.10.3+542dc0d5, 542dc0d5, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 188.00 GB

Launcher context
Used the deepspeed launcher with the Hugging Face integration.

hamelsmu commented 11 months ago

I ran into this same exact issue as well.

Aillian commented 11 months ago

any solution?

tohtana commented 11 months ago

Some code paths for ZeRO-3 assume that all parameters in a model have the same dtype. This model has both uint8 and float32 parameters, which triggers the error. Let us consider how we can fix this.
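
For anyone who wants to confirm this on their own model, a quick diagnostic sketch (assumes `model` is the already-loaded 4-bit QLoRA model; this is not code from DeepSpeed itself):

from collections import Counter

# Count the distinct parameter dtypes ZeRO-3 will see when it coalesces all-gathers.
dtype_counts = Counter(p.dtype for p in model.parameters())
print(dtype_counts)
# A bitsandbytes 4-bit model typically reports a mix such as uint8 (quantized weights)
# and float32 (e.g. layer norms), which is what trips the get_only_unique_item check
# inside all_gather_coalesced.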

zyzfred commented 11 months ago

Some code paths for ZeRO-3 assume that all parameters in a model have the same dtype. This model has both uint8 and float32 parameters, which triggers the error. Let us consider how we can fix this.

Has this problem been fixed yet?

noobmaster29 commented 10 months ago

I have the same issue. I've attached my DeepSpeed config file. I'm running my training off the Axolotl library.

ds_config_zero3.json

tohtana commented 10 months ago

I submitted #4647 to address this issue. It is working in my environment. I would appreciate it if anyone could try it.

momozzing commented 10 months ago

I submitted #4647 to address this issue. It is working in my environment. I would appreciate it if anyone could try it.

Thank you for https://github.com/microsoft/DeepSpeed/pull/4647!! It works well in my environment, too!

momozzing commented 10 months ago

I submitted #4647 to address this issue. It is working in my environment. I would appreciate it if anyone could try it.

Hi tohtana, I found an issue.

I switched to your code and training went fine, but the LoRA weights did not match when I ran inference.

model.save_pretrained(my_model) -> adapter_model.bin size -> 163KB.

I think the LoRA weights were not saved.

How can I solve this problem?

size mismatch for base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.k_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.k_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.o_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.o_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.mlp.gate_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([13824, 64]).
size mismatch for base_model.model.model.layers.10.mlp.up_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([13824, 64]).
size mismatch for base_model.model.model.layers.10.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 13824]).
size mismatch for base_model.model.model.layers.10.mlp.down_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).

tohtana commented 10 months ago

Hi @momozzing, can you share the code to reproduce this?

momozzing commented 10 months ago

Hi @momozzing, can you share the code to reproduce this?

OK, my baseline model is LLaMA.

ZeRO stage 2 works well with this code, but ZeRO stage 3 does not.

Code

# imports assumed for this snippet; helpers such as print_rank_0 and
# get_trainable_parameters are defined elsewhere in the training script
import torch
import bitsandbytes as bnb
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, LlamaConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

tokenizer = AutoTokenizer.from_pretrained(config["model"]["tokenizer_path"], eos_token='<|endoftext|>', add_bos_token=False)

model_config = LlamaConfig.from_pretrained(config["model"]["model_path"])
model_config.eos_token_id = tokenizer.eos_token_id
model_config.use_cache = False

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    config["model"]["model_path"],
    config=model_config,
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
        r=config["lora"]["r"],
        lora_alpha=config["lora"]["lora_alpha"],
        target_modules=config["lora"]["target_modules"],
        lora_dropout=config["lora"]["lora_dropout"],
        bias=config["lora"]["bias"],
        task_type=config["lora"]["task_type"],
)

for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()
model.enable_input_require_grads()  
model=prepare_model_for_kbit_training(model)

## load lora
model = get_peft_model(model, lora_config)

optimizer = bnb.optim.PagedAdam32bit(model.parameters(), lr=2e-4, betas=(0.9, 0.999)) # equivalent
print_rank_0(config, f"Trainable_parameters: {get_trainable_parameters(model)}", config["global_rank"])

model, _, _, _ = deepspeed.initialize(
    model=model,
    args={"local_rank":config["local_rank"], "global_rank":config["global_rank"]},
    config=config["ds_config"],
    optimizer = optimizer,
)

ds_config_zero3

  "ds_config":{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4,
  "bf16": {
    "enabled": true
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 10000
    }
  },
  "zero_optimization": {
    "stage": 3,   
    "allgather_partitions": true,
    "allgather_bucket_size":2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 2e9,
    "stage3_max_reuse_distance": 2e9,
    "stage3_gather_16bit_weights_on_model_save": true    
  },
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "steps_per_print": 100000
  }
}

ds_config_zero2

  "ds_config":{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4,
  "bf16": {
    "enabled": true
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 10000
    }
  },
  "zero_optimization": {
    "stage": 2,   
    "allgather_partitions": true,
    "allgather_bucket_size":2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,  
  },
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "steps_per_print": 100000
  }
}

tohtana commented 10 months ago

Hi @momozzing, it appears that the checkpoint for ZeRO-3 is partitioned, so we'll need to use DeepSpeed's loading function for it. You can find more information in the documentation.
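
For reference, a minimal sketch of the engine-level save/load round trip (assuming `engine` is the object returned by deepspeed.initialize; the paths and tag names are illustrative, not taken from this thread):

# ZeRO-3 writes one partitioned shard per rank, so the checkpoint must be loaded
# through the engine API rather than torch.load / from_pretrained.
engine.save_checkpoint("checkpoints", tag="step_1000")

# later, with the same world size:
load_path, client_state = engine.load_checkpoint("checkpoints", tag="step_1000")
if load_path is None:
    raise RuntimeError("no checkpoint found under ./checkpoints")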

Also, the error you mentioned seems to be distinct from the initial problem. If it persists, I suggest creating a new issue to address it.

momozzing commented 10 months ago

Hi @tohtana, Thank you for your answer.

I'm using this code: `deepspeed.DeepSpeedEngine.save_checkpoint(save_dir=save_dir, exclude_frozen_parameters=True)`

but `save_checkpoint` only saves the optimizer state; the model state is not saved.

-rw-rw-r-- 1 519K 09:32 zero_pp_rank_0_mp_rank_00_model_states.pt
-rw-rw-r-- 1 478M 09:32 zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 519K 09:32 zero_pp_rank_1_mp_rank_00_model_states.pt
-rw-rw-r-- 1 478M 09:32 zero_pp_rank_1_mp_rank_00_optim_states.pt

When I save the trained model, there seems to be an issue where the LoRA parameters are saved with size torch.Size([0]).

Is there any way to save LoRA's trained weights?

tohtana commented 10 months ago

Hi @momozzing, I haven't run the code, but isn't `zero_pp_rank_0_mp_rank_00_model_states.pt` the model state? Since you specified `exclude_frozen_parameters=True`, it only contains the parameters that are trained for LoRA.

You can find an example of the combination of ZeRO3 and LoRA in DeepSpeed-Chat. In the following example, it saves all the parameters including ones for LoRA. https://github.com/microsoft/DeepSpeedExamples/blob/ccb2a3400a05ea075b643bb3aeabb02f9883c5da/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L385

momozzing commented 10 months ago

Hi @tohtana

LLaMA + QLoRA without DeepSpeed saves adapter_model.bin at 477 MB.

LLaMA + QLoRA with DeepSpeed ZeRO-2 saves adapter_model.bin at 477 MB.

But LLaMA + QLoRA with DeepSpeed ZeRO-3 saves adapter_model.bin at only 519 KB.

So there seems to be an issue where the LoRA parameters are saved with size torch.Size([0]).

Is there any way to save LoRA's trained weights with DeepSpeed ZeRO-3?

Does DeepSpeed ZeRO-3 support bitsandbytes?

tohtana commented 10 months ago

Hi @momozzing

ZeRO-3 sets an empty size (torch.Size([0])) on each parameter object and keeps the real tensor data in a different attribute, so we cannot conclude that parameters were not saved just because torch.Size([0]) appears in the error message. ZeRO-3 also saves parameters in partitioned form, which is a different format from a normal PyTorch checkpoint, so we need to use DeepSpeed's API to load such a checkpoint. In your code you use AutoModelForCausalLM.from_pretrained(), which cannot properly load a checkpoint that ZeRO-3 saved.
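
A small sketch of what this looks like in practice (assumes `model` is already partitioned by ZeRO-3; ds_shape/ds_tensor are DeepSpeed-internal attributes, shown here only for illustration):

import deepspeed

p = next(model.parameters())
print(p.shape)     # torch.Size([0]) -- the placeholder ZeRO-3 leaves on the parameter object
print(p.ds_shape)  # the true shape; the partitioned data lives in p.ds_tensor

# Inside a gather context the full tensor is temporarily materialized.
with deepspeed.zero.GatheredParameters([p], modifier_rank=None):
    print(p.shape)  # full shape again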

Here is another example using HF trainer and LoRA. This script seems to save parameters properly. Can you check this as well? https://github.com/tohtana/ds_repro_4295/blob/main/finetune_llama_v2.py

momozzing commented 10 months ago

Hi @tohtana, as you said, using DeepSpeed's API solved the problem.

Here's how I solved it.

# consolidate the ZeRO-3 partitioned parameters into a full 16-bit state dict
state_dict = self.engine._zero3_consolidated_16bit_state_dict()
# pull only the LoRA weights out of the consolidated state dict
lora_state_dict = get_peft_model_state_dict(self.model, state_dict)
self.model.save_pretrained(save_dir)
torch.save(lora_state_dict, os.path.join(save_dir, "adapter_model.bin"))

Thank you very much for your reply.

stas00 commented 6 months ago

This is a workaround, not a proper solution, as it can be really expensive:

state_dict = self.engine._zero3_consolidated_16bit_state_dict()

get_peft_model_state_dict ideally needs to be fixed to become ZeRO-aware - it will need to handle both DeepSpeed ZeRO and FSDP. In the case of DeepSpeed, it needs to gather the weights the way it is done here:

https://github.com/huggingface/transformers/blob/81c8191b4651de216c00e25e1af607683e980614/src/transformers/modeling_utils.py#L605-L620

This is the efficient way of doing it, as it gathers one layer at a time and incurs little memory overhead.
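
A rough sketch of that per-parameter gathering pattern applied to LoRA weights (this is not the transformers or peft implementation; it assumes ZeRO-3, an initialized process group, and that LoRA parameter names contain "lora_"):

import deepspeed
import torch.distributed as dist

def gather_lora_state_dict(model):
    lora_sd = {}
    for name, param in model.named_parameters():
        if "lora_" not in name:
            continue
        # Materialize just this one partitioned parameter, then release it when the
        # context exits, so peak overhead stays at a single layer's worth of weights.
        with deepspeed.zero.GatheredParameters([param], modifier_rank=None):
            if dist.get_rank() == 0:
                lora_sd[name] = param.detach().cpu().clone()
    return lora_sd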

pacman100 commented 6 months ago

  1. With `zero.init` enabled, I get the error below with the latest branches of Accelerate and Transformers and the latest release of DeepSpeed:

    
    File "/raid/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
    File "/raid/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
    File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
    File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
    File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
    File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
        raise ValueError(set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
    
    File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    ValueError        : raise ValueError(raise ValueError(Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.

ValueErrorValueError: : Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.

2. Below is the memory usage with `zero_init=False` and QLoRA + DeepSpeed stage 3 for Llama 70B. GPU memory usage per GPU: 20% of 80 GB = 16 GB. However, the initial memory per GPU during model loading would be about 35 GB (0.5 bytes x 70B parameters), since each GPU loads the full pretrained model in 4 bits. If `zero_init` were enabled with QLoRA, one could finetune a 70B model on 8x 24 GB GPUs, which would be great (a back-of-envelope check follows the command below).
Code: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/sft/training
Command:

accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml" train.py \
  --seed 100 \
  --model_name_or_path "meta-llama/Llama-2-70b-hf" \
  --dataset_name "smangrul/ultrachat-10k-chatml" \
  --chat_template_format "chatml" \
  --add_special_tokens False \
  --append_concat_token False \
  --splits "train,test" \
  --max_seq_len 2048 \
  --num_train_epochs 1 \
  --logging_steps 5 \
  --log_level "info" \
  --logging_strategy "steps" \
  --evaluation_strategy "epoch" \
  --save_strategy "epoch" \
  --push_to_hub \
  --hub_private_repo True \
  --hub_strategy "every_save" \
  --bf16 True \
  --packing True \
  --learning_rate 1e-4 \
  --lr_scheduler_type "cosine" \
  --weight_decay 1e-4 \
  --warmup_ratio 0.0 \
  --max_grad_norm 1.0 \
  --output_dir "mistral-sft-lora-ds" \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing True \
  --use_reentrant True \
  --dataset_text_field "content" \
  --use_flash_attn True \
  --use_peft_lora True \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --lora_target_modules "all-linear" \
  --use_4bit_quantization True \
  --use_nested_quant True \
  --bnb_4bit_compute_dtype "bfloat16"
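
A back-of-envelope check of the memory numbers above (assumptions: 70B parameters stored as 4-bit NF4, ignoring quantization constants, activations, and LoRA/optimizer state):

n_params = 70e9
bytes_per_param = 0.5                            # 4 bits per weight
unsharded_gb = n_params * bytes_per_param / 1e9  # ~35 GB per GPU when every rank loads the full model
sharded_gb = unsharded_gb / 8                    # ~4.4 GB per GPU if zero_init shards the 4-bit weights across 8 GPUs
print(f"{unsharded_gb:.1f} GB unsharded vs {sharded_gb:.1f} GB sharded")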



Screenshot: GPU memory usage during QLoRA + ZeRO-3 Llama 70B training (https://github.com/microsoft/DeepSpeed/assets/13534540/ac0b2831-bcda-460c-a4ab-b9cb4d300d35)