[BUG] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Describe the bug When training model using deepspeed 0.14.2. I got this error:

Traceback (most recent call last):                                                                                                            
    File "/opt/anaconda3/envs/openchat/lib/python3.10/runpy.py", line 196, in _run_module_as_main                                       
      return _run_code(code, main_globals, None,                                                                                                
    File "/opt/anaconda3/envs/openchat/lib/python3.10/runpy.py", line 86, in _run_code                                                  
      exec(code, run_globals)                                                                                                                   
    File "/opt/openchat/ochat/training_deepspeed/train.py", line 551, in <module>
      train(args)
    File "/opt/openchat/ochat/training_deepspeed/train.py", line 440, in train                                                          
      model_engine.step()                                                                                                                       
    File "/opt/anaconda3/envs/openchat/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2169, in step                    
      self._take_model_step(lr_kwargs)                                                                                                          
    File "/opt/anaconda3/envs/openchat/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step        
      self.optimizer.step()                                                                                                                     
    File "/opt/anaconda3/envs/openchat/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn                    
      ret_val = func(*args, **kwargs)                                                                                                           
    File "/opt/anaconda3/envs/openchat/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step               
      self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)                                                                        
    File "/opt/anaconda3/envs/openchat/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn                    
      ret_val = func(*args, **kwargs)                                                                                                           
    File "/opt/anaconda3/envs/openchat/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads                                                                                                                                                         
      self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)                                                            
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Here is sample script to reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig
import argparse
import deepspeed

def main(args):
    deepspeed.init_distributed(dist_backend="nccl")
    dsconfig = HfDeepSpeedConfig(args.deepspeed_config)

    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)

    inputs = tokenizer('This is sample text', return_tensors='pt')
    inputs['labels'] = inputs['input_ids'].clone()
    inputs = inputs.to(args.local_rank)

    model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path)

    model.gradient_checkpointing_enable()

    if dsconfig.is_offload():
        optimizer = deepspeed.ops.adam.DeepSpeedCPUAdam(model.parameters(), lr=3e-4)
    else:
        optimizer = deepspeed.ops.adam.FusedAdam(model.parameters(), lr=3e-4)

    model_engine, optimizer, _, _ = deepspeed.initialize(args=args, model=model, model_parameters=model.parameters(), optimizer=optimizer)

    model_engine.train()

    output = model_engine(**inputs)

    loss = output.loss

    model_engine.backward(loss)

    model_engine.step()

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='My training script.')
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='local rank passed from distributed launcher')
    parser.add_argument('--model_name_or_path', type=str,
                        help='Your model name to be trained')
    # Include DeepSpeed configuration arguments
    parser = deepspeed.add_config_arguments(parser)
    cmd_args = parser.parse_args()

    main(cmd_args)

Here is my deepspeed config json

{
    "bf16": {
        "enabled": true
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_clipping": 1.0,
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 1,

    "steps_per_print": 100,
    "wall_clock_breakdown": false
}

To Reproduce Steps to reproduce the behavior:

Run script above with deepspeed the_script.py --model_name_or_path your_model --deepspeed --deepspeed_config your-deepspeed-config.json
See error

Expected behavior There is no error and the script run successfully

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/anaconda3/envs/openchat/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/opt/anaconda3/envs/openchat/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.2, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 62.84 GB

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

OS: Ubuntu 22.04
4 machine with x2 A6000 each
Python version: 3.10

microsoft / DeepSpeed

[BUG] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! #5634