microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] torch.cat(): expected a non-empty list of Tensors with Deepspeed Zero 3 with offload #4176

Closed: thechargedneutron closed this issue 1 month ago

thechargedneutron commented 1 year ago

Describe the bug

I am trying to train the LLaVA code with a transformer added on top of the model, and I get the following error when training with DeepSpeed ZeRO-3 offload. The error does not occur when I change the stage from 3 to 2 in the deepspeed_config (also attached below). However, with ZeRO-2 the code goes out of memory, which is again unexpected: the model is 7B parameters and I am using 8x 32GB GPUs, which should be well within DeepSpeed's capabilities.

RuntimeError: torch.cat(): expected a non-empty list of Tensors

Full log where I use a simple transformer encoder:

File "/private/home/ashutoshkr/code/path/to/my/code/debug_classifier_3.py", line 36, in forward
    x = self.transformer_encoder(x, src_key_padding_mask=src_key_padding_mask)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 315, in forward
    output = mod(output, src_mask=mask, is_causal=is_causal, src_key_padding_mask=src_key_padding_mask_for_layers)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 591, in forward
    x = self.norm1(x + self._sa_block(x, src_mask, src_key_padding_mask, is_causal=is_causal))
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 599, in _sa_block
    x = self.self_attn(x, x, x,
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 371, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 483, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 268, in fetch_sub_module
    self.__inflight_param_registry.pop(param).wait()
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 591, in wait
    param.data = instrument_w_nvtx(torch.cat)(partitions).view(param.ds_shape)
  File "/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
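
For context, the failing call sits inside a plain torch.nn.TransformerEncoder head. Below is a minimal sketch of that kind of module (assuming something like the debug_classifier_3.py in the traceback above; the class name, dimensions, and the classifier on top are placeholders, not the reporter's actual code):

import torch
import torch.nn as nn

class DebugClassifierSketch(nn.Module):
    """Hypothetical stand-in for the custom head added on top of LLaVA."""

    def __init__(self, hidden_dim=4096, num_heads=8, num_layers=2, num_classes=10):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, src_key_padding_mask=None):
        # Under ZeRO-3, a pre-forward hook on self_attn gathers the partitioned
        # weights; the traceback above shows torch.cat() receiving an empty
        # partition list inside that hook.
        x = self.transformer_encoder(x, src_key_padding_mask=src_key_padding_mask)
        return self.classifier(x[:, 0])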

ds_report output

[2023-08-19 13:38:00,925] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/private/home/ashutoshkr/.conda/envs/llava/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.5, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

Deepspeed Config

{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 1e5,
  "wall_clock_breakdown": false
}

System info (please complete the following information):

Launcher context

I am using the deepspeed launcher.
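
For reference, the launch typically looks something like the following (the script name and its flags are placeholders for the LLaVA training entry point; the actual command was not included in the report):

deepspeed --num_gpus=8 train_llava.py --deepspeed ./zero3_offload_config.json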

Related Issues

https://github.com/huggingface/peft/issues/440

ZhaoChuyang commented 11 months ago

I'm hitting the same problem.

egesko commented 10 months ago

Any updates on this?

ZhaoChuyang commented 10 months ago

I think this may be an issue with the ZeRO-3 implementation. Before loading the pretrained checkpoint, I find that the parameters of all modules are zero-sized, e.g., their shape is [0, ...]. After loading from the pretrained checkpoint, the shapes go back to normal. The problem happens when I initialize custom modules whose parameters cannot be found in the checkpoint; oddly, initializing those modules outside __init__() resolves it. I suspect that in the DeepSpeed implementation the parameter shapes are inferred and fixed up when the checkpoint is loaded?
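
That observation matches how ZeRO-3 behaves: parameters are partitioned as soon as the model is constructed, so each parameter's data becomes a 0-sized placeholder and the original shape is kept in param.ds_shape (the same attribute seen in the traceback above). When a custom module has no weights in the pretrained checkpoint and needs explicit initialization under ZeRO-3, one common pattern is deepspeed.zero.GatheredParameters. A hedged sketch follows; the module and the initialization scheme are illustrative, not a verified fix for this issue:

import torch
import deepspeed
import deepspeed.comm as dist

# Illustrative only: constructing a module under ZeRO-3 partitions its weights
# immediately, leaving 0-sized placeholders on each rank.
with deepspeed.zero.Init():
    custom_head = torch.nn.Linear(4096, 10)

for p in custom_head.parameters():
    # p.shape is typically torch.Size([0]) here; the full shape is in p.ds_shape.
    print(p.shape, getattr(p, "ds_shape", None))

# To initialize (or inspect) the full tensors, gather them temporarily and
# modify them on rank 0; DeepSpeed re-partitions them on exit.
with deepspeed.zero.GatheredParameters(list(custom_head.parameters()), modifier_rank=0):
    if dist.get_rank() == 0:
        torch.nn.init.xavier_uniform_(custom_head.weight)
        torch.nn.init.zeros_(custom_head.bias)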

ethansmith2000 commented 8 months ago

Running into it on Stage 1. FWIW, I'm providing a concrete list of tensors covering only a portion of the parameters (as opposed to a generator), and optimizer=None when calling DS.initialize.
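
For comparison, the calling pattern described above looks roughly like the following sketch (the model, the parameter subset, and the config path are placeholders, not the commenter's code):

import torch
import deepspeed

# Placeholder model; the point is the calling pattern: a concrete list covering
# only part of the parameters, and no client optimizer, so DeepSpeed builds the
# optimizer from the JSON config's "optimizer" section instead.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.Linear(1024, 10))
trainable_params = list(model[0].parameters())  # only a portion of the model

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=trainable_params,
    optimizer=None,                   # as in the comment above
    config="ds_config_zero1.json",    # hypothetical config with "stage": 1
)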

Veason-silverbullet commented 2 months ago

Same error. Since I have to use ZeRO-3 (my GPU resources are very limited, sadly), is there any solution to this problem other than falling back to ZeRO-2?

Wang-Xiaodong1899 commented 2 months ago

Any update? I can run successfully with ZeRO-3 but hit this error when using ZeRO-2. ZeRO-3 config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

ZeRO-2 config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": false,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Any solution?

Wang-Xiaodong1899 commented 2 months ago

> Any update? I can run successfully with ZeRO-3 but hit this error when using ZeRO-2. ZeRO-3 config: [...] ZeRO-2 config: [...] Any solution?

Deleting the optimizer works!
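
Presumably this means removing the "optimizer" block from the ZeRO-2 JSON above and constructing the optimizer in client code instead. A minimal sketch of that workaround (the model and hyperparameters are placeholders, and this is the commenter's workaround, not an official fix):

import torch
import deepspeed

# Placeholder model; the optimizer is built in client code and handed to
# DeepSpeed, instead of being declared in the config's "optimizer" section.
model = torch.nn.Linear(1024, 1024)
client_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=client_optimizer,      # replaces the deleted "optimizer" block
    config="ds_config_zero2.json",   # the ZeRO-2 config above, minus "optimizer"
)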

GuanhuaWang commented 1 month ago

Hi @thechargedneutron, thanks for reporting this issue. Could you try running the following script with a ZeRO-3 + offload (param + optimizer state) setting similar to yours and see if the error goes away?

Just run: deepspeed BELOW_PYTHON.py --zero 3

# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

import os
import json
import argparse
import torch
import deepspeed
from torch.utils.data.distributed import DistributedSampler
import deepspeed.comm as dist

class SimpleModel(torch.nn.Module):

    def __init__(self, hidden_dim, empty_grad=False):
        super(SimpleModel, self).__init__()
        self.linear = torch.nn.Linear(hidden_dim, hidden_dim)
        self.linear2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.linear3 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.linear4 = torch.nn.Linear(hidden_dim, hidden_dim)
        if empty_grad:
            self.layers2 = torch.nn.ModuleList([torch.nn.Linear(hidden_dim, hidden_dim)])
        self.cross_entropy_loss = torch.nn.CrossEntropyLoss()

    def forward(self, x, y):
        hidden = x
        hidden = self.linear(hidden)
        hidden = self.linear2(hidden)
        hidden = self.linear3(hidden)
        hidden = self.linear4(hidden)
        return self.cross_entropy_loss(hidden, y)

def create_config_from_dict(tmpdir, config_dict):
    config_path = os.path.join(tmpdir, 'temp_config.json')
    with open(config_path, 'w') as fd:
        json.dump(config_dict, fd)
    return config_path

def get_data_loader(model, total_samples, hidden_dim, device):
    batch_size = model.train_micro_batch_size_per_gpu()
    train_data = torch.randn(total_samples, hidden_dim, device=device, dtype=torch.half)
    train_label = torch.empty(total_samples, dtype=torch.long, device=device).random_(hidden_dim)
    train_dataset = torch.utils.data.TensorDataset(train_data, train_label)
    sampler = DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
    return train_loader

def get_args(tmpdir, config_dict):
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument('--zero', type=int, default=0)
    args = parser.parse_args()  #args=''

    config_dict["zero_optimization"]["stage"] = args.zero
    print('config_dict["zero_optimization"]', config_dict["zero_optimization"])
    config_path = create_config_from_dict(tmpdir, config_dict)

    args.deepspeed_config = config_path
    return args

def print0(msg):
    if dist.get_rank() == 0:
        print(msg, flush=True)

rank = int(os.environ['RANK'])
print('seed:', 2222 + rank)
torch.random.manual_seed(2222 + rank)

config_dict = {
    "train_batch_size": 256,
    "steps_per_print": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.00015,
        }
    },
    "fp16": {
        "enabled": True,
        "initial_scale_power": 15
    },
    "zero_optimization": {
        "stage": 3,
        "sub_group_size": 8,
        "reduce_bucket_size": 20,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
    }
}
#        "initial_scale_power": 15
args = get_args('/tmp/', config_dict)
hidden_dim = 8 * 1024

model = SimpleModel(hidden_dim, empty_grad=False)

model, _, _, _ = deepspeed.initialize(args=args,
                                      model=model,
                                      model_parameters=model.parameters(),
                                      dist_init_required=True)

def print_params(tag, model):
    if dist.get_rank() == 0:
        for n, p in model.named_parameters():
            print0("{} {}:{}".format(tag, n, p))

data_loader = get_data_loader(model=model, total_samples=4096, hidden_dim=hidden_dim, device=model.device)
#print_params('pre-train', model)
#while True:
for n, batch in enumerate(data_loader):
    loss = model(batch[0], batch[1])
    if dist.get_rank() == 0:
        print("LOSS:", loss.item())
    model.backward(loss)
    model.step()
    #print_params('step={}'.format(n), model)

It seems to work correctly on my side.

[screenshot: z3-offload-op-p]

thechargedneutron commented 1 month ago

Hi @GuanhuaWang, I am not actively working on this project anymore. Can you please tag other people who are experiencing similar issues? I also do not remember what solution I ended up using.

Otherwise, we can close this issue.

CC: @Wang-Xiaodong1899 @Veason-silverbullet

GuanhuaWang commented 1 month ago

Since there has been no update on this for a week, closing it for now.