hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI
Apache License 2.0

ZeRO without using shard_param #133

Closed. powermano closed this issue 2 years ago

powermano commented 2 years ago

🐛 Describe the bug

When I use ZeRO with shard_param=False, the following error occurs:

Traceback (most recent call last):
  File "train.py", line 175, in <module>
  File "train.py", line 39, in main
    with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
  File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 75, in __init__
    self.config = ZeroContextConfig(target_device=target_device, replicated=True, shard_param=shard_param)
  File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 37, in __init__
    assert target_device.type == 'cuda', "Replicated no-shard paramters should locate in cuda."
AttributeError: 'int' object has no attribute 'type'

My init code is:

def main():
    parser = colossalai.get_default_parser()
    parser.add_argument('--use_trainer', action='store_true', help='whether to use trainer')
    args = parser.parse_args()


    logger = get_dist_logger()

    rank = int(os.environ['RANK'])
    # build resnet
    use_zero3 = hasattr(gpc.config, 'zero')
    if use_zero3:
        shard_strategy = TensorShardStrategy()
        with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
            model = resnet34(num_classes=10)
    else:
        model = resnet34(num_classes=10)

My config is:

from colossalai.amp import AMP_TYPE
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.nn.optimizer import HybridAdam

zero = dict(

optimizer = dict(
    # weight_decay=1e-2,

OUTPUT = './'

gradient_clipping = 5.0


pip install colossalai==0.1.5+torch1.10cu11.1 -f https://release.colossalai.org

ubuntu 18.04

powermano commented 2 years ago

If I modify the code as follows, passing a torch.device instead of the integer index, it works:

    rank = int(os.environ['RANK'])
    # build resnet
    use_zero3 = hasattr(gpc.config, 'zero')
    if use_zero3:
        shard_strategy = TensorShardStrategy()

        # with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
        #     model = resnet34(num_classes=10)
        with ZeroInitContext(target_device=torch.device('cuda', rank), shard_strategy=shard_strategy, shard_param=False):
            model = resnet34(num_classes=10)

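
The traceback comes down to a type mismatch: torch.cuda.current_device() returns a plain int, while the assertion in ZeroContextConfig reads target_device.type, which only a torch.device has. A minimal sketch of a helper that accepts either form is shown here. The name as_cuda_device is hypothetical and not part of ColossalAI or PyTorch. Note also that RANK is the global rank under the usual launchers, so on multi-node runs the device index would normally come from the local rank instead.

```python
import torch


def as_cuda_device(device):
    """Normalize an int index, a string, or a torch.device to a torch.device.

    Hypothetical helper for illustration only; not a ColossalAI API.
    """
    if isinstance(device, torch.device):
        return device
    if isinstance(device, int):
        # torch.cuda.current_device() returns a bare index like this one.
        return torch.device('cuda', device)
    # Strings such as 'cuda:1' are parsed by torch.device itself.
    return torch.device(device)


dev = as_cuda_device(0)
# A torch.device exposes the .type attribute that the assertion checks.
print(dev.type, dev.index)
```

With such a helper, the call site could pass as_cuda_device(torch.cuda.current_device()) as target_device without changing anything else.
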
powermano commented 2 years ago

@fastalgo I do not know how to save the ZeRO model params. When I use the save_checkpoint API, the saved file is much smaller than expected.
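
To judge whether a checkpoint really is too small, it can help to compare the file size with the raw number of bytes held by the tensors in the state dict. The following is a generic PyTorch sketch under that assumption. It does not call ColossalAI's save_checkpoint, and the helper names state_dict_nbytes and serialized_nbytes are hypothetical.

```python
import io

import torch
from torch import nn


def state_dict_nbytes(module):
    """Sum the raw bytes of every tensor in the module's state_dict."""
    return sum(t.numel() * t.element_size() for t in module.state_dict().values())


def serialized_nbytes(module):
    """Serialize the state_dict into memory and return the byte count."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Small stand-in model; replace with the real model in practice.
model = nn.Linear(4, 2)
expected = state_dict_nbytes(model)
saved = serialized_nbytes(model)
print(f"tensor bytes: {expected}, serialized bytes: {saved}")
```

A serialized size well below the tensor byte count would suggest that some parameters are not present in what was saved.
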

powermano commented 2 years ago

I tested ZeRO on a private dataset with ir18 (which is slightly different from the original resnet18). The table below shows the results. When I use PyTorch's native AMP, GPU memory usage is much lower than with ColossalAI. Why? The config is:

from colossalai.amp import AMP_TYPE
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.nn.optimizer import HybridAdam

fp16 = dict(

optimizer = dict(
    # weight_decay=1e-2,

| model | dataset | machine | batch | gradient accumulation size | ZeRO | speed | GPU memory | OPT | tensor_placement_policy | notes |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 24% 2089/8549 [02:51<08:39, 12.43it/s] | 8703M | HybridAdam |  | single machine + Engine |
| ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 19% 1599/8549 [02:24<10:21, 11.17it/s] | 5769M | HybridAdam |  | single machine + without Engine + native PyTorch fp16 |
| ir18 | private dataset | 2 | 64 | 2 | no ZeRO | 37% 1598/4274 [02:32<04:14, 10.50it/s] | 9011M | SGD |  | common data parallel |
| ir18 | private dataset | 2 | 64 | 1 | ZeRO + no shard params | 14% 606/4275 [01:25<08:27, 7.23it/s] | 9141M | HybridAdam | cuda |  |
| ir18 | private dataset | 2 | 64 | 1 | ZeRO + shard params | 13% 571/4275 [01:32<10:32, 5.85it/s] | 9073M | HybridAdam | cuda |  |
| ir18 | private dataset | 2 | 64 | 1 | ZeRO + shard params | 5% 217/4275 [01:37<29:16, 2.31it/s] | 6819M | HybridAdam | cpu |  |

powermano commented 2 years ago

The code without Engine is as follows:

model = ...
optimizer = ...
criterion = ...
amp = torch.cuda.amp.grad_scaler.GradScaler(growth_interval=1000)
global_step = 0
for epoch in range(gpc.config.NUM_EPOCHS):
    for idx, (img, label) in enumerate(train_dl):
        img = img.cuda()
        label = label.cuda()
        output, _ = model(img, label)
        train_loss = criterion(output, label)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
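
The loop as posted stops before the backward pass and never uses the GradScaler it creates, so the rest of the step is not visible in the thread. For reference, here is a minimal sketch of one complete native AMP step following the pattern in the PyTorch AMP documentation: scale the loss, unscale before clipping, then step and update. The tiny linear model, the SGD optimizer, and the train_step helper are assumptions for illustration and are not the author's code. AMP is disabled automatically when no GPU is present so the sketch also runs on CPU.

```python
import torch
from torch import nn


def train_step(model, optimizer, criterion, scaler, img, label, use_amp, max_norm=5.0):
    """Run one optimizer step with optional mixed precision and gradient clipping."""
    optimizer.zero_grad()
    # Forward pass and loss run under autocast when AMP is enabled.
    with torch.cuda.amp.autocast(enabled=use_amp):
        output = model(img)
        loss = criterion(output, label)
    # Scale the loss so fp16 gradients do not underflow.
    scaler.scale(loss).backward()
    # Unscale first so clipping sees the true gradient norms.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # step() skips the update if gradients contain inf or NaN.
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


use_amp = torch.cuda.is_available()
device = torch.device('cuda' if use_amp else 'cpu')
model = nn.Linear(8, 4).to(device)  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(growth_interval=1000, enabled=use_amp)

img = torch.randn(16, 8, device=device)
label = torch.randint(0, 4, (16,), device=device)
loss_value = train_step(model, optimizer, criterion, scaler, img, label, use_amp)
print(f"loss: {loss_value:.4f}")
```
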
powermano commented 2 years ago

The difference between my original PyTorch implementation and ColossalAI is the convert_to_amp API, which wraps the original model in TorchAMPModel. I have tested three different cases:

1. Using torch.cuda.amp.autocast(True) inside the model's forward function:

class model(nn.Module):
    def __init__(self):
        ...  # body omitted

    def forward(self, x, label):
        with torch.cuda.amp.autocast(True):
            ...  # body omitted
        return x

2. Using the @torch.cuda.amp.autocast() decorator on forward:

class model(nn.Module):
    def __init__(self):
        ...  # body omitted

    @torch.cuda.amp.autocast()
    def forward(self, x, label):
        ...  # body omitted
        return x

3. Using TorchAMPModel:

class model(nn.Module):
    def __init__(self):
        ...  # body omitted

    def forward(self, x, label):
        ...  # body omitted
        return x

model = model()
model = TorchAMPModel(model)

The first two behave normally and need only 5769M of GPU memory, but the third needs 8703M.
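
The thread does not say how these memory figures were measured. One way to compare variants like the three above on equal terms is to record the peak memory tracked by PyTorch's allocator around a single step. The sketch below assumes a hypothetical helper named peak_memory_mib and uses a small linear layer as a stand-in model. Its numbers will be lower than what nvidia-smi reports, because nvidia-smi also counts cached blocks and the CUDA context.

```python
import torch
from torch import nn


def peak_memory_mib(step_fn):
    """Run step_fn once and return peak allocated CUDA memory in MiB.

    Returns None when CUDA is unavailable. Hypothetical helper for illustration.
    """
    if not torch.cuda.is_available():
        return None
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    step_fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20


has_cuda = torch.cuda.is_available()
device = torch.device('cuda' if has_cuda else 'cpu')
model = nn.Linear(256, 256).to(device)  # stand-in for the real network
x = torch.randn(32, 256, device=device)


def fp32_step():
    # Plain full-precision forward and backward.
    model(x).sum().backward()


def autocast_step():
    # Same step with the forward pass under autocast.
    with torch.cuda.amp.autocast(enabled=has_cuda):
        loss = model(x).sum()
    loss.backward()


for name, fn in [("fp32", fp32_step), ("autocast", autocast_step)]:
    print(name, peak_memory_mib(fn))
```

Swapping in the real model and one wrapped by TorchAMPModel as the step functions would show whether the gap appears in allocator-tracked memory as well.
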

powermano commented 2 years ago

@feifeibear Can you help to verify the above problem? Thanks.

binmakeswell commented 2 years ago

https://github.com/hpcaitech/ColossalAI/issues/1082 Solved