hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: Data parallelism and model parallelism have completely different weights #4319

Open · bobo0810 opened this issue 1 year ago

bobo0810 commented 1 year ago

🐛 Describe the bug

Code

Refer to this code https://github.com/hpcaitech/ColossalAI/blob/02192a632e6c6f965d93ec79937f97e10e121307/examples/tutorial/hybrid_parallel/train.py#L73

My code

    import torch
    from torchvision.models import resnet50

    from colossalai.context import ParallelMode
    from colossalai.core import global_context as gpc
    from colossalai.pipeline.pipelinable import PipelinableContext
    from colossalai.utils import save_checkpoint

    # Enable pipeline parallelism
    if use_pipeline:
        # Pipeline context manager: traces the model so it can be split into stages
        pipelinable = PipelinableContext()
        with pipelinable:
            model = resnet50(pretrained=True)
        exec_seq = [
            "conv1",
            "bn1",
            "relu",
            "maxpool",
            "layer1",
            "layer2",
            "layer3",
            "layer4",
            "avgpool",
            (lambda x: torch.flatten(x, 1), "behind"),
            "fc",
        ]
        pipelinable.to_layer_list(exec_seq)

        # Split the model into pipeline stages; num_chunks=1 selects the plain
        # (non-interleaved) schedule, while num_chunks > 1 enables the
        # interleaved one
        model = pipelinable.partition(
            1, gpc.pipeline_parallel_size, gpc.get_local_rank(ParallelMode.PIPELINE)
        )
    else:
        model = resnet50(pretrained=True)
    ...
    ...
    # Save the model
    save_checkpoint(
        "xxx.pth",
        epoch=1,
        model=engine.model,
    )
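
As a sanity check on the partitioning, a hedged sketch (not part of the original script; it assumes it runs on every pipeline rank after pipelinable.partition, with model being this rank's stage, and builds on the imports above): summing the per-stage parameter counts over the pipeline group should reproduce the unpartitioned resnet50's total if exec_seq dropped no layers.

    # Sanity-check sketch: run on every pipeline rank after partitioning.
    import torch.distributed as dist

    def count_params(m: torch.nn.Module) -> int:
        return sum(p.numel() for p in m.parameters())

    full_count = count_params(resnet50(pretrained=False))
    stage_count = torch.tensor([count_params(model)], device="cuda")
    # Sum the stage sizes across the pipeline group only
    dist.all_reduce(stage_count, group=gpc.get_group(ParallelMode.PIPELINE))
    print(f"stages total: {int(stage_count)} vs full model: {full_count}")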

Environment

Python 3.8.8
PyTorch 1.13.1
ColossalAI 0.2.8

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
bobo0810 commented 1 year ago

Result

Different weights (screenshot)

Different outputs

dp output.shape---> torch.Size([4, 1000])
dp output---> tensor([[-0.3079,  0.1619, -1.3984,  ..., -1.7158,  0.1627,  0.1678],
        [-0.3079,  0.1619, -1.3984,  ..., -1.7158,  0.1627,  0.1678],
        [-0.3079,  0.1619, -1.3984,  ..., -1.7158,  0.1627,  0.1678],
        [-0.3079,  0.1619, -1.3984,  ..., -1.7158,  0.1627,  0.1678]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

pp output.shape---> torch.Size([4, 1000])
pp output---> tensor([[-0.0096, -0.0213,  0.0097,  ...,  0.0020,  0.0213, -0.0011],
        [-0.0096, -0.0213,  0.0097,  ...,  0.0020,  0.0213, -0.0011],
        [-0.0096, -0.0213,  0.0097,  ...,  0.0020,  0.0213, -0.0011],
        [-0.0096, -0.0213,  0.0097,  ...,  0.0020,  0.0213, -0.0011]],
       device='cuda:1', grad_fn=<CatBackward0>)
butuizd commented 1 year ago

The two training modes feed data to the model through different concrete operations, and the data batches may even be random, so the resulting parameters naturally differ.

bobo0810 commented 1 year ago

The two training modes feed data to the model through different concrete operations, and the data batches may even be random, so the resulting parameters naturally differ.

It is the same model, loaded with the same pretrained weights. Yet (1) the saved weights differ, and (2) with the same all-ones input and the model set to eval, the outputs still differ.
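
For reference, the check described above can be sketched like this (the all-ones batch shape matches the printouts later in this thread; the engine-side call is schematic, since under pipeline parallelism the forward pass is normally driven by the schedule rather than a direct call):

    import torch
    from torchvision.models import resnet50

    x = torch.ones(4, 3, 256, 256)

    # Reference output from the unpartitioned pretrained model
    origin = resnet50(pretrained=True).eval()
    with torch.no_grad():
        print("origin output --->", origin(x))

    # Schematically, inside the ColossalAI script after colossalai.initialize:
    #   engine.eval()
    #   print("engine output --->", engine(x.cuda()))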

bobo0810 commented 1 year ago

demo.zip

flybird11111 commented 1 year ago

The two training modes feed data to the model through different concrete operations, and the data batches may even be random, so the resulting parameters naturally differ.

It is the same model, loaded with the same pretrained weights. Yet (1) the saved weights differ, and (2) with the same all-ones input and the model set to eval, the outputs still differ.

How does it compare with the original model?

bobo0810 commented 1 year ago

How does it compare with the original model?

The original model's output is consistent with the data-parallel output.

# origin model's output
batch=(4, 3, 256, 256)
tensor([[-0.3087,  0.1614, -1.3962,  ..., -1.7141,  0.1599,  0.1707],
        [-0.3087,  0.1614, -1.3962,  ..., -1.7141,  0.1599,  0.1707],
        [-0.3087,  0.1614, -1.3962,  ..., -1.7141,  0.1599,  0.1707],
        [-0.3087,  0.1614, -1.3962,  ..., -1.7141,  0.1599,  0.1707]],
       grad_fn=<AddmmBackward0>)

# DP model's output
dp output.shape---> torch.Size([4, 1000])
dp output---> tensor([[-0.3079,  0.1619, -1.3984,  ..., -1.7158,  0.1627,  0.1678],
        [-0.3079,  0.1619, -1.3984,  ..., -1.7158,  0.1627,  0.1678],
        [-0.3079,  0.1619, -1.3984,  ..., -1.7158,  0.1627,  0.1678],
        [-0.3079,  0.1619, -1.3984,  ..., -1.7158,  0.1627,  0.1678]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

# PP model's output
pp output.shape---> torch.Size([4, 1000])
pp output---> tensor([[-0.0096, -0.0213,  0.0097,  ...,  0.0020,  0.0213, -0.0011],
        [-0.0096, -0.0213,  0.0097,  ...,  0.0020,  0.0213, -0.0011],
        [-0.0096, -0.0213,  0.0097,  ...,  0.0020,  0.0213, -0.0011],
        [-0.0096, -0.0213,  0.0097,  ...,  0.0020,  0.0213, -0.0011]],
       device='cuda:1', grad_fn=<CatBackward0>)
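
Comparing the leading entries of the printouts above makes the gap concrete (values copied from the outputs; shown here purely as arithmetic):

    import torch

    origin = torch.tensor([-0.3087, 0.1614, -1.3962])
    dp = torch.tensor([-0.3079, 0.1619, -1.3984])
    pp = torch.tensor([-0.0096, -0.0213, 0.0097])

    print((origin - dp).abs().max())  # ~2e-3: numerical noise across devices
    print((origin - pp).abs().max())  # ~1.4: effectively a different model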
flybird11111 commented 1 year ago

How does it compare with the original model?

The original model's output is consistent with the data-parallel output.


The output of "PP" only uses the "PP" strategy, doesn't it? In addition, the pipeline parallelism strategy is currently being refactored, and we will address this issue as well.

bobo0810 commented 1 year ago

The output of "PP" only uses the "PP" strategy, doesn't it? In addition, the pipeline parallelism strategy is currently being refactored, and we will address this issue as well.

Yes. My understanding is that pipeline parallelism does not currently load the pretrained weights properly.
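
If so, one possible workaround is to copy the matching pretrained tensors into the local stage by hand after partitioning. A hedged sketch follows; it assumes the partitioned stage keeps torchvision's parameter names, which PipelinableContext may not guarantee, so the key overlap should be verified before relying on it:

    import torch
    from torchvision.models import resnet50

    # `model` is this rank's pipeline stage returned by pipelinable.partition()
    pretrained = resnet50(pretrained=True).state_dict()
    stage_state = model.state_dict()

    # Copy only tensors whose name and shape both match; if partitioning
    # renamed the parameters, `matched` stays empty and the assumption fails.
    matched = {k: v for k, v in pretrained.items()
               if k in stage_state and v.shape == stage_state[k].shape}
    stage_state.update(matched)
    model.load_state_dict(stage_state)
    print(f"copied {len(matched)}/{len(stage_state)} tensors into this stage")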