bobo0810 opened this issue 1 year ago
The weights differ, and the outputs differ:
```
dp output.shape---> torch.Size([4, 1000])
dp output---> tensor([[-0.3079, 0.1619, -1.3984, ..., -1.7158, 0.1627, 0.1678],
        [-0.3079, 0.1619, -1.3984, ..., -1.7158, 0.1627, 0.1678],
        [-0.3079, 0.1619, -1.3984, ..., -1.7158, 0.1627, 0.1678],
        [-0.3079, 0.1619, -1.3984, ..., -1.7158, 0.1627, 0.1678]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
pp output.shape---> torch.Size([4, 1000])
pp output---> tensor([[-0.0096, -0.0213, 0.0097, ..., 0.0020, 0.0213, -0.0011],
        [-0.0096, -0.0213, 0.0097, ..., 0.0020, 0.0213, -0.0011],
        [-0.0096, -0.0213, 0.0097, ..., 0.0020, 0.0213, -0.0011],
        [-0.0096, -0.0213, 0.0097, ..., 0.0020, 0.0213, -0.0011]],
       device='cuda:1', grad_fn=<CatBackward0>)
```
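For reference, a minimal sketch of the kind of comparison being described, in plain PyTorch. The actual model and launch code are not shown in the thread, so `resnet50` and the `DataParallel` wrapper here are stand-ins (the `[4, 1000]` output and `(4, 3, 256, 256)` batch match an ImageNet-style classifier):

```python
import torch
import torchvision

# Stand-in model; the issue's actual model is not shown in the thread.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").cuda().eval()
dp_model = torch.nn.DataParallel(model)

# All-ones batch, as described later in the thread.
x = torch.ones(4, 3, 256, 256, device="cuda")

with torch.no_grad():
    out_plain = model(x)
    out_dp = dp_model(x)

# DP should match the plain model up to small floating-point drift.
print(torch.allclose(out_plain, out_dp, atol=1e-2))
```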
The two training modes feed data to the model in different ways, and the data batches may even be randomized, so the resulting parameters naturally differ.
It is the same model, loaded with the same pre-trained weights. However:
(1) the saved weights differ;
(2) with the same all-ones input and the model set to eval mode, the outputs still differ (a way to check point (1) is sketched below).
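A hedged way to check point (1), assuming both runs saved a plain state dict with `torch.save` (the file names are hypothetical, and a PP checkpoint may shard or rename parameters, which would itself be a symptom):

```python
import torch

# Hypothetical checkpoint paths from the DP and PP runs.
sd_dp = torch.load("dp_checkpoint.pth", map_location="cpu")
sd_pp = torch.load("pp_checkpoint.pth", map_location="cpu")

# PP partitioning may rename or shard parameters, so check the keys first.
if sd_dp.keys() != sd_pp.keys():
    print("parameter names differ:", sd_dp.keys() ^ sd_pp.keys())

# Report any parameter whose values diverge between the two checkpoints.
for name in sd_dp.keys() & sd_pp.keys():
    if not torch.equal(sd_dp[name], sd_pp[name]):
        diff = (sd_dp[name].float() - sd_pp[name].float()).abs().max().item()
        print(f"{name}: max abs diff {diff:.6g}")
```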
How does it compare with the original model?
The original model's output is consistent with the data-parallel output:
```
# original model's output
batch=(4, 3, 256, 256)
tensor([[-0.3087, 0.1614, -1.3962, ..., -1.7141, 0.1599, 0.1707],
        [-0.3087, 0.1614, -1.3962, ..., -1.7141, 0.1599, 0.1707],
        [-0.3087, 0.1614, -1.3962, ..., -1.7141, 0.1599, 0.1707],
        [-0.3087, 0.1614, -1.3962, ..., -1.7141, 0.1599, 0.1707]],
       grad_fn=<AddmmBackward0>)

# DP model's output
dp output.shape---> torch.Size([4, 1000])
dp output---> tensor([[-0.3079, 0.1619, -1.3984, ..., -1.7158, 0.1627, 0.1678],
        [-0.3079, 0.1619, -1.3984, ..., -1.7158, 0.1627, 0.1678],
        [-0.3079, 0.1619, -1.3984, ..., -1.7158, 0.1627, 0.1678],
        [-0.3079, 0.1619, -1.3984, ..., -1.7158, 0.1627, 0.1678]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

# PP model's output
pp output.shape---> torch.Size([4, 1000])
pp output---> tensor([[-0.0096, -0.0213, 0.0097, ..., 0.0020, 0.0213, -0.0011],
        [-0.0096, -0.0213, 0.0097, ..., 0.0020, 0.0213, -0.0011],
        [-0.0096, -0.0213, 0.0097, ..., 0.0020, 0.0213, -0.0011],
        [-0.0096, -0.0213, 0.0097, ..., 0.0020, 0.0213, -0.0011]],
       device='cuda:1', grad_fn=<CatBackward0>)
```
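The original and DP values above differ only in the third decimal place (e.g. -0.3087 vs -0.3079), which is ordinary floating-point drift from DP's scatter/gather, while the PP values are on a completely different scale. A small helper to quantify that, with the three output tensors above (the variable names are placeholders):

```python
import torch

def max_abs_diff(a: torch.Tensor, b: torch.Tensor) -> float:
    """Largest element-wise gap between two output tensors, compared on CPU."""
    return (a.detach().cpu() - b.detach().cpu()).abs().max().item()

# Usage with the tensors dumped above (placeholder names):
# max_abs_diff(out_origin, out_dp)  -> small (~1e-2): expected DP drift
# max_abs_diff(out_origin, out_pp)  -> large: the PP weights are different
```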
The output of "PP" only uses the "PP" strategy, doesn't it? In addition, the pipeline parallelism strategy is currently being refactored, and we will address this issue as well.
The output of "PP" only uses the "PP" strategy, doesn't it? In addition, the pipeline parallelism strategy is currently being refactored, and we will address this issue as well.
Yes. My understanding is that pipeline parallelism currently does not load the pre-trained weights properly.
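One hedged way to confirm that suspicion is to compare the parameters a pipeline stage actually holds against the checkpoint they were supposed to come from. A sketch, assuming the stage keeps the original parameter names (ColossalAI's partitioning may rename or shard them, in which case a name-mapping step is needed first):

```python
import torch
from torch import nn

def check_loaded(stage: nn.Module, ckpt_path: str) -> None:
    """Report, per parameter, whether a pipeline stage matches a checkpoint."""
    reference = torch.load(ckpt_path, map_location="cpu")
    for name, param in stage.named_parameters():
        if name not in reference:
            print(f"{name}: not in checkpoint (possibly renamed by partitioning)")
        elif torch.allclose(param.detach().cpu(), reference[name]):
            print(f"{name}: matches checkpoint")
        else:
            print(f"{name}: differs -> likely not loaded from the checkpoint")

# Usage (placeholder names): check_loaded(my_pp_stage, "pretrained.pth")
```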
🐛 Describe the bug
Code
Refer to this code https://github.com/hpcaitech/ColossalAI/blob/02192a632e6c6f965d93ec79937f97e10e121307/examples/tutorial/hybrid_parallel/train.py#L73
My code
Environment
Python 3.8.8
Torch 1.13.1
colossalai 0.2.8