hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: GeminiAdamOptimizer, can't load saved optimizer correctly #3580

Open shileims opened 1 year ago

shileims commented 1 year ago

πŸ› Describe the bug

Hi ColossalAI team, I am trying to use ColossalAI to fine-tune Stable Diffusion. In my code the optimizer is a GeminiAdamOptimizer. I define and save it with the following code:

optimizer = GeminiAdamOptimizer(unet, lr=args.learning_rate, initial_scale=2**5, clipping_norm=args.max_grad_norm)
checkpoint = {'optimizer': optimizer.state_dict(), 'lr_scheduler': lr_scheduler.state_dict()}
torch.save(checkpoint, 'optimizer_lr_scheduler.pth')

Afterwards I try to load it with the following code:

optimizer = GeminiAdamOptimizer(unet, lr=args.learning_rate, initial_scale=2**5, clipping_norm=args.max_grad_norm)
model_dict = torch.load('optimizer_lr_scheduler.pth')
optimizer_dict = model_dict['optimizer']
optimizer.optim.load_state_dict(optimizer_dict)

It fails with the error: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

To investigate the error, I printed each parameter group in "param_groups" of optimizer_dict. The "params" entries in "param_groups" differ between the saved dict and the freshly defined optimizer; they are also not fixed and differ across GPUs. The printouts from different GPUs are below (each GPU is one process); the left value comes from the saved optimizer dict I loaded, the right value from the optimizer defined in the training code:

GPU1: lr 5e-06 vs 3.4928697265869516e-06 betas (0.9, 0.999) vs (0.9, 0.999) eps 1e-08 vs 1e-08 weight_decay 0 vs 0 bias_correction True vs True params [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] vs [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57]

GPU2: lr 5e-06 vs 3.4928697265869516e-06 betas (0.9, 0.999) vs (0.9, 0.999) eps 1e-08 vs 1e-08 weight_decay 0 vs 0 bias_correction True vs True params [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] vs [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73]

Environment

Python 3.8, PyTorch 1.9.1 + CUDA 10.2

JThh commented 1 year ago

Can you try out our new checkpoint io function?

shileims commented 1 year ago

Hi @JThh, thanks for your reply. How do I save and load the lr_scheduler? The example you pointed me to doesn't seem to include an lr_scheduler example. Thanks!

JThh commented 1 year ago

I think for the lr scheduler you can use torch.save to save it separately!
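For illustration, a minimal sketch of that suggestion (the file name and the rank-0 guard are assumptions, not something specified in this thread):

import torch
import torch.distributed as dist

# `lr_scheduler` is the scheduler object already built in the training script
if dist.get_rank() == 0:
    torch.save(lr_scheduler.state_dict(), 'lr_scheduler.pth')
dist.barrier()

# when resuming: rebuild the scheduler exactly as in training, then restore its state
lr_scheduler.load_state_dict(torch.load('lr_scheduler.pth'))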

shileims commented 1 year ago

Hi @JThh, thanks for your reply. Another question: the code you pointed me to loads the saved optimizer only in the main process (local_rank == 0), but I think for distributed training the saved optimizer should be loaded by every process. If I am wrong, could you point out my mistake? Any help would be appreciated, thanks!

JThh commented 1 year ago

Hi, I have replied to you at #2462.

shileims commented 1 year ago

Hi @JThh, I really appreciate your help, thanks!

shileims commented 1 year ago

Hi @JThh, I used the following code to save the optimizer, but it fails with a timeout error:

Code:

rank = dist.get_rank()

mapping = dict()
optim_state = optimizer.state_dict()
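# record each optimizer-state ColoTensor's dist_spec, then gather it so the full tensor is available for saving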
for k, v in optim_state['state'].items():
    for n, t in v.items():
        if isinstance(t, ColoTensor):
            mapping[(k, n)] = t.dist_spec
            gather_tensor(t)

if rank == 0:
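    # only rank 0 writes the gathered state to disk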
    save_state = {'optimizer': optim_state}
    if file_name == None:
        torch.save(save_state, f'{output_dir}/{epoch}_{-1}_optimizer.pth')
    else:
        torch.save(save_state, file_name)

    # put the gathered tensors back to their original dist specs
    for k, v in optimizer.state_dict()['state'].items():
        for n, t in v.items():
            if isinstance(t, ColoTensor):
                assert hasattr(t, 'save_ready')
                t.set_dist_spec(mapping[(k, n)])
                delattr(t, 'save_ready')

del optim_state
del mapping
dist.barrier()

Error:

INFO colossalai - colossalai - INFO: Saving optimizer
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800284 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800284 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800136 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800211 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800241 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800211 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800087 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800136 milliseconds before timing out.
  what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800241 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800211 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800284 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800211 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800087 milliseconds before timing out.

Any help would be appreciated, thank you so much!

JThh commented 1 year ago

Please try syncing the processes via dist.barrier() at the very start. Meanwhile, is your optimizer a member of ColossalaiOptimizer?
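For reference, a minimal sketch of both suggestions (the import path for ColossalaiOptimizer is an assumption and may differ between versions):

import torch.distributed as dist
from colossalai.nn.optimizer import ColossalaiOptimizer  # assumed path, may vary across versions

dist.barrier()  # sync all ranks before starting the gather/save sequence
print(isinstance(optimizer, ColossalaiOptimizer))  # check the wrapper type of the thread's `optimizer`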

shileims commented 1 year ago

Hi @JThh, thanks for your reply. I define the optimizer with:

from colossalai.nn.optimizer.gemini_optimizer import GeminiAdamOptimizer

So I think it is a member of colossalai.nn.optimizer.

shileims commented 1 year ago

Hi @JThh, while loading the saved optimizer, I get the following error:

Error: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

Code:

def load_optimizer_checkpoint(output_dir, epoch, optimizer, file_name=None):
    dist.barrier()
    rank = dist.get_rank()

    # record each optimizer-state ColoTensor's dist_spec, then gather it
    mapping = dict()
    for k, v in optimizer.state_dict()['state'].items():
        for n, t in v.items():
            if isinstance(t, ColoTensor):
                mapping[(k, n)] = t.dist_spec
                gather_tensor(t)

    # only rank 0 reads the checkpoint and loads it into the gathered state
    if rank == 0:
        if file_name == None:
            colo_checkpoint = torch.load(f'{output_dir}/{epoch}_{-1}_optimizer.pth')
        else:
            colo_checkpoint = torch.load(file_name)
        optimizer.load_state_dict(colo_checkpoint['optimizer'])
    dist.barrier()

    # scatter the loaded state back to every rank with the original dist specs
    for k, v in optimizer.state_dict()['state'].items():
        for n, t in v.items():
            if isinstance(t, ColoTensor):
                scatter_tensor(t, mapping[(k, n)])

    del mapping

Any suggestion would be appreciated!

JThh commented 1 year ago

Can you print the contents of your optimizer before saving?

print("Optimizer's state_dict:")
if rank == 0:
    print(optim_state)

And compare it with the saved optimizer state?

if rank == 0:
    colo_checkpoint = torch.load(file_name)
    print(colo_checkpoint['optimizer'])

shileims commented 1 year ago

Hi @JThh, thanks for your reply. The info is the same as what I listed at the beginning of the issue:

GPU1: saved optimizer['params'] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] defined optimizer['params']: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57]

GPU2: saved optimizer['params']: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] defined optimizer['params']: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73]

Any suggestion would be appreciated, thanks!
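A small hypothetical helper along these lines prints the per-rank sizes behind the mismatch above; PyTorch raises this ValueError exactly when a param group's "params" list has a different length in the checkpoint than in the live optimizer:

import torch.distributed as dist

def compare_param_groups(optimizer, saved_optimizer_state):
    # print, for every rank, how many params each group holds in the live
    # optimizer versus in the loaded checkpoint
    rank = dist.get_rank()
    live_groups = optimizer.state_dict()['param_groups']
    saved_groups = saved_optimizer_state['param_groups']
    for i, (live, saved) in enumerate(zip(live_groups, saved_groups)):
        print(f"rank {rank}, group {i}: live={len(live['params'])}, saved={len(saved['params'])}")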

shileims commented 1 year ago

Hi @JThh, sorry for the late reply. I found that saving and loading the optimizer sometimes works and sometimes doesn't. When I train the model on 16 GPUs, saving and loading the optimizer succeeds, but when I save the optimizer on 256 GPUs it fails. The following error happens while saving the optimizer:

(screenshot of the error)

Any suggestion would be appreciated.
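One knob that might be related (a plain PyTorch sketch, assuming the process group is created directly with torch.distributed rather than through colossalai.launch): the failing all-gather hits the default 30-minute NCCL watchdog timeout shown in the log, which can be raised when the process group is initialized:

import datetime
import torch.distributed as dist

# raise the NCCL collective timeout from the default 30 minutes to 2 hours
dist.init_process_group(backend='nccl', timeout=datetime.timedelta(hours=2))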