shileims opened this issue 1 year ago
Can you try out this checkpoint?
Can you try out our new checkpoint I/O function?
Hi @JThh, thanks for your reply. I would like to ask how to save and load the lr_scheduler. I think the example you pointed to doesn't include an lr_scheduler example. Thanks!
I think for the lr scheduler you may use torch.save to save it separately!
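For example, something along these lines (an untested sketch, assuming a standard PyTorch scheduler named lr_scheduler and reusing the output_dir/epoch naming that appears later in this thread):

    import torch
    import torch.distributed as dist

    # Save on rank 0 only; every rank holds the same scheduler state.
    if dist.get_rank() == 0:
        torch.save(lr_scheduler.state_dict(), f'{output_dir}/{epoch}_lr_scheduler.pth')
    dist.barrier()

    # Later, every rank restores the scheduler from the same file.
    state = torch.load(f'{output_dir}/{epoch}_lr_scheduler.pth', map_location='cpu')
    lr_scheduler.load_state_dict(state)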
Hi @JThh, thanks for your reply. Another question: the code you pointed me to shows that the saved optimizer is loaded only by the main process (local_rank == 0). But I think for distributed training, the saved optimizer should be loaded by each process. If I am wrong, would you point out what mistake I made? Any help would be appreciated, thanks!
Hi, I have replied to you at #2462 .
Hi @JThh, I really appreciate your help, thanks!
Hi @JThh, I used the following code to save the optimizer, but it shows a timeout error:
Code:

    rank = dist.get_rank()
    mapping = dict()
    optim_state = optimizer.state_dict()
    # Record each ColoTensor's dist spec, then gather it to rank 0 for saving.
    for k, v in optim_state['state'].items():
        for n, t in v.items():
            if isinstance(t, ColoTensor):
                mapping[(k, n)] = t.dist_spec
                gather_tensor(t)
    if rank == 0:
        save_state = {'optimizer': optim_state}
        if file_name is None:
            torch.save(save_state, f'{output_dir}/{epoch}_{-1}_optimizer.pth')
        else:
            torch.save(save_state, file_name)
        # Restore the original dist specs on rank 0 after saving.
        for k, v in optimizer.state_dict()['state'].items():
            for n, t in v.items():
                if isinstance(t, ColoTensor):
                    assert hasattr(t, 'save_ready')
                    t.set_dist_spec(mapping[(k, n)])
                    delattr(t, 'save_ready')
    del optim_state
    del mapping
    dist.barrier()
Error:
INFO colossalai - colossalai - INFO: Saving optimizer
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800284 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800284 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800136 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800211 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800241 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800211 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800087 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800136 milliseconds before timing out.
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800241 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800211 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800284 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800211 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800087 milliseconds before timing out.
Any help would be appreciated, thank you so much!
Please try syncing the processes via dist.barrier() at the very start. Meanwhile, is your optimizer a member of ColossalaiOptimizer?
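For example, something like this at the very top of the save routine (an untested sketch against your snippet above):

    import torch.distributed as dist

    # Sync all ranks before the gather_tensor() loop runs any collectives,
    # so a slow rank cannot make the others time out.
    dist.barrier()
    rank = dist.get_rank()  # then continue exactly as in your snippet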
Hi @JThh, thanks for your reply. Actually, I used the following function to define the optimizer:
So I think it is a member of colossalai.nn.optimizer.
Hi @JThh, while loading the saved optimizer, it shows the following error:
Error: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
Code:

    def load_optimizer_checkpoint(output_dir, epoch, optimizer, file_name=None):
        dist.barrier()
        rank = dist.get_rank()
        mapping = dict()
        # Record each ColoTensor's dist spec and gather it to rank 0.
        for k, v in optimizer.state_dict()['state'].items():
            for n, t in v.items():
                if isinstance(t, ColoTensor):
                    mapping[(k, n)] = t.dist_spec
                    gather_tensor(t)
        if rank == 0:
            if file_name is None:
                colo_checkpoint = torch.load(f'{output_dir}/{epoch}_{-1}_optimizer.pth')
            else:
                colo_checkpoint = torch.load(file_name)
            optimizer.load_state_dict(colo_checkpoint['optimizer'])
        dist.barrier()
        # Scatter the loaded tensors back to their original dist specs.
        for k, v in optimizer.state_dict()['state'].items():
            for n, t in v.items():
                if isinstance(t, ColoTensor):
                    scatter_tensor(t, mapping[(k, n)])
        del mapping
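For reference, I call this function on every rank together, since it contains barriers and collective gather/scatter (the argument values here are just placeholders):

    # All ranks must enter the call; only rank 0 actually reads the file.
    load_optimizer_checkpoint(output_dir='checkpoints', epoch=10, optimizer=optimizer)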
Any suggestion would be appreciated!
Can you print the contents of your optimizer before saving?
print("Optimizer's state_dict:")
if rank == 0:
print(optim_state)
And compare it with the saved optimizer state?
    if rank == 0:
        colo_checkpoint = torch.load(file_name)
        print(colo_checkpoint['optimizer'])
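Or, something like this (reusing rank, file_name, and optimizer from your snippets; again just a sketch) to compare only the param_groups sizes, which is where the mismatch shows up:

    if rank == 0:
        saved = torch.load(file_name)['optimizer']
        current = optimizer.state_dict()
        # Compare how many parameter ids each group holds in the checkpoint
        # versus in the freshly constructed optimizer.
        for i, (sg, cg) in enumerate(zip(saved['param_groups'], current['param_groups'])):
            print(f"group {i}: saved {len(sg['params'])} vs current {len(cg['params'])} params")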
Hi @JThh, thanks for your reply. The info is the same as what I listed at the beginning for this error:
GPU1:
saved optimizer['params']: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
defined optimizer['params']: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57]
GPU2:
saved optimizer['params']: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
defined optimizer['params']: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73]
Any suggestion would be appreciated, thanks!
Hi @JThh, sorry for the late reply. I found that saving and loading the optimizer sometimes works and sometimes doesn't. Normally, if I use 16 GPUs to train the model, the optimizer is saved and loaded successfully. But when I use 256 GPUs to save the optimizer, it fails. The following error happens while saving the optimizer:
Any suggestion would be appreciated.
🐛 Describe the bug
Hi ColossalAI team, I am trying to use ColossalAI to fine-tune Stable Diffusion. In the code, the optimizer is defined as GeminiAdamOptimizer. I used the following code to define and save the optimizer:
After that, I try to load it using the following code:
This raises the error: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
Regarding the error, I tried to print each parameter in "param_groups" of the optimizer dict. I found that the "params" entries in "param_groups" differ: they are not fixed and differ across GPUs. The following is the printed info from different GPUs (each GPU is one process). The left side is the info loaded from the saved optimizer dict, and the right side is the info from the optimizer defined in the training code.
GPU1 (saved vs defined):
lr: 5e-06 vs 3.4928697265869516e-06
betas: (0.9, 0.999) vs (0.9, 0.999)
eps: 1e-08 vs 1e-08
weight_decay: 0 vs 0
bias_correction: True vs True
params: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] vs [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57]
GPU2 (saved vs defined):
lr: 5e-06 vs 3.4928697265869516e-06
betas: (0.9, 0.999) vs (0.9, 0.999)
eps: 1e-08 vs 1e-08
weight_decay: 0 vs 0
bias_correction: True vs True
params: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] vs [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73]
Environment
Python 3.8, PyTorch 1.9.1 (CUDA 10.2)