openai / guided-diffusion

MIT License
6.06k stars 807 forks

resume training does not work for multi-gpus training #23

Closed forever208 closed 2 years ago

forever208 commented 2 years ago

I add --resume_checkpoint $path_to_checkpoint$ to continue the training. It works on a single GPU, but does not work with multiple GPUs.

the code gets stuck here:

Logging to /proj/ihorse_2021/users/x_manni/guided-diffusion/log9
creating model and diffusion...
creating data loader...
start training...
loading model from checkpoint: /proj/ihorse_2021/users/x_manni/guided-diffusion/log9/model200000.pt...

VigneshSrinivasan10 commented 2 years ago

@forever208 I have the same problem and the code gets stuck forever.

On further investigation, I found that the test script image_sample.py also reloads the model, but with one difference: the test script reloads the model before placing it on CUDA, whereas the training script already has the model on CUDA, and this leads to the hang. Upon debugging, I found the code gets stuck on this line in dist_util.py: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/dist_util.py#L67

It is not clear to me why this fails. Any pointers in fixing this problem would be greatly appreciated. Thanks in advance.

bahjat-kawar commented 2 years ago

@VigneshSrinivasan10 Not sure if it's the same issue, but I also found the code stopping at the same line when running classifier_train.py. I fixed it by moving load_state_dict out of the if dist.get_rank() == 0 block on line 51. Hope this helps.
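The deadlock pattern behind this fix can be sketched in miniature: if the checkpoint load happens only on rank 0 but involves a collective operation, every other rank waits forever. This is an illustrative simulation with made-up function names, not the actual guided-diffusion code:

```python
# Minimal simulation of the rank-0-only deadlock described above.
# Hypothetical names; the real code lives in train_util.py/dist_util.py.

def load_checkpoint_buggy(rank, load_state_dict):
    # BUG: ranks other than 0 skip the load. If load_state_dict
    # performs a collective op (e.g. a parameter broadcast), the
    # other ranks never join it and every process hangs.
    if rank == 0:
        return load_state_dict()
    return None

def load_checkpoint_fixed(rank, load_state_dict):
    # FIX: every rank calls the load, so all ranks participate
    # in any collective communication inside it.
    return load_state_dict()
```

With the fix, a non-zero rank receives the weights as well, instead of skipping the call and stalling the collective.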

VigneshSrinivasan10 commented 2 years ago

@bahjat-kawar Thanks for the tip, and sorry for the delay in my response. Your suggestion fixed the problem.

VigneshSrinivasan10 commented 2 years ago

@bahjat-kawar Although the model now reloads successfully, I still see the loss go to NaN after a few iterations of resumed training. All three .pt files were reloaded, but the issue persists. I assumed the opt.pt file holds the optimizer state, which should allow training to continue smoothly.

Did you also face this issue?

JiamingLiu-Jeremy commented 2 years ago

@VigneshSrinivasan10 I met a similar problem. Any progress on fixing this issue?

forever208 commented 2 years ago

@VigneshSrinivasan10 I met a similar problem. Any progress on fixing this issue?

Solution: remove the if dist.get_rank() == 0 check in train_util.py when loading checkpoints, because each GPU needs to load the checkpoint.

ONobody commented 1 year ago

@forever208 Hello, do you pass the opt, model, or ema .pt file as resume_checkpoint? Or do you put all of their .pt files in a folder and use that folder as the resume_checkpoint path?

forever208 commented 1 year ago

@ONobody I use the model .pt to resume training (both the ema and opt checkpoints will be loaded alongside it), and the ema .pt for sampling.

ONobody commented 1 year ago

@forever208 So to continue training, I run something like python image_train.py --resume_checkpoint path/modelXX.pt? Thank you very much.

forever208 commented 1 year ago

@ONobody Exactly. If you have further trouble, take a look at this pull request: Fix resumed model training for Multi-GPUs

ONobody commented 1 year ago

@forever208 Thank you very much.

ONobody commented 1 year ago

@forever208 Hello, I would like to ask how to train classifier guidance on my own dataset. Do I need to change any code? I keep running into errors.

forever208 commented 1 year ago

@ONobody I have no experience with classifier guidance, so I'm afraid I can't help you in this case.

ONobody commented 1 year ago

@forever208 What about computing FID, IS, and other evaluation metrics? I don't know how to calculate them.

forever208 commented 1 year ago

@ONobody the authors provide instructions here: https://github.com/openai/guided-diffusion/tree/main/evaluations

ONobody commented 1 year ago

@forever208 The diffusion model I trained is on my own dataset. How do I evaluate it? Thank you.

forever208 commented 1 year ago

@ONobody if your dataset has only one class, you can randomly draw 50k samples from it to form the reference batch. Then generate 50k samples with your trained model and compute FID by running:

$ python evaluator.py reference_batch.npz 50k_samples.npz

If your dataset has more than one class, you'd better use the whole training set as the reference batch.

Remember to convert your data into the .npz format.
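As a rough sketch of that conversion step (the "arr_0" key and the (N, H, W, 3) uint8 layout are assumptions based on NumPy's default .npz behavior; check the evaluations README for the exact format evaluator.py expects):

```python
import numpy as np

# Pack a batch of RGB images into an .npz file. np.savez stores
# positional arrays under "arr_0", "arr_1", ... by default.
def save_batch_npz(images, path):
    arr = np.asarray(images, dtype=np.uint8)  # shape (N, H, W, 3)
    np.savez(path, arr)

# Example: 4 random 256x256 RGB images standing in for real samples
batch = np.random.randint(0, 256, size=(4, 256, 256, 3), dtype=np.uint8)
save_batch_npz(batch, "reference_batch.npz")
```

The same helper would be used for both the reference batch and the generated samples before passing them to evaluator.py.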

ONobody commented 1 year ago

@forever208 The images in my dataset are not 256x256, but the images I generate are 256x256. Do I need to resize my dataset images to 256? Thank you.

forever208 commented 1 year ago

@ONobody you have to keep them the same size. For example, your training data must be resized to 256x256 for training; the model then generates 256x256 samples.
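A minimal sketch of that resize step, assuming Pillow is available (the directory layout and function name are placeholders, not part of the repo):

```python
import os
from PIL import Image

# Resize every image in src_dir to size x size and write it to
# dst_dir, so the training data matches the model's output size.
def resize_dataset(src_dir, dst_dir, size=256):
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, name)).convert("RGB")
        img = img.resize((size, size), Image.BICUBIC)
        img.save(os.path.join(dst_dir, name))
```

Run this once over the raw dataset before training, then point image_train.py at the resized copy.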

ONobody commented 1 year ago

@forever208 So when evaluating, I convert my own dataset to 256x256 and then to .npz format, right?

forever208 commented 1 year ago

@ONobody convert the training data to 256x256 --> train the model --> sample 50k images (256x256) from the model --> convert both the reference batch (256x256) and the 50k samples (256x256) into .npz files --> compute FID

ONobody commented 1 year ago

@forever208 Thank you very much. When I convert my dataset to .npz format and then compute FID, I get an error (screenshot attached). Is my conversion wrong?

open11012 commented 5 months ago

@VigneshSrinivasan10 Not sure if it's the same issue, but I also found the code stopping at the same line when running classifier_train.py. I fixed it by moving load_state_dict out of the if dist.get_rank() == 0 block on line 51. Hope this helps.

Thanks! I met the same problem, and this fix works in my code!