huanngzh / Parts2Whole

[Arxiv 2024] From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation
https://huanngzh.github.io/Parts2Whole/
MIT License

Problem when resuming from the previous checkpoints #12

Open LIAGM opened 2 months ago

LIAGM commented 2 months ago

Hi,

I ran into a problem when trying to resume training from a checkpoint.

Training always hangs after it is resumed from the checkpoint.

For example, this is what happens when I resume from my checkpoint-6300 with 8 GPUs.

At first, after loading the checkpoint, the training process skips iterations until it reaches resume_step:

image
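
The skipping behaviour described above can be sketched roughly as follows. This is only a minimal illustration, not the repository's actual training loop: the helper names `split_global_step` and `steps_to_run` and the value of 1000 steps per epoch are hypothetical, and gradient accumulation is ignored.

```python
def split_global_step(global_step, num_steps_per_epoch):
    """Split a checkpoint's global step into (first_epoch, resume_step).

    Hypothetical helper; assumes one optimizer step per batch.
    """
    return divmod(global_step, num_steps_per_epoch)


def steps_to_run(num_steps_per_epoch, first_epoch, num_epochs, resume_step):
    """Yield the (epoch, step) pairs that are actually trained after a resume.

    In the first resumed epoch, steps below resume_step are iterated over
    but skipped; every later epoch runs in full.
    """
    for epoch in range(first_epoch, num_epochs):
        for step in range(num_steps_per_epoch):
            if epoch == first_epoch and step < resume_step:
                continue  # already trained before the checkpoint was saved
            yield epoch, step


# Assumed numbers for illustration only: checkpoint-6300, 1000 steps per epoch.
first_epoch, resume_step = split_global_step(6300, 1000)
first_trained = next(steps_to_run(1000, first_epoch, first_epoch + 1, resume_step))
```

With these assumed numbers, the loop passes over 300 batches without training on them before the first real step runs, which matches the "skipped iterations" phase seen right after resuming.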

However, once it reaches the resume step, the training process hangs, as shown in the following screenshot:

image

Here is another example, where I just let the training program keep running and eventually got these errors:

image

Have you encountered this problem, or do you have any idea what might cause it?

Thanks for your time and help!