microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License
19.62k stars 2.5k forks

[Kosmos-2] Hanging at the end of training #1250

Open JierunChen opened 1 year ago

JierunChen commented 1 year ago

I ran the training code for 1 update. The process hangs and does not exit after showing the message "INFO:fairseq_cli.train:done training in 117.2 seconds". Any idea how to address this issue?

pengzhiliang commented 1 year ago

Hi, @JierunChen. It seems that you loaded our released ckpt, so the total/max update steps already match those stored in the ckpt, and the code skips the training phase directly.
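The skip described above can be sketched as follows. This is a hedged, minimal illustration (not the actual fairseq source): the function name `should_train` and the numbers are hypothetical, but the idea matches fairseq's behavior of stopping once the update budget from the checkpoint is exhausted.

```python
# Minimal sketch of why training is skipped when resuming from a
# finished checkpoint: if the checkpoint already stores an update count
# at or beyond max_update, the training loop body never runs and
# "done training" is logged almost immediately.

def should_train(num_updates_from_ckpt: int, max_update: int) -> bool:
    """Return True only if there is remaining update budget."""
    return num_updates_from_ckpt < max_update

if __name__ == "__main__":
    # Fresh run: budget remains, training proceeds.
    print(should_train(0, 50000))       # training proceeds
    # Resuming a released ckpt already trained to max_update:
    print(should_train(50000, 50000))   # training phase skipped
```

To actually continue training from a released checkpoint, the configured max update count has to exceed the update count already stored in the checkpoint.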

JierunChen commented 1 year ago

@pengzhiliang Hi, you are right about that. But my question is about the program hanging at the end of training. For example, when training finished with the output "INFO:fairseq_cli.train:done training in 82265.6 seconds", the program did not exit and continued to occupy the computing resources.
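One common cause of this symptom (a hedged diagnostic sketch, not part of fairseq) is that a Python process can only exit once every non-daemon thread has finished, so lingering dataloader workers or a distributed backend can keep the process alive after "done training" is logged. The helper name `lingering_threads` below is hypothetical; it simply lists what might be holding the interpreter open:

```python
# Hypothetical diagnostic helper: list live non-daemon threads that
# would prevent the Python interpreter from exiting normally.
import threading

def lingering_threads() -> list[str]:
    """Names of live non-daemon threads other than the main thread."""
    return [
        t.name
        for t in threading.enumerate()
        if t is not threading.main_thread() and not t.daemon
    ]
```

If such threads show up at the end of training, a common workaround is to force termination with `os._exit(0)` after the final log line; when distributed training is in use, cleanly calling `torch.distributed.destroy_process_group()` before exiting is the preferable first step.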