jimmysue closed this issue 11 months ago
Not sure if this is the source of your error, but your Linux kernel is relatively old, which can lead to the process hanging, as this warning shows:
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Is it possible for you to upgrade the kernel to a more recent version?
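To confirm which kernel the training process actually sees, a minimal sketch along these lines could be run on each node. The 5.5.0 threshold is taken from the warning quoted above; the helper names `parse_kernel_version` and `is_below_minimum` are hypothetical and are not part of any library. Note that a container shares its host's kernel, so the value reported inside the container is the host's, and any upgrade would have to happen on the host.

```python
import platform
import re

# Minimum version quoted in the warning above.
MIN_KERNEL = (5, 5, 0)


def parse_kernel_version(release: str) -> tuple:
    """Extract (major, minor, patch) from a release string such as '5.4.0-150-generic'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release string: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch component (e.g. '6.1') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def is_below_minimum(release: str, minimum: tuple = MIN_KERNEL) -> bool:
    """Return True if the release is older than the given minimum version."""
    return parse_kernel_version(release) < minimum


if __name__ == "__main__":
    release = platform.release()
    print(f"kernel release: {release}")
    print(f"below {'.'.join(map(str, MIN_KERNEL))}: {is_below_minimum(release)}")
```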
Finally, I got it to work by running the Docker containers with --network=host. I wonder whether it is possible to make this work with Docker's bridge network?
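For anyone trying the bridge-network setup, one thing worth checking first is whether the main node's mapped port is reachable from the other container at all. Below is a rough sketch using only the Python standard library. The helper name `can_connect` is hypothetical, and `MAIN_HOST` is a placeholder for your main node's host address; port 9001 is the one mentioned in this issue.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    MAIN_HOST = "127.0.0.1"  # placeholder: replace with the main node's host IP
    MAIN_PORT = 9001         # port mapped from the container in this issue
    reachable = can_connect(MAIN_HOST, MAIN_PORT)
    print(f"{MAIN_HOST}:{MAIN_PORT} reachable: {reachable}")
```

Run it from the non-main container while the main node is waiting at launch. A successful check only shows that the rendezvous port is open. It does not prove that bridge mode will work, since the distributed backend may open further connections on other ports or advertise container-internal addresses; that is an assumption about the cause here, not something confirmed in this thread.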
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I have two Docker containers acting as two training nodes, and I am using the diffusers text_to_image example to run multi-node distributed training. The two containers' hosts are on the same network, and I map port 9001 of the container to the host.
I configured the main node as below:
and configured the other one as below:
The only difference is the rank assigned to each machine.
When I launch the training script, it hangs on the main node and prints the messages below:
The other node fails and prints the errors below:
Why did the training fail? Did I do something wrong? Please help.