Open Piggy-ch opened 1 year ago
Note that training previously got stuck at this point, and I resolved it by setting environment variables. I'm not sure whether the current issue is related to the environment variables I set.
[INFO] ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
My solution:
export NCCL_IB_DISABLE=1; export NCCL_P2P_DISABLE=1;
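For anyone reproducing this, here is a minimal sketch of the same workaround written out line by line. `NCCL_IB_DISABLE=1` turns off NCCL's InfiniBand transport and `NCCL_P2P_DISABLE=1` turns off direct GPU-to-GPU peer access. The `NCCL_DEBUG=INFO` line is an assumption added for diagnosis and was not part of the original workaround. The launch command is left as a placeholder.

```shell
# Workaround from this report: disable NCCL's InfiniBand and peer-to-peer transports.
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1

# Assumption (not in the original report): verbose NCCL logging,
# which can help show where initialization stalls.
export NCCL_DEBUG=INFO

# Start the multi-GPU run from this same shell so the variables are inherited.
# Placeholder only; substitute your own launch line.
# <your training command>
```

The variables must be exported in the shell that starts the training processes; setting them in a different terminal has no effect on the run.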
Just to add, I am running the Fantasia3D model.
Note that this only occurs when generating a specified OBJ file. It gets stuck at the step before SDF initialization.
I noticed that #81 reported a similar issue, but the solution there didn't work for me. Here is my DEBUG information. Note: training works fine on a single GPU, but not on multiple GPUs. The freeze typically occurs after I have killed a previous training session. When I try to train again, it gets stuck and doesn't proceed.
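Since the freeze tends to follow a killed session, one thing that may be worth checking (a guess, not confirmed in this thread) is whether processes from the killed run are still holding the GPUs. A minimal sketch, assuming `nvidia-smi` is installed; `list_gpu_procs` is a hypothetical helper name:

```shell
# List compute processes currently holding GPU memory.
# Hypothetical helper: prints a notice instead of failing when
# nvidia-smi is missing or the query does not succeed.
list_gpu_procs() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv \
            || echo "nvidia-smi query failed"
    else
        echo "nvidia-smi not found on PATH"
    fi
}

list_gpu_procs
```

If a PID from the killed run still shows up, ending it with `kill <pid>` before relaunching would rule out leftover ranks as the cause.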