openai / guided-diffusion

MIT License
error occurred in MPI_Init_thread #88

Open arunima101 opened 1 year ago

arunima101 commented 1 year ago

I was able to train both the model and the classifier, but sampling fails with the following error:

No protocol specified
[hticimage-pc:3252493] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../src/mca/common/dstore/dstore_segment.c at line 207
[hticimage-pc:3252493] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../src/mca/common/dstore/dstore_base.c at line 693
[hticimage-pc:3252493] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../src/mca/common/dstore/dstore_base.c at line 1850
[hticimage-pc:3252493] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../src/mca/common/dstore/dstore_base.c at line 2808
[hticimage-pc:3252493] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../src/mca/common/dstore/dstore_base.c at line 2857
[hticimage-pc:3252493] PMIX ERROR: OUT-OF-RESOURCE in file ../../../src/server/pmix_server.c at line 3408
[hticimage-pc:3252454] PMIX ERROR: OUT-OF-RESOURCE in file ../../../src/client/pmix_client.c at line 231
[hticimage-pc:3252454] OPAL ERROR: Error in file ext3x_client.c at line 112
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[hticimage-pc:3252454] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[hticimage-pc:3252493] PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds21/gds_ds21_lock_pthread.c at line 99
[hticimage-pc:3252493] PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds21/gds_ds21_lock_pthread.c at line 99

I am not able to debug this. Can anybody help?
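The log alone does not identify a cause. As a first diagnostic step, the sketch below gathers a few facts about the environment the sampling script starts in. It rests on two assumptions that this thread does not confirm: that the PMIX OUT-OF-RESOURCE messages relate to the temporary and shared-memory directories used during MPI start-up, and that "No protocol specified" comes from a refused X11 display connection. The function name `report_mpi_environment` and the default paths are hypothetical choices for illustration, not part of guided-diffusion or Open MPI.

```python
import os
import shutil
import tempfile


def report_mpi_environment(paths=None):
    """Collect free space and writability for directories that MPI start-up
    may use, plus the DISPLAY variable. Diagnostic only; changes nothing."""
    if paths is None:
        # Assumed candidates: the system temp dir and the shared-memory mount.
        paths = [tempfile.gettempdir(), "/dev/shm"]

    report = {"DISPLAY": os.environ.get("DISPLAY"), "paths": {}}
    for path in paths:
        if not os.path.isdir(path):
            # Directory is missing on this machine; record that explicitly.
            report["paths"][path] = None
            continue
        usage = shutil.disk_usage(path)
        report["paths"][path] = {
            "free_mib": usage.free // (1024 * 1024),
            "writable": os.access(path, os.W_OK),
        }
    return report


if __name__ == "__main__":
    info = report_mpi_environment()
    print("DISPLAY =", info["DISPLAY"])
    for path, details in info["paths"].items():
        print(path, "->", details)
```

Running this in the same shell and as the same user that launches sampling shows whether either directory is missing, full, or not writable. If everything looks healthy, the cause likely lies elsewhere and the Open MPI installation itself is worth checking.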