openai / guided-diffusion

MIT License

training error on Ubuntu 22.04 #130

Closed: youyuanyi closed this issue 11 months ago

youyuanyi commented 11 months ago

OS: Ubuntu 22.04
GPU: RTX 3090
Python: 3.10
mpi4py: 3.5.1

train_bash.sh:

#!/bin/bash

MODEL_FLAGS="--image_size 32 --num_channels 128 --num_res_blocks 3 --learn_sigma True --dropout 0.3 --class_cond True "
DIFFUSION_FLAGS="--diffusion_steps 4000 --noise_schedule cosine"
TRAIN_FLAGS="--lr 1e-4 --batch_size 128"

# Train
python scripts/image_train.py --data_dir ../data $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I encountered the following problem while training a diffusion model on the CIFAR-10 dataset. Has anyone else encountered this problem, and how can it be solved?

A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.

python:699378 terminated with signal 6 at PC=7f8740a96a7c SP=7ffe9c925f50.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f8740a96a7c]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f8740a42476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f8740a287f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e)[0x7f873a321b4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeafb8)[0x7f8740aeafb8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71)[0x7f8740aea781]
python(+0x27257e)[0x55d9043c957e]
python(+0x185c51)[0x55d9042dcc51]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyObject_FastCallDictTstate+0x569)[0x55d9043049b9]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(+0x1a58f6)[0x55d9042fc8f6]
python(_PyObject_FastCallDictTstate+0x30b)[0x55d90430475b]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(+0x18fbc7)[0x55d9042e6bc7]
python(+0x18fc3d)[0x55d9042e6c3d]
python(+0x23d931)[0x55d904394931]
python(PyObject_GetIter+0x16)[0x55d9042a1b76]
python(_PyEval_EvalFrameDefault+0x66a7)[0x55d90432ccc7]
python(+0x240615)[0x55d904397615]
python(+0x191f43)[0x55d9042e8f43]
python(+0x185d61)[0x55d9042dcd61]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(+0x1a579a)[0x55d9042fc79a]
python(_PyEval_EvalCodeWithName+0x4b)[0x55d9042fd14b]
python(PyEval_EvalCodeEx+0x44)[0x55d9042fd194]
python(PyEval_EvalCode+0x1c)[0x55d9042fd1bc]
python(+0x2525cd)[0x55d9043a95cd]
python(+0x276196)[0x55d9043cd196]
python(+0x120091)[0x55d904277091]
python(PyRun_SimpleFileExFlags+0x1c1)[0x55d9043d3ee1]
python(Py_RunMain+0x398)[0x55d9043d45b8]
python(Py_BytesMain+0x39)[0x55d9043d4729]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f8740a29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f8740a29e40]
python(+0x203667)[0x55d90435a667]

sibasmarak commented 11 months ago

Hi, I have not faced this issue, but from your error message it looks like a problem with the DataLoader (fork() was most likely called because num_workers=1 is set here). With num_workers=0 the data is loaded in the main process and no worker process is forked, so I recommend trying that setting to narrow down the problem.
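To make the suggestion concrete, here is a minimal sketch of a PyTorch DataLoader built with num_workers=0. The make_loader helper and the random-tensor dataset are hypothetical stand-ins for illustration, not the repository's own loading code; only the DataLoader arguments matter.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(batch_size=128, num_workers=0):
    """Build a loader over a stand-in dataset (hypothetical helper)."""
    # Random CIFAR-10-shaped tensors stand in for the real images and labels.
    images = torch.randn(256, 3, 32, 32)
    labels = torch.randint(0, 10, (256,))
    dataset = TensorDataset(images, labels)
    # num_workers=0 loads every batch in the main process, so the
    # DataLoader does not start any worker process.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        drop_last=True,
    )


if __name__ == "__main__":
    loader = make_loader()
    batch_images, batch_labels = next(iter(loader))
    print(batch_images.shape, batch_labels.shape)
```

In the actual code base the equivalent change would be to the num_workers argument wherever the training DataLoader is constructed. Loading in the main process can be slower, so this is mainly a way to confirm that the worker fork is what triggers the abort.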

Finally, can you check whether this discussion is relevant to your situation?
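If worker processes are needed, the error message itself names another option: setting the RDMAV_FORK_SAFE environment variable, at the performance cost the message warns about. The sketch below is a small launcher that sets the variable before starting the training command. The helper names are hypothetical, and the value "1" is an assumption, since the message only names the variable and not a value.

```python
import os
import subprocess
import sys


def fork_safe_env(base=None):
    """Return a copy of the environment with RDMAV_FORK_SAFE set."""
    env = dict(os.environ if base is None else base)
    # Assumption: "1" as the value; the error message only names the variable.
    # setdefault keeps any value the user has already exported.
    env.setdefault("RDMAV_FORK_SAFE", "1")
    return env


def run(cmd):
    """Run cmd in a child process that sees the modified environment."""
    # The variable is in place before the child loads any library.
    return subprocess.run(cmd, env=fork_safe_env(), check=False).returncode


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: python launch.py <command> [args...]")
    sys.exit(run(sys.argv[1:]))
```

Saved as launch.py (a hypothetical file name), it could wrap the training command, for example `python launch.py python scripts/image_train.py --data_dir ../data` followed by the same flags as in train_bash.sh. Whether this resolves the abort on a given machine is untested here.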

youyuanyi commented 11 months ago

Thank you, it does work!