mingyuliutw / UNIT

Unsupervised Image-to-Image Translation

Segmentation fault problem #13

Closed lightChaserX closed 7 years ago

lightChaserX commented 7 years ago

Hi Dr. Liu,

I ran the training code for attribute-based face image translation.

At around iteration 100, training stops with a segmentation fault.

For example:

Iteration: 00000092/02000000
Iteration: 00000093/02000000
Iteration: 00000094/02000000
Iteration: 00000095/02000000
Iteration: 00000096/02000000
Iteration: 00000097/02000000
Iteration: 00000098/02000000
Iteration: 00000099/02000000
Iteration: 00000100/02000000
Segmentation fault

Here is the stack trace from gdb:

Iteration: 00000101/02000000
Iteration: 00000102/02000000
Iteration: 00000103/02000000
Iteration: 00000104/02000000
Iteration: 00000105/02000000
Iteration: 00000106/02000000
Iteration: 00000107/02000000
Iteration: 00000108/02000000

Program received signal SIGSEGV, Segmentation fault.
0x0000555555632cb0 in ?? ()
(gdb) where
#0  0x0000555555632cb0 in ?? ()
#1  0x0000555555632d95 in ?? ()
#2  0x0000555555631f45 in ?? ()
#3  0x0000555555629b64 in _PyObject_GC_Malloc ()
#4  0x000055555562962d in _PyObject_GC_New ()
#5  0x000055555567d991 in ?? ()
#6  0x000055555566b87f in PyObject_GetIter ()
#7  0x000055555564ff90 in PyEval_EvalFrameEx ()
#8  0x000055555564d285 in PyEval_EvalCodeEx ()
#9  0x000055555566a08e in ?? ()
#10 0x000055555563b983 in PyObject_Call ()
#11 0x0000555555659460 in PyEval_CallObjectWithKeywords ()
#12 0x00007fff8f37becd in THPFunction_apply (cls=0x5555569afc80, _inputs=0x7ffff342b050) at torch/csrc/autograd/python_function.cpp:721
#13 0x000055555564f1aa in PyEval_EvalFrameEx ()
#14 0x000055555564d285 in PyEval_EvalCodeEx ()
#15 0x0000555555654d49 in PyEval_EvalFrameEx ()
#16 0x000055555564d285 in PyEval_EvalCodeEx ()
#17 0x000055555566a248 in ?? ()
#18 0x000055555563b983 in PyObject_Call ()
#19 0x00005555556516bd in PyEval_EvalFrameEx ()
#20 0x000055555564d285 in PyEval_EvalCodeEx ()
#21 0x000055555566a08e in ?? ()
#22 0x000055555563b983 in PyObject_Call ()
#23 0x00005555556805de in ?? ()
#24 0x000055555563b983 in PyObject_Call ()
#25 0x00005555556de6a7 in ?? ()
#26 0x000055555563b983 in PyObject_Call ()
#27 0x0000555555654c5f in PyEval_EvalFrameEx ()
#28 0x000055555564d285 in PyEval_EvalCodeEx ()
#29 0x000055555566a248 in ?? ()
#30 0x000055555563b983 in PyObject_Call ()
#31 0x00005555556516bd in PyEval_EvalFrameEx ()
#32 0x000055555564d285 in PyEval_EvalCodeEx ()
#33 0x000055555566a08e in ?? ()
#34 0x000055555563b983 in PyObject_Call ()
#35 0x00005555556805de in ?? ()
#36 0x000055555563b983 in PyObject_Call ()
#37 0x00005555556de6a7 in ?? ()
#38 0x000055555563b983 in PyObject_Call ()
#39 0x0000555555654c5f in PyEval_EvalFrameEx ()
#40 0x000055555564d285 in PyEval_EvalCodeEx ()
#41 0x000055555566a248 in ?? ()
#42 0x000055555563b983 in PyObject_Call ()
#43 0x00005555556516bd in PyEval_EvalFrameEx ()
#44 0x000055555564d285 in PyEval_EvalCodeEx ()
#45 0x000055555566a08e in ?? ()
#46 0x000055555563b983 in PyObject_Call ()
#47 0x00005555556805de in ?? ()
#48 0x000055555563b983 in PyObject_Call ()
#49 0x00005555556de6a7 in ?? ()
#50 0x000055555563b983 in PyObject_Call ()
#51 0x0000555555654c5f in PyEval_EvalFrameEx ()
#52 0x000055555564d285 in PyEval_EvalCodeEx ()
#53 0x000055555566a248 in ?? ()
#54 0x000055555563b983 in PyObject_Call ()
#55 0x00005555556516bd in PyEval_EvalFrameEx ()
#56 0x000055555564d285 in PyEval_EvalCodeEx ()
#57 0x000055555566a08e in ?? ()
#58 0x000055555563b983 in PyObject_Call ()
#59 0x00005555556805de in ?? ()
#60 0x000055555563b983 in PyObject_Call ()
#61 0x00005555556de6a7 in ?? ()
#62 0x000055555563b983 in PyObject_Call ()
#63 0x0000555555654c5f in PyEval_EvalFrameEx ()
#64 0x000055555564d285 in PyEval_EvalCodeEx ()
#65 0x000055555566a248 in ?? ()
#66 0x000055555563b983 in PyObject_Call ()
#67 0x00005555556516bd in PyEval_EvalFrameEx ()
#68 0x000055555564d285 in PyEval_EvalCodeEx ()
#69 0x000055555566a08e in ?? ()
#70 0x000055555563b983 in PyObject_Call ()
#71 0x00005555556805de in ?? ()
#72 0x000055555563b983 in PyObject_Call ()
#73 0x00005555556de6a7 in ?? ()
#74 0x000055555563b983 in PyObject_Call ()
#75 0x0000555555654c5f in PyEval_EvalFrameEx ()
#76 0x0000555555654a4f in PyEval_EvalFrameEx ()
#77 0x000055555564d285 in PyEval_EvalCodeEx ()
#78 0x000055555565555b in PyEval_EvalFrameEx ()
#79 0x000055555564d285 in PyEval_EvalCodeEx ()
#80 0x000055555564d029 in PyEval_EvalCode ()
#81 0x000055555567d42f in ?? ()
#82 0x00005555556783a2 in PyRun_FileExFlags ()
#83 0x0000555555677eee in PyRun_SimpleFileExFlags ()
#84 0x0000555555628ee1 in Py_Main ()
#85 0x00007ffff6f14b45 in __libc_start_main (main=0x555555628810 <main>, argc=8, argv=0x7fffffffeba8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffeb98) at libc-start.c:287
#86 0x000055555562870a in _start ()

mingyuliutw commented 7 years ago

This is strange. This is the first time I have seen this error. What is your environment? Ubuntu 16.04? CUDA 8? A Pascal Titan X GPU?

lightChaserX commented 7 years ago

I ran the code on two machines:

  1. A server with 64-bit Debian, CUDA 8.0 (V8.0.44), and an NVIDIA Tesla K80
  2. A DGX-1 with Ubuntu 14.04, CUDA 8.0 (V8.0.44), and a Tesla P100

By the way, training with cocogan_train_domain_adaptation.py works fine.

mingyuliutw commented 7 years ago

@JhonsonWanger I ran the code again on all the machines I currently have access to, but I could not reproduce the problem you encountered. Are you using Python 2.7 from Anaconda and the latest PyTorch libraries?
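
For anyone answering the environment questions above, a minimal sketch along these lines may help gather the details in one place. It is not part of the UNIT repository; the function name `collect_environment` is hypothetical. It assumes that `torch.version.cuda` and `torch.backends.cudnn.version()` are available in the installed PyTorch build, and falls back to `None` where they are not.

```python
from __future__ import print_function

import platform
import sys


def collect_environment():
    """Return a dict describing the interpreter and, if present, PyTorch."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "torch": None,
        "cuda": None,
        "cudnn": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; report what we have.
        return info

    info["torch"] = str(torch.__version__)
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["cuda"] = getattr(torch.version, "cuda", None)
    try:
        # An integer such as 6021, or None when cuDNN is unavailable.
        info["cudnn"] = torch.backends.cudnn.version()
    except Exception:
        info["cudnn"] = None
    return info


if __name__ == "__main__":
    for key, value in sorted(collect_environment().items()):
        print("{:<8} {}".format(key, value))
```

Running the script with the same interpreter that launches training matters here, since a system Python and an Anaconda Python can see different PyTorch and cuDNN builds.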

lightChaserX commented 7 years ago

Hi, Dr. Liu:

Thanks a lot! I have fixed this issue.

The issue was probably caused by a dependency problem. I had previously installed the dependencies directly rather than through conda.

This time I installed everything with conda and changed the cuDNN version from 5.1 to 6.0. The error no longer occurs.
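
To confirm which cuDNN release is actually in use after such a change, the integer reported by PyTorch can be decoded into a readable version. The sketch below is an illustration, not code from this repository; the helper names `decode_cudnn_version` and `is_at_least` are hypothetical. It assumes the `MAJOR * 1000 + MINOR * 100 + PATCH` encoding used by cuDNN releases before 9, which covers the 5.1 and 6.0 versions discussed here.

```python
def decode_cudnn_version(value):
    """Split a cuDNN version integer into (major, minor, patch).

    Assumes the pre-9 encoding MAJOR * 1000 + MINOR * 100 + PATCH,
    e.g. 5110 for cuDNN 5.1.10.
    """
    major = value // 1000
    minor = (value % 1000) // 100
    patch = value % 100
    return (major, minor, patch)


def is_at_least(value, minimum):
    """Return True if the decoded version is >= the (major, minor) tuple."""
    return decode_cudnn_version(value)[:2] >= tuple(minimum)


if __name__ == "__main__":
    # Example integers for the two releases mentioned in the thread.
    for raw in (5110, 6021):
        print(raw, decode_cudnn_version(raw), is_at_least(raw, (6, 0)))
```

In a live session the integer would come from `torch.backends.cudnn.version()`; the values in the example are hard-coded so that the sketch runs without PyTorch installed.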