swook / GazeML

Gaze Estimation using Deep Learning, a Tensorflow-based framework.
MIT License
511 stars, 141 forks

Program Hang in training stage #37

Open keishatsai opened 5 years ago

keishatsai commented 5 years ago

Hi, has anyone else run into the problem I have right now? The training code works fine at first, but it gets stuck after saving a checkpoint. No error message appears. I have been waiting for hours, and it is still frozen.

I0730 11:34:55.581835 18580 time_manager.py:50] 0023481> heatmaps_mse = 0.000482749, radius_mse = 2.46166e-08
I0730 11:34:57.720099 18580 time_manager.py:50] 0023497> heatmaps_mse = 0.000477729, radius_mse = 2.61259e-08
I0730 11:35:03.389537 18580 checkpoint_manager.py:86] CheckpointManager::save_all call done

It is just stuck on the last line above.

In a second situation, it gets stuck right after the "Exiting thread preprocess" messages:

I0730 15:46:04.360245 10444 checkpoint_manager.py:86] CheckpointManager::save_all call done
I0730 15:46:04.368223 16840 data_source.py:253] Exiting thread preprocess_UnityEyes_5
I0730 15:46:04.370217 25984 data_source.py:253] Exiting thread preprocess_UnityEyes_4
I0730 15:46:04.371214 29716 data_source.py:253] Exiting thread preprocess_UnityEyes_3
I0730 15:46:04.371214 13788 data_source.py:253] Exiting thread preprocess_UnityEyes_7
I0730 15:46:04.372213 28312 data_source.py:253] Exiting thread preprocess_UnityEyes_0
I0730 15:46:04.372213 28344 data_source.py:253] Exiting thread preprocess_UnityEyes_6
I0730 15:46:04.372213 28700 data_source.py:253] Exiting thread preprocess_UnityEyes_2
I0730 15:46:04.372213  6704 data_source.py:253] Exiting thread preprocess_UnityEyes_1

I was trying to train from scratch with the UnityEyes dataset, and my environment settings are as follows:

Windows 10, CUDA 10.0, cuDNN 7.6, tensorflow-gpu 1.14.0, opencv-python 4.1.0.25, Python 3.6

Or do I need some other dependency to run this repo? I ask because I also have problems getting elg_demo.py to run.

WuZhuoran commented 5 years ago

I am not sure whether this program can run on Windows 10.

But how many UnityEyes images are you using?

After it got stuck following the checkpoint save, did you check whether the program was still running? Did you check with nvidia-smi whether the ELG model was still using the full GPU?

The "Exiting thread" problem can be solved by killing the process.
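
One way to tell a real hang from slow progress is to have the Python process print the stack of every thread at a fixed interval. The sketch below uses the standard-library faulthandler module, which is also available on Windows. The function names and the 10-minute interval are illustrative and not part of GazeML; the call would have to be added near the start of the training script.

```python
import faulthandler
import sys
import threading


def arm_hang_watchdog(timeout_s=600, file=sys.stderr):
    """Write every thread's traceback to `file` each `timeout_s` seconds.

    This is a plain timer, not a hang detector: it dumps whether or not
    the process is stuck. If consecutive dumps show the same frames,
    the process is probably blocked at that point.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disarm_hang_watchdog():
    """Stop the periodic dumps started by arm_hang_watchdog()."""
    faulthandler.cancel_dump_traceback_later()


def live_thread_names():
    """Return the names of all threads that are still alive."""
    return sorted(t.name for t in threading.enumerate())
```

If the dumps keep showing the main thread waiting on a queue or a thread join while the preprocess threads have already exited, that would point to a shutdown deadlock rather than to training still running.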

keishatsai commented 5 years ago

Hi @WuZhuoran, thanks for replying. It was using nearly the full GPU (7 GB of 8 GB memory) while it hung. I had nvidia-smi report every 5 seconds, so I know.
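
Memory usage alone may not settle the question: by default, TensorFlow 1.x reserves most of the GPU memory at startup, so 7 GB of 8 GB can stay allocated even when nothing is being computed. GPU utilization is usually the more telling number. A minimal sketch of polling it, assuming nvidia-smi is on the PATH; the helper names and the 5% idle threshold are arbitrary choices for illustration:

```python
import subprocess
import time

# Query fields supported by nvidia-smi's --query-gpu option.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as '0, 7168, 8192' into a dict."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}


def looks_idle(sample, util_threshold=5):
    """True if utilization is at or below the (arbitrary) threshold."""
    return sample["util_percent"] <= util_threshold


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return [parse_gpu_line(l) for l in out.splitlines() if l.strip()]


def watch(interval_s=5, samples=12):
    """Print utilization and memory for each GPU every `interval_s` seconds."""
    for _ in range(samples):
        for index, gpu in enumerate(query_gpus()):
            state = "idle" if looks_idle(gpu) else "busy"
            print(index, state, gpu)
        time.sleep(interval_s)
```

Calling watch() while the program appears frozen shows whether utilization stays near zero with the memory still allocated, which would suggest the process is blocked rather than training.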

Currently, I have prepared 7524 images for training.

What do you mean by "Exiting thread problem can be solved by killing process"?

WuZhuoran commented 5 years ago

I mean that if you find you cannot stop the process, you can use the command:

kill -9 ${PROCESS_ID}

to exit the process.
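
Note that kill -9 is a Unix command. On Windows, which is the environment reported above, the usual equivalent is taskkill with the /F (force) flag. A small sketch that picks the command by platform; the helper names are hypothetical:

```python
import platform
import subprocess


def force_kill_command(pid, system=None):
    """Build the force-kill command for the given (or current) platform."""
    system = system or platform.system()
    if system == "Windows":
        # /F forces termination of the process with this PID.
        return ["taskkill", "/PID", str(pid), "/F"]
    # SIGKILL on Linux and macOS.
    return ["kill", "-9", str(pid)]


def force_kill(pid):
    """Run the force-kill command and return its exit code."""
    return subprocess.call(force_kill_command(pid))
```

Either way the process gets no chance to clean up, so anything not already written to a checkpoint is lost.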

keishatsai commented 5 years ago

@WuZhuoran So did you encounter this too? Were you able to kill the process normally? Actually, I am not sure whether training has finished or not. If I kill it, that means I have to restart training over and over again.

WuZhuoran commented 5 years ago

@keishatsai I have encountered this before. But most of the time, I can stop the process normally.
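
On the worry about restarting from scratch: the log shows CheckpointManager::save_all completing before the hang, so a checkpoint should already be on disk. TensorFlow's Saver keeps a small text file named checkpoint next to the saved files, with a model_checkpoint_path entry naming the newest one. The sketch below reads that file without importing TensorFlow. The directory passed in is hypothetical, since this thread does not say where GazeML writes its checkpoints, and whether training actually resumes from the checkpoint depends on the training script.

```python
import os
import re


def latest_checkpoint_name(checkpoint_dir):
    """Return the newest checkpoint recorded in `checkpoint_dir`, or None.

    Looks for the `checkpoint` state file that tf.train.Saver writes and
    extracts the value of its `model_checkpoint_path` entry.
    """
    state_file = os.path.join(checkpoint_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None
    with open(state_file, "r") as f:
        for line in f:
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"\s*$', line)
            if match:
                return match.group(1)
    return None
```

If this returns a name whose step number matches the last step in the log, killing the hung process should cost at most the work done since that save.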