I ran train.py and got an EOFError; however, the process finished with exit code 0. It seems the main process finished without error, but the child processes hit a bug.
After checking the code, I found that some daemon processes are used to load the training data into train_queue. What confuses me is that these daemons run a while True loop, and I could not find any code designed to stop them. So I am wondering whether the EOFError is expected.
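The setup can be reduced to a sketch like this. It is a simplified reconstruction, not the actual code in rrn.py; the names data_loader and train_queue follow the repo, everything else is illustrative:

```python
import multiprocessing as mp
import time

def data_loader(queue):
    # Mirrors the shape of rrn.py's data_loader: an endless producer loop
    # with no stop condition (simplified reconstruction, not the real code).
    while True:
        queue.put("batch")  # blocks once the queue is full

def main():
    manager = mp.Manager()                 # queue proxy talks to a manager process
    train_queue = manager.Queue(maxsize=2)
    workers = [mp.Process(target=data_loader, args=(train_queue,), daemon=True)
               for _ in range(2)]
    for w in workers:
        w.start()
    time.sleep(1)                          # stand-in for the training loop
    # Returning here lets the main process exit. At interpreter shutdown the
    # manager process is stopped and the proxy connection closes, so a worker
    # still blocked inside queue.put() can see EOFError -- while the main
    # process itself exits cleanly with code 0.
    return workers

if __name__ == "__main__":
    main()
```

If the timing at shutdown works out the same way, the daemon workers hit the same EOFError inside queue.put while the script itself still reports exit code 0, which looks like what I am seeing.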
Thanks in advance for any info you can provide!
Here is the log from the training stage:
/home/pyxies/anaconda3/envs/py35/bin/python /home/pyxies/code/recurrent-relational-networks/tasks/babi/train.py
Using TensorFlow backend.
WARNING: No GPU's found. Using CPU
Using devices: ['cpu:0']
Preparing data...
2019-08-07 00:38:31.491702: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Creating graph...
/home/pyxies/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
word-embeddings/embeddings:0 (177, 32)
fact-encoder/lstm_cell/kernel:0 (64, 128)
fact-encoder/lstm_cell/bias:0 (128,)
question-encoder/lstm_cell/kernel:0 (64, 128)
question-encoder/lstm_cell/bias:0 (128,)
pre/fully_connected/weights:0 (104, 128)
pre/fully_connected/biases:0 (128,)
pre/fully_connected_1/weights:0 (128, 128)
pre/fully_connected_1/biases:0 (128,)
pre/fully_connected_2/weights:0 (128, 128)
pre/fully_connected_2/biases:0 (128,)
pre/fully_connected_3/weights:0 (128, 128)
pre/fully_connected_3/biases:0 (128,)
steps/message-fn/fully_connected/weights:0 (288, 128)
steps/message-fn/fully_connected/biases:0 (128,)
steps/message-fn/fully_connected_1/weights:0 (128, 128)
steps/message-fn/fully_connected_1/biases:0 (128,)
steps/message-fn/fully_connected_2/weights:0 (128, 128)
steps/message-fn/fully_connected_2/biases:0 (128,)
steps/message-fn/fully_connected_3/weights:0 (128, 128)
steps/message-fn/fully_connected_3/biases:0 (128,)
steps/post-fn/fully_connected/weights:0 (256, 128)
steps/post-fn/fully_connected/biases:0 (128,)
steps/post-fn/fully_connected_1/weights:0 (128, 128)
steps/post-fn/fully_connected_1/biases:0 (128,)
steps/post-fn/fully_connected_2/weights:0 (128, 128)
steps/post-fn/fully_connected_2/biases:0 (128,)
steps/post-fn/fully_connected_3/weights:0 (128, 128)
steps/post-fn/fully_connected_3/biases:0 (128,)
steps/lstm_cell/kernel:0 (256, 512)
steps/lstm_cell/bias:0 (512,)
steps/graph-sum/graph-fn/fully_connected/weights:0 (128, 128)
steps/graph-sum/graph-fn/fully_connected/biases:0 (128,)
steps/graph-sum/graph-fn/fully_connected_1/weights:0 (128, 128)
steps/graph-sum/graph-fn/fully_connected_1/biases:0 (128,)
steps/graph-sum/graph-fn/fully_connected_2/weights:0 (128, 177)
steps/graph-sum/graph-fn/fully_connected_2/biases:0 (177,)
441681
Starting data loaders...
Waiting for queue to fill...
val 6.012589 batches/s, 1 starved 1 total qsize 0
val 7.400931 batches/s, 2 starved 2 total qsize 1
val 5.558665 batches/s, 2 starved 3 total qsize 3
val 7.163031 batches/s, 2 starved 4 total qsize 2
train_qsize: 1, val_qsize: 2
......<omit some output here>
val 15.590201 batches/s, 118 starved 216 total qsize 3
val 15.654320 batches/s, 118 starved 217 total qsize 3
val 15.718273 batches/s, 118 starved 218 total qsize 2
train_qsize: 100, val_qsize: 100
00000/00100 1.310306 updates/s 5.256005 loss
val 9.275140 batches/s, 118 starved 219 total qsize 83
00010/00100 0.276424 updates/s 5.061219 loss
val 3.525461 batches/s, 118 starved 220 total qsize 100
00020/00100 0.327731 updates/s 4.658732 loss
val 2.330294 batches/s, 118 starved 221 total qsize 100
00030/00100 0.322194 updates/s 4.153487 loss
val 1.736846 batches/s, 118 starved 222 total qsize 100
00040/00100 0.359319 updates/s 3.952793 loss
val 1.418118 batches/s, 118 starved 223 total qsize 100
00050/00100 0.347661 updates/s 3.769470 loss
val 1.196272 batches/s, 118 starved 224 total qsize 100
00060/00100 0.318332 updates/s 3.534736 loss
val 1.022213 batches/s, 118 starved 225 total qsize 100
00070/00100 0.330162 updates/s 3.429694 loss
val 0.896351 batches/s, 118 starved 226 total qsize 100
00080/00100 0.347375 updates/s 3.273200 loss
val 0.804948 batches/s, 118 starved 227 total qsize 100
00090/00100 0.350296 updates/s 3.312094 loss
val 0.730228 batches/s, 118 starved 228 total qsize 100
Process Process-7:
Traceback (most recent call last):
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
self.run()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/pyxies/code/recurrent-relational-networks/tasks/babi/rrn.py", line 207, in data_loader
queue.put(self.get_batch(is_training))
File "<string>", line 2, in put
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/managers.py", line 717, in _callmethod
kind, result = conn.recv()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
Process Process-5:
Process Process-4:
Process Process-6:
Traceback (most recent call last):
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
self.run()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/pyxies/code/recurrent-relational-networks/tasks/babi/rrn.py", line 207, in data_loader
queue.put(self.get_batch(is_training))
File "<string>", line 2, in put
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/managers.py", line 717, in _callmethod
kind, result = conn.recv()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
Traceback (most recent call last):
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
self.run()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/pyxies/code/recurrent-relational-networks/tasks/babi/rrn.py", line 207, in data_loader
queue.put(self.get_batch(is_training))
File "<string>", line 2, in put
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/managers.py", line 717, in _callmethod
kind, result = conn.recv()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
Traceback (most recent call last):
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
self.run()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/pyxies/code/recurrent-relational-networks/tasks/babi/rrn.py", line 207, in data_loader
queue.put(self.get_batch(is_training))
File "<string>", line 2, in put
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/managers.py", line 717, in _callmethod
kind, result = conn.recv()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
Process finished with exit code 0