rasmusbergpalm / recurrent-relational-networks

Code accompanying the paper Recurrent Relational Networks for Complex Relational Reasoning https://arxiv.org/abs/1711.08028

The daemon raised EOFError #3

Closed iamxpy closed 5 years ago

iamxpy commented 5 years ago

I ran train.py and got an EOFError; however, the process finished with exit code 0. It seems that the main process finished without error, but the child processes hit a problem.

After checking the code, I found that several daemon processes are used to load the training data into train_queue. What confuses me is that these daemons run a while True loop, and I could not find any code designed to stop them. So I am wondering whether the EOFError is expected.
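
On the question of stopping the loaders: the traceback in the log below passes through multiprocessing/managers.py, which suggests that the queue is a manager proxy. One plausible reading, not confirmed anywhere in this thread, is that the main process finishes its 100 updates and exits, the manager goes away, and each daemon still blocked in queue.put then sees EOFError. The sketch below is a hypothetical loader loop, not the repository's data_loader; the names stop_event and make_batch are assumptions introduced for illustration. It shows one way such a loop could exit quietly, either when asked to stop or when the queue's owner has disappeared.

```python
import queue
import threading


def data_loader(out_queue, stop_event, make_batch):
    """Put batches on out_queue until stop_event is set or the queue goes away.

    Hypothetical sketch: returns the number of batches successfully queued.
    """
    produced = 0
    while not stop_event.is_set():
        batch = make_batch()
        # Retry the same batch while the queue is full, re-checking the stop flag.
        while not stop_event.is_set():
            try:
                # The timeout keeps put() from blocking forever on a full queue.
                out_queue.put(batch, timeout=0.1)
                produced += 1
                break
            except queue.Full:
                continue
            except (EOFError, ConnectionError):
                # With a manager-backed queue, these indicate that the manager
                # process is gone (for example, the parent has already exited).
                return produced
    return produced


if __name__ == "__main__":
    # In-process demonstration with a thread and a plain queue.Queue.
    q = queue.Queue(maxsize=4)
    stop = threading.Event()
    worker = threading.Thread(
        target=data_loader, args=(q, stop, lambda: [0] * 8), daemon=True
    )
    worker.start()
    q.get()            # consume one batch, as a training step would
    stop.set()         # ask the loader to finish
    worker.join(timeout=2)
    print("loader stopped cleanly:", not worker.is_alive())
```

With real worker processes, stop_event could be a multiprocessing.Event and out_queue the manager queue; the parent would set the event and join the workers before exiting, so that no loader is left blocked in put when the manager shuts down.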

Thanks in advance for any info you can provide!

Here is the log from the training stage:

/home/pyxies/anaconda3/envs/py35/bin/python /home/pyxies/code/recurrent-relational-networks/tasks/babi/train.py
Using TensorFlow backend.
WARNING: No GPU's found. Using CPU
Using devices:  ['cpu:0']
Preparing data...
2019-08-07 00:38:31.491702: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Creating graph...
/home/pyxies/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
word-embeddings/embeddings:0 (177, 32)
fact-encoder/lstm_cell/kernel:0 (64, 128)
fact-encoder/lstm_cell/bias:0 (128,)
question-encoder/lstm_cell/kernel:0 (64, 128)
question-encoder/lstm_cell/bias:0 (128,)
pre/fully_connected/weights:0 (104, 128)
pre/fully_connected/biases:0 (128,)
pre/fully_connected_1/weights:0 (128, 128)
pre/fully_connected_1/biases:0 (128,)
pre/fully_connected_2/weights:0 (128, 128)
pre/fully_connected_2/biases:0 (128,)
pre/fully_connected_3/weights:0 (128, 128)
pre/fully_connected_3/biases:0 (128,)
steps/message-fn/fully_connected/weights:0 (288, 128)
steps/message-fn/fully_connected/biases:0 (128,)
steps/message-fn/fully_connected_1/weights:0 (128, 128)
steps/message-fn/fully_connected_1/biases:0 (128,)
steps/message-fn/fully_connected_2/weights:0 (128, 128)
steps/message-fn/fully_connected_2/biases:0 (128,)
steps/message-fn/fully_connected_3/weights:0 (128, 128)
steps/message-fn/fully_connected_3/biases:0 (128,)
steps/post-fn/fully_connected/weights:0 (256, 128)
steps/post-fn/fully_connected/biases:0 (128,)
steps/post-fn/fully_connected_1/weights:0 (128, 128)
steps/post-fn/fully_connected_1/biases:0 (128,)
steps/post-fn/fully_connected_2/weights:0 (128, 128)
steps/post-fn/fully_connected_2/biases:0 (128,)
steps/post-fn/fully_connected_3/weights:0 (128, 128)
steps/post-fn/fully_connected_3/biases:0 (128,)
steps/lstm_cell/kernel:0 (256, 512)
steps/lstm_cell/bias:0 (512,)
steps/graph-sum/graph-fn/fully_connected/weights:0 (128, 128)
steps/graph-sum/graph-fn/fully_connected/biases:0 (128,)
steps/graph-sum/graph-fn/fully_connected_1/weights:0 (128, 128)
steps/graph-sum/graph-fn/fully_connected_1/biases:0 (128,)
steps/graph-sum/graph-fn/fully_connected_2/weights:0 (128, 177)
steps/graph-sum/graph-fn/fully_connected_2/biases:0 (177,)
441681
Starting data loaders...
Waiting for queue to fill...
val 6.012589 batches/s, 1 starved 1 total qsize 0
val 7.400931 batches/s, 2 starved 2 total qsize 1
val 5.558665 batches/s, 2 starved 3 total qsize 3
val 7.163031 batches/s, 2 starved 4 total qsize 2
train_qsize: 1, val_qsize: 2
......<omit some output here>
val 15.590201 batches/s, 118 starved 216 total qsize 3
val 15.654320 batches/s, 118 starved 217 total qsize 3
val 15.718273 batches/s, 118 starved 218 total qsize 2
train_qsize: 100, val_qsize: 100
00000/00100 1.310306 updates/s 5.256005 loss
val 9.275140 batches/s, 118 starved 219 total qsize 83
00010/00100 0.276424 updates/s 5.061219 loss
val 3.525461 batches/s, 118 starved 220 total qsize 100
00020/00100 0.327731 updates/s 4.658732 loss
val 2.330294 batches/s, 118 starved 221 total qsize 100
00030/00100 0.322194 updates/s 4.153487 loss
val 1.736846 batches/s, 118 starved 222 total qsize 100
00040/00100 0.359319 updates/s 3.952793 loss
val 1.418118 batches/s, 118 starved 223 total qsize 100
00050/00100 0.347661 updates/s 3.769470 loss
val 1.196272 batches/s, 118 starved 224 total qsize 100
00060/00100 0.318332 updates/s 3.534736 loss
val 1.022213 batches/s, 118 starved 225 total qsize 100
00070/00100 0.330162 updates/s 3.429694 loss
val 0.896351 batches/s, 118 starved 226 total qsize 100
00080/00100 0.347375 updates/s 3.273200 loss
val 0.804948 batches/s, 118 starved 227 total qsize 100
00090/00100 0.350296 updates/s 3.312094 loss
val 0.730228 batches/s, 118 starved 228 total qsize 100
Process Process-7:
Traceback (most recent call last):
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pyxies/code/recurrent-relational-networks/tasks/babi/rrn.py", line 207, in data_loader
    queue.put(self.get_batch(is_training))
  File "<string>", line 2, in put
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/managers.py", line 717, in _callmethod
    kind, result = conn.recv()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Process Process-5:
Process Process-4:
Process Process-6:
Traceback (most recent call last):
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pyxies/code/recurrent-relational-networks/tasks/babi/rrn.py", line 207, in data_loader
    queue.put(self.get_batch(is_training))
  File "<string>", line 2, in put
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/managers.py", line 717, in _callmethod
    kind, result = conn.recv()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Traceback (most recent call last):
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pyxies/code/recurrent-relational-networks/tasks/babi/rrn.py", line 207, in data_loader
    queue.put(self.get_batch(is_training))
  File "<string>", line 2, in put
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/managers.py", line 717, in _callmethod
    kind, result = conn.recv()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Traceback (most recent call last):
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/pyxies/code/recurrent-relational-networks/tasks/babi/rrn.py", line 207, in data_loader
    queue.put(self.get_batch(is_training))
  File "<string>", line 2, in put
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/managers.py", line 717, in _callmethod
    kind, result = conn.recv()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/pyxies/anaconda3/envs/py35/lib/python3.5/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Process finished with exit code 0

Gromy1211 commented 3 years ago

Hi, I ran into the same error. Could you please tell me how you solved it?