localminimum / QANet

A Tensorflow implementation of QANet for machine reading comprehension

Training stops after some time #47

Closed emreorta closed 5 years ago

emreorta commented 5 years ago

Hello everyone,

I've been trying to train a model with different num_heads, hidden, and num_steps parameters. The default parameters in config.py work like a charm, but once I change the mentioned parameters, I get this:

Exception ignored in: <bound method tqdm.__del__ of  42%|██████████████████████▉                                | 49999/120000 [15:34:24<18:06:29,  1.07it/s]>
Traceback (most recent call last):█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 328/328 [02:05<00:00,  2.53it/s]
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 889, in __del__
    self.close()
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1095, in close
    self._decr_instances(self)
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 454, in _decr_instances
    cls.monitor.exit()
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/usr/lib/python3.5/threading.py", line 1051, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

This occurred when I set num_heads to 2, 4, and 8. I could train up to 50k and 54k steps when num_heads was set to 2 and 4, respectively, and it failed from the start when num_heads was set to 8.
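For context, the kind of change I'm making looks roughly like this (a sketch only: the three flag names are the ones I'm editing in config.py, but the help strings and all values except the 120k steps visible in the progress bar are placeholders, not the repo's actual defaults):

```python
# Illustrative sketch of the three parameters being changed in config.py.
# Only the flag names come from this thread; the defaults and help strings
# here are placeholders.
import tensorflow as tf

flags = tf.flags

flags.DEFINE_integer("num_heads", 8, "number of attention heads")      # tried 2, 4 and 8
flags.DEFINE_integer("hidden", 128, "hidden size of the model")        # placeholder value
flags.DEFINE_integer("num_steps", 120000, "number of training steps")
```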

I'm using Ubuntu 16.04, Python 3.5.2 and training the network on a GPU. Here's the nvidia-smi and nvcc --version output if someone needs it:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   72C    P0    63W / 149W |      0MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

So what could be the real cause of this error?

Thanks in advance!

theSage21 commented 5 years ago

Is this a partial traceback? That particular error usually has more along with it.

emreorta commented 5 years ago

I copied everything when the error occurred, as far as I remember.

I'll train the network and try to replicate the error again so that I can post the full traceback and see if anything is missing. I'll let you know once I get the result.

emreorta commented 5 years ago

@theSage21 I have the result now. I ran python config.py --mode train and got this:

Building model...
WARNING:tensorflow:From /home/username/folder_name/QANet/layers.py:54: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/username/folder_name/QANet/model.py:134: calling softmax (from tensorflow.python.ops.nn_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead
WARNING:tensorflow:From /home/username/folder_name/QANet/model.py:174: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

Total number of trainable parameters: 1295553
2018-10-12 15:23:21.789612: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-10-12 15:23:24.608200: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-12 15:23:24.608623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-10-12 15:23:24.608677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-12 15:23:35.135945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-12 15:23:35.135982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2018-10-12 15:23:35.135993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2018-10-12 15:23:35.136271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
  0%|          | 0/120000 [00:00<?, ?it/s]2018-10-12 15:24:25.500978: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 6682 of 15000
2018-10-12 15:24:35.501937: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 14883 of 15000
2018-10-12 15:24:35.631463: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:136] Shuffle buffer filled.
Exception ignored in: <bound method tqdm.__del__ of  46%|█████████████████████████████████████████████████████████████████████████▊                                                                                       | 54999/120000 [17:08:45<17:58:22,  1.00it/s]
> ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 328/328 [02:05<00:00,  2.39it/s]
Traceback (most recent call last):
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 889, in __del__
    self.close()
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1095, in close
    self._decr_instances(self)
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 454, in _decr_instances
    cls.monitor.exit()
  File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/usr/lib/python3.5/threading.py", line 1051, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

Let me know if you need anything else and thanks in advance!

theSage21 commented 5 years ago

I don't know. The `Exception ignored in:` part seems suspicious, but otherwise I can't tell. Maybe if you remove tqdm on that particular line and let it fail, the traceback will be different. I recall tqdm used to have problems with multiprocessing code. A rough way to take its monitor thread out of the picture is sketched below.
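This is only a sketch, not something I've tested against this repo (the training loop below is a placeholder), but disabling tqdm's monitor thread before any bars are created should rule it out, since the traceback above fails inside that thread's cleanup:

```python
# Sketch: disable tqdm's background monitor thread before creating any bars.
# The traceback fails in tqdm's cleanup while joining the monitor thread from
# within that same thread; monitor_interval = 0 prevents the thread from being
# started in the first place.
from tqdm import tqdm

tqdm.monitor_interval = 0  # class attribute; set it before the first tqdm(...) call

for step in tqdm(range(120000), total=120000):
    pass  # placeholder for the actual training step
```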

hezhihao10 commented 5 years ago

I have the same problem when I try to run the code. Have you solved it?

emreorta commented 5 years ago

I haven't found time to work on it, as I've been quite busy at work with some unexpected tasks. But once I find a solution, I'll post it here.

localminimum commented 5 years ago

Hi, which version of tqdm are you currently using? It could be related to a version problem, according to this issue. Could you try reverting tqdm back to 4.19 or even lower and running again? Thanks!
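A quick way to confirm which version the training script actually picks up (a minimal sketch; the `tqdm<4.20` pin in the comment is just my reading of "4.19 or even lower", not a version called out elsewhere in this thread):

```python
# Print the tqdm version that the virtualenv's Python will import.
import tqdm

print(tqdm.__version__)

# If this shows 4.20 or newer, downgrading inside the virtualenv should work, e.g.:
#   pip install "tqdm<4.20"
```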

emreorta commented 5 years ago

Hello,

I'm using tqdm v4.26.0 right now. I'll do what you suggest and report back as soon as I have the results.

Thanks for the answer!

emreorta commented 5 years ago

I've trained several networks using different configurations, and it seems like downgrading tqdm fixed the problem. Thanks @localminimum!