Closed: emreorta closed this issue 6 years ago
Is this a partial traceback? That particular error usually comes with more output along with it.
As far as I remember, I copied everything when the error occurred.
I'll train the network and try to replicate the error so I can post the full traceback and check whether anything is missing. I'll let you know once I get the result.
@theSage21 I have the result now. I ran python config.py --mode train and got this:
Building model...
WARNING:tensorflow:From /home/username/folder_name/QANet/layers.py:54: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/username/folder_name/QANet/model.py:134: calling softmax (from tensorflow.python.ops.nn_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead
WARNING:tensorflow:From /home/username/folder_name/QANet/model.py:174: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
Total number of trainable parameters: 1295553
2018-10-12 15:23:21.789612: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-10-12 15:23:24.608200: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-12 15:23:24.608623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-10-12 15:23:24.608677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-12 15:23:35.135945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-12 15:23:35.135982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2018-10-12 15:23:35.135993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2018-10-12 15:23:35.136271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
0%| | 0/120000 [00:00<?, ?it/s]2018-10-12 15:24:25.500978: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 6682 of 15000
2018-10-12 15:24:35.501937: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 14883 of 15000
2018-10-12 15:24:35.631463: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:136] Shuffle buffer filled.
Exception ignored in: <bound method tqdm.__del__ of 46%|█████████████████████████████████████████████████████████████████████████▊ | 54999/120000 [17:08:45<17:58:22, 1.00it/s]
> ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 328/328 [02:05<00:00, 2.39it/s]
Traceback (most recent call last):
File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 889, in __del__
self.close()
File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1095, in close
self._decr_instances(self)
File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_tqdm.py", line 454, in _decr_instances
cls.monitor.exit()
File "/home/username/.virtualenvs/qanet/lib/python3.5/site-packages/tqdm/_monitor.py", line 52, in exit
self.join()
File "/usr/lib/python3.5/threading.py", line 1051, in join
raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
Let me know if you need anything else and thanks in advance!
I don't know. The "Exception ignored in:" part seems suspicious, but otherwise I can't tell. Maybe if you remove tqdm from that particular line and let it fail, the traceback will be different. I recall tqdm used to have problems with multiprocessing code.
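To show what I mean, here's a rough sketch, assuming the training loop wraps its step range in tqdm (the names and values below are illustrative, not the actual repo code):

num_steps = 120000  # matches the 0/120000 counter in the log above

# Original style (commented out):
# from tqdm import tqdm
# for step in tqdm(range(num_steps)):
#     run_training_step(step)  # hypothetical training call

# tqdm-free version for debugging, so a failure raises its own traceback
# instead of tqdm's "cannot join current thread" noise at shutdown:
for step in range(num_steps):
    # run_training_step(step)  # hypothetical training call
    if step % 10000 == 0:
        print("step", step, "of", num_steps)

With the wrapper gone, whatever actually kills the run should show up as a normal traceback.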
I have the same problem when I try to run the code. Have you solved it?
I haven't found time to work on it, as I've been quite busy at work with unexpected tasks. But once I find a solution, I'll post it here.
Hi, which version of tqdm are you currently using? It could be related to a version problem, according to this issue. Could you try reverting tqdm back to 4.19 or even lower and trying again? Thanks!
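For example, inside the virtualenv (assuming pip is used there; the exact patch release may vary):

pip install "tqdm<4.20"

That should pull the newest 4.19.x release.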
Hello,
I'm using tqdm v4.26.0 right now. I'll try that and report back as soon as I have results.
Thanks for the answer!
I've trained several networks using different configurations, and it seems like downgrading tqdm fixed the problem. Thanks @localminimum!
Hello everyone,
I've been trying to train a model with different num_heads, hidden, and num_steps parameters. The default parameters in config.py work like a charm, but once I change the mentioned parameters, I get this:

This occurred when I set num_heads to 2, 4, and 8. I could train up to 50k and 54k steps when num_heads was set to 2 and 4, and it failed from the start when num_heads was set to 8.

I'm using Ubuntu 16.04, Python 3.5.2 and training the network on a GPU. Here's the nvidia-smi and nvcc --version output if someone needs it:

So what could be the real cause of this error?
Thanks in advance!