This error means the number of variables doesn't match. I tested line 197 on my side and it worked fine, which is pretty weird. It should load the file dataset/keyphrase/punctuation-20000validation-20000testing/all_600k_dataset.pkl and unpack 5 variables. Could you set a breakpoint there to see how many variables deserialize_from_file(config['dataset']) returns?
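If it helps, a quick check you could run at that line (just a debugging sketch):

loaded = deserialize_from_file(config['dataset'])
print(type(loaded), len(loaded))
# expected: a 5-tuple of train_set, validation_set, test_sets, idx2word, word2idx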
If it consumes too much memory and crashes, I recommend setting a smaller config['mini_mini_batch_length']. I resize each batch according to the text length. It's a stupid trick, but it works pretty well for me.
My graphics card is a GeForce GTX 980 Ti with 6 GB of memory. Your other settings look similar to mine.
deserialize_from_file(config['dataset']) returns 5 variables, BUT the run still shuts down later in the process. It's really weird.
09/23/2017 17:56:27 [INFO] generic_utils: 193/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3240s - ETA: 85380s - loss_reg: 12.9687 - ppl.: 1834.6243
09/23/2017 17:56:42 [INFO] generic_utils: 194/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3255s - ETA: 85315s - loss_reg: 12.9582 - ppl.: 1826.3660
09/23/2017 17:57:03 [INFO] generic_utils: 195/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3276s - ETA: 85413s - loss_reg: 12.9530 - ppl.: 1819.6269
09/23/2017 17:57:17 [INFO] generic_utils: 196/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3290s - ETA: 85329s - loss_reg: 12.9422 - ppl.: 1811.8850
09/23/2017 17:57:35 [INFO] generic_utils: 197/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3309s - ETA: 85346s - loss_reg: 12.9285 - ppl.: 1803.5442
09/23/2017 17:57:47 [INFO] generic_utils: 198/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3320s - ETA: 85201s - loss_reg: 12.9171 - ppl.: 1795.5123
09/23/2017 17:58:03 [INFO] generic_utils: 199/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3336s - ETA: 85163s - loss_reg: 12.9053 - ppl.: 1787.5648
09/23/2017 17:58:18 [INFO] generic_utils: 200/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3351s - ETA: 85092s - loss_reg: 12.8951 - ppl.: 1779.8260
09/23/2017 17:58:18 [INFO] keyphrase_copynet: Echo=200 Evaluation Sampling.
09/23/2017 17:58:18 [INFO] keyphrase_copynet: generating [training set] samples
09/23/2017 17:58:18 [INFO] covc_encdec: Depth=0, get 1 outputs
09/23/2017 17:58:19 [INFO] covc_encdec: Depth=1, get 54 outputs
09/23/2017 17:58:20 [INFO] covc_encdec: Depth=2, get 201 outputs
09/23/2017 17:58:21 [INFO] covc_encdec: Depth=3, get 254 outputs
09/23/2017 17:58:22 [INFO] covc_encdec: Depth=4, get 401 outputs
09/23/2017 17:58:22 [INFO] covc_encdec: Depth=5, get 454 outputs
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 212, in <module>
    logger.info('#(training paper)=%d' % len(train_set['source']))
ValueError: too many values to unpack
I tried config['mini_mini_batch_length'] = 3000, but it seems it can't be that small. I get this:
09/24/2017 11:54:25 [INFO] keyphrase_copynet: compile ok.
09/24/2017 11:54:25 [INFO] keyphrase_copynet: Epoch = 1 -> Training Set Learning...
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 355, in <module>
    data_c = cc_martix(mini_data_s, mini_data_t)
  File "keyphrase_copynet.py", line 111, in cc_martix
    cc = np.zeros((source.shape[0], target.shape[1], source.shape[1]), dtype='float32')
IndexError: tuple index out of range
Still, I want to simplify the enc-dec model. I know I can set config['enc_embedd_dim'] etc. to a small number. Is 10 OK? Right now I just want a quick training run.
Maybe the training process didn't stop at deserialize_from_file(config['dataset']) but at another line? Below is the newest attempt.
... - Run-time: 5957s - ETA: 158664s - loss_reg: 11.2822 - ppl.: 1071.5933
09/24/2017 13:57:08 [INFO] generic_utils: 192/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 5993s - ETA: 158761s - loss_reg: 11.2724 - ppl.: 1067.2454
09/24/2017 13:57:23 [INFO] generic_utils: 193/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6008s - ETA: 158307s - loss_reg: 11.2619 - ppl.: 1062.7955
09/24/2017 13:57:42 [INFO] generic_utils: 194/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6028s - ETA: 157977s - loss_reg: 11.2542 - ppl.: 1058.3417
09/24/2017 13:58:33 [INFO] generic_utils: 195/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6079s - ETA: 158460s - loss_reg: 11.2447 - ppl.: 1054.0008
09/24/2017 13:59:49 [INFO] generic_utils: 196/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6154s - ETA: 159575s - loss_reg: 11.2369 - ppl.: 1049.6457
09/24/2017 14:00:07 [INFO] generic_utils: 197/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6172s - ETA: 159205s - loss_reg: 11.2290 - ppl.: 1045.4208
09/24/2017 14:00:24 [INFO] generic_utils: 198/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6189s - ETA: 158801s - loss_reg: 11.2233 - ppl.: 1041.6077
09/24/2017 14:00:52 [INFO] generic_utils: 199/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6218s - ETA: 158704s - loss_reg: 11.2162 - ppl.: 1037.3962
09/24/2017 14:01:12 [INFO] generic_utils: 200/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6237s - ETA: 158368s - loss_reg: 11.2138 - ppl.: 1035.6492
09/24/2017 14:01:12 [INFO] keyphrase_copynet: Echo=200 Evaluation Sampling.
09/24/2017 14:01:12 [INFO] keyphrase_copynet: generating [training set] samples
09/24/2017 14:01:12 [INFO] covc_encdec: Depth=0, get 0 outputs
09/24/2017 14:01:13 [INFO] covc_encdec: Depth=1, get 79 outputs
09/24/2017 14:01:14 [INFO] covc_encdec: Depth=2, get 192 outputs
09/24/2017 14:01:14 [INFO] covc_encdec: Depth=3, get 279 outputs
09/24/2017 14:01:15 [INFO] covc_encdec: Depth=4, get 392 outputs
09/24/2017 14:01:16 [INFO] covc_encdec: Depth=5, get 479 outputs
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 392, in <module>
    prediction, score = agent.generate_multiple(inputs_unk[None, :], return_all=True)
ValueError: too many values to unpack
The settings are:
config['do_train'] = True
config['mini_mini_batch_length'] = 50000
config['trained_model'] = ''
config['enc_embedd_dim'] = 100
config['enc_hidden_dim'] = 150
config['dec_embedd_dim'] = 100
config['dec_hidden_dim'] = 180
The others are the same as in the original config file.
Oops! Line 392 is a bug, and there are several similar errors. Simply changing return_encoding=True to return_encoding=False in the generate_multiple call should solve it. return_encoding=True means the call also returns the encoding vectors of the output sequence, which I used for visualization.
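Purely as an illustration (the flag may equally be flipped in generate_multiple's default arguments rather than at the call site), the patched call around line 392 would look roughly like this:

prediction, score = agent.generate_multiple(inputs_unk[None, :],
                                            return_all=True,
                                            return_encoding=False)
# with return_encoding=False the call yields only (prediction, score),
# so the two-variable unpacking no longer raises "too many values to unpack"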
Setting config['mini_mini_batch_length'] = 3000 is too small and could cause an infinite loop. I recommend setting it to at least 50000. If you set config['mini_mini_batch_length'] too small, it will also slow down training drastically.
Sorry for my bad code in the training part. Theano often crashes when there are abnormally long documents, so I have to resize the batches according to document length. The current process is: first split the data into mini-batches by config['batch_size'], then split each mini-batch into mini-mini-batches according to config['mini_mini_batch_length'], and finally feed the mini-mini-batches into the optimizer one by one. The batch size now only matters for the progress bar, and I should have removed it.
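To make the regrouping step concrete, here is a rough, simplified sketch of the idea (not the exact code in the repo); it uses len(source) * len(target) as the cost of an example, as described above:

def split_into_mini_mini_batches(batch_s, batch_t, limit):
    # Regroup one mini-batch so that the accumulated len(s) * len(t) cost of
    # each group stays below `limit` (i.e. config['mini_mini_batch_length']).
    group_s, group_t, cost = [], [], 0
    for s, t in zip(batch_s, batch_t):
        pair_cost = len(s) * len(t)
        if group_s and cost + pair_cost > limit:
            yield group_s, group_t
            group_s, group_t, cost = [], [], 0
        group_s.append(s)
        group_t.append(t)
        cost += pair_cost
    if group_s:
        yield group_s, group_t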
I have updated my code. Please check out the latest version. Let me know if you find other problems.
Thanks
Thanks, I am going to try the new one. Here is another result that may need looking into. Yesterday I suspected the error might be in the quick_testing part, so I set config['do_quick_testing'] = False, yet I got this:
09/24/2017 23:33:44 [INFO] generic_utils: 987/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29429s - ETA: 127943s - loss_reg: 9.7971 - ppl.: 396.1523
09/24/2017 23:34:01 [INFO] generic_utils: 988/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29445s - ETA: 127856s - loss_reg: 9.7964 - ppl.: 395.9536
09/24/2017 23:34:43 [INFO] generic_utils: 989/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29487s - ETA: 127878s - loss_reg: 9.7956 - ppl.: 395.7189
09/24/2017 23:35:23 [INFO] generic_utils: 990/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29527s - ETA: 127893s - loss_reg: 9.7954 - ppl.: 398.0612
09/24/2017 23:35:44 [INFO] generic_utils: 991/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29548s - ETA: 127825s - loss_reg: 9.7940 - ppl.: 397.7827
09/24/2017 23:36:25 [INFO] generic_utils: 992/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29589s - ETA: 127845s - loss_reg: 9.7936 - ppl.: 397.5611
09/24/2017 23:36:50 [INFO] generic_utils: 993/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29615s - ETA: 127796s - loss_reg: 9.7929 - ppl.: 397.4056
09/24/2017 23:37:19 [INFO] generic_utils: 994/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29644s - ETA: 127762s - loss_reg: 9.7926 - ppl.: 397.2352
09/24/2017 23:38:57 [INFO] generic_utils: 995/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29741s - ETA: 128022s - loss_reg: 9.7919 - ppl.: 397.0614
09/24/2017 23:39:40 [INFO] generic_utils: 996/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29785s - ETA: 128052s - loss_reg: 9.7913 - ppl.: 396.8329
09/24/2017 23:40:11 [INFO] generic_utils: 997/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29815s - ETA: 128023s - loss_reg: 9.7902 - ppl.: 396.5831
09/24/2017 23:40:29 [INFO] generic_utils: 998/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29834s - ETA: 127946s - loss_reg: 9.7889 - ppl.: 396.3332
09/24/2017 23:41:01 [INFO] generic_utils: 999/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29865s - ETA: 127922s - loss_reg: 9.7884 - ppl.: 396.1288
09/24/2017 23:41:15 [INFO] generic_utils: 1000/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29879s - ETA: 127824s - loss_reg: 9.7875 - ppl.: 395.9283
09/24/2017 23:41:15 [INFO] keyphrase_copynet: Validate @ epoch=1, batch=1000
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 454, in <module>
    data_c = cc_martix(mini_data_s, mini_data_t)
  File "keyphrase_copynet.py", line 111, in cc_martix
    cc = np.zeros((source.shape[0], target.shape[1], source.shape[1]), dtype='float32')
IndexError: tuple index out of range
That's the consequence of a small mini_mini_batch_length. If it's smaller than len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]), the program won't enter the while body, so mini_data_s and mini_data_t end up empty.
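Purely for illustration (this check is not in the repository), a guard like the following would surface the bad setting immediately instead of failing later inside cc_martix:

pair_cost = len(data_s[mini_data_idx]) * len(data_t[mini_data_idx])
if pair_cost > config['mini_mini_batch_length']:
    raise ValueError('mini_mini_batch_length=%d is smaller than a single example '
                     'costing %d; please increase it'
                     % (config['mini_mini_batch_length'], pair_cost))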
But the above is the result of setting config['mini_mini_batch_length'] = 50000. I thought mini_mini_batch_length should be at least larger than voc_size, so I set it to 50000 and ran the test.
I tried the latest code a few times. First, I changed the config file by setting config['do_train'] = True; in addition, to simplify the model (and save time), I set config['bidirectional'] = False and smaller enc/dec embedding/hidden dims, with no other changes. Got this:
09/25/2017 10:57:14 [INFO] covc_encdec: Precision=0.1000, Recall=1.0000, F1=0.1818
**************************************************
09/25/2017 10:57:14 [INFO] keyphrase_copynet: Validate @ epoch=1, batch=1000
100 / 106657
200 / 106657
...
106200 / 106657
106300 / 106657
106400 / 106657
106500 / 106657
106600 / 106657
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 476, in <module>
    mean_ll = np.average([l[0] for l in loss_valid])
  File "/home/cc/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1110, in average
    avg = a.mean(axis)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py", line 70, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
ValueError: operands could not be broadcast together with shapes (5,) (2,)
Then I only changed the config file by setting config['do_train'] = True; everything else was kept the same as the original files. Got this:
09/25/2017 14:32:06 [INFO] keyphrase_copynet: Training minibatch 99/256
09/25/2017 14:32:08 [INFO] keyphrase_copynet: Training minibatch 198/256
09/25/2017 14:32:10 [INFO] keyphrase_copynet: Training minibatch 256/256
09/25/2017 14:32:11 [INFO] generic_utils: 37/10556 [~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 231s - ETA: 65818s - loss_reg: 24.7431 - ppl.: 22108.8071
09/25/2017 14:32:11 [INFO] keyphrase_copynet: Training minibatch 206/331
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 363, in <module>
    loss_batch += [agent.train_(unk_filter(mini_data_s), unk_filter(mini_data_t), data_c)]
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 871, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/link.py", line 314, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_op.py", line 951, in rval
    r = p(n, [x[0] for x in i], o)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_op.py", line 940, in
    self, node)
  File "theano/scan_module/scan_perform.pyx", line 524, in theano.scan_module.scan_perform.perform (/home/cc/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/scan_perform/mod.cpp:5853)
RuntimeError: CudaNdarray_ZEROS: allocation failed.
...
...
TotalSize: 3577892305.0 Byte(s) 3.332 GB
TotalSize inputs: 948240772.0 Byte(s) 0.883 GB
I don't know the real reason for the above result. Guessing it might be because I am using theano-0.8.2, I changed to theano-0.9.0. Got this:
09/25/2017 16:42:37 [INFO] covc_encdec: compiling the compuational graph ::training function::
ERROR (theano.gof.opt): SeqOptimizer apply
09/25/2017 16:43:39 [ERROR] opt: SeqOptimizer apply
ERROR (theano.gof.opt): Traceback:
09/25/2017 16:43:39 [ERROR] opt: Traceback:
ERROR (theano.gof.opt): Traceback (most recent call last):
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 235, in apply
    sub_prof = optimizer.optimize(fgraph)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 87, in optimize
    ret = self.apply(fgraph, *args, **kwargs)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_opt.py", line 685, in apply
    node = self.process_node(fgraph, node)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_opt.py", line 745, in process_node
    node, args)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_opt.py", line 854, in push_out_inner_vars
    add_as_nitsots)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_opt.py", line 906, in add_nitsot_outputs
    reason='scanOp_pushout_output')
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/toolbox.py", line 391, in replace_all_validate_remove
    chk = fgraph.replace_all_validate(replacements, reason)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/toolbox.py", line 365, in replace_all_validate
    fgraph.validate()
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/toolbox.py", line 256, in validate_
    ret = fgraph.execute_callbacks('validate')
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/fg.py", line 589, in execute_callbacks
    fn(self, *args, **kwargs)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/toolbox.py", line 422, in validate
    raise theano.gof.InconsistencyError("Trying to reintroduce a removed node")
InconsistencyError: Trying to reintroduce a removed node
But it kept running, and finally hit the same error as the first one: o(╯□╰)o
106100 / 106657
106200 / 106657
106300 / 106657
106400 / 106657
106500 / 106657
106600 / 106657
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 476, in <module>
    mean_ll = np.average([l[0] for l in loss_valid])
  File "/home/cc/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1110, in average
    avg = a.mean(axis)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py", line 70, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
ValueError: operands could not be broadcast together with shapes (5,) (2,)
Sorry for this long comment.
Hi, please replace lines 476 and 477 with:
mean_ll = np.average(np.concatenate([l[0] for l in loss_batch]))
mean_ppl = np.average(np.concatenate([l[1] for l in loss_batch]))
Sorry about that. I rarely run the validation, so I didn't catch the error. You'd also better reduce the size of the validation data (perhaps 1,000 is enough; it is 20,000 now), or it will consume too much time.
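For example, something along these lines would shrink the validation split before training starts, assuming validation_set is a dict of parallel lists (e.g. 'source' and 'target') like train_set appears to be:

# keep only the first 1,000 validation examples (assumed dict-of-lists structure)
validation_set = {key: values[:1000] for key, values in validation_set.items()}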
I think it should be loss_valid:
mean_ll = np.average(np.concatenate([l[0] for l in loss_valid]))
mean_ppl = np.average(np.concatenate([l[1] for l in loss_valid]))
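A tiny standalone example of why the original line fails and why concatenating first works (just a sketch for context):

import numpy as np

losses = [np.ones(5), np.ones(2)]          # per-batch loss arrays of unequal length
# np.average(losses)                       # older NumPy: the broadcast error above; newer NumPy: ragged-array error
print(np.average(np.concatenate(losses)))  # 1.0 -- flatten first, then average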
Yes, exactly. Does it help?
It runs OK now. I changed the config file like this:
config['bidirectional'] = False
config['enc_embedd_dim'] = 100  # 150
config['enc_hidden_dim'] = 150  # 300
config['dec_embedd_dim'] = 100  # 150
config['dec_hidden_dim'] = 180  # 300
It took 25 hours to train this model; the process was fun though.
start: 09/26/2017 16:56:57
end: 09/27/2017 20:07:18
Thanks.
BUT there are still two problems. Case 1: setting the above back to the original settings. It ends with:
09/28/2017 09:59:16 [INFO] keyphrase_copynet: Training minibatch 256/256
09/28/2017 09:59:17 [INFO] generic_utils: 37/10556 [~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 238s - ETA: 67868s - loss_reg: 24.7432 - ppl.: 22110.4056
09/28/2017 09:59:17 [INFO] keyphrase_copynet: Training minibatch 206/331
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 363, in <module>
    loss_batch += [agent.train_(unk_filter(mini_data_s), unk_filter(mini_data_t), data_c)]
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 871, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/link.py", line 314, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_op.py", line 951, in rval
    r = p(n, [x[0] for x in i], o)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_op.py", line 940, in
    self, node)
  File "theano/scan_module/scan_perform.pyx", line 524, in theano.scan_module.scan_perform.perform (/home/cc/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/scan_perform/mod.cpp:5853)
RuntimeError: CudaNdarray_ZEROS: allocation failed.
Apply node that caused the error: forall_inplace,gpu,grad_of_scan_fn&grad_of_scan_fn}(Elemwise{Composite{minimum(minimum(minimum(i0, i1), i2), i2)}}.0,
...
...
- TensorConstant{1}, Shape: (), ElemSize: 1 Byte(s), TotalSize: 1.0 Byte(s)
- TensorConstant{0}, Shape: (), ElemSize: 1 Byte(s), TotalSize: 1.0 Byte(s)
TotalSize: 3577892305.0 Byte(s) 3.332 GB
TotalSize inputs: 948240772.0 Byte(s) 0.883 GB
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
Case 2: setting config['copynet'] = False. It ends with:
09/26/2017 16:54:31 [INFO] encdec: sampling functions compile done.
09/26/2017 16:54:31 [INFO] keyphrase_copynet: compile ok.
09/26/2017 16:54:31 [INFO] keyphrase_copynet: Epoch = 1 -> Training Set Learning...
09/26/2017 16:54:31 [INFO] keyphrase_copynet: Training minibatch 114/255
09/26/2017 16:54:32 [INFO] keyphrase_copynet: Training minibatch 228/255
09/26/2017 16:54:33 [INFO] keyphrase_copynet: Training minibatch 255/255
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 373, in <module>
    mean_ll = np.average(np.concatenate([l[0] for l in loss_batch]))
ValueError: zero-dimensional arrays cannot be concatenated
The first one may be because you are requesting too much memory. What's the difference between your current settings and the original ones? If the only difference is mini_mini_batch_length, then that would be the cause. For the normal RNN (without copying) I fixed the bug in the loss function. Please check out the latest code.
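Roughly, the problem in that branch is that the per-batch losses come back as 0-d arrays, which NumPy refuses to concatenate. A minimal standalone illustration (a sketch, not necessarily the exact patch):

import numpy as np

scalar_losses = [np.array(1.5), np.array(2.5)]   # 0-d arrays, as in the copynet=False branch
# np.concatenate(scalar_losses)                  # -> zero-dimensional arrays cannot be concatenated
mean_ll = np.average(np.concatenate([np.atleast_1d(x) for x in scalar_losses]))
print(mean_ll)                                   # 2.0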
The first case looks like a memory problem. That result is the outcome of the original config.py settings; I changed nothing. The code runs OK when training a smaller model (the result of the 25-hour run). The changes are listed below. In that successful case, mini_mini_batch_length was still 300000, so I think mini_mini_batch_length may not be my real problem.
config['bidirectional'] = False
config['enc_embedd_dim'] = 100  # 150
config['enc_hidden_dim'] = 150  # 300
config['dec_embedd_dim'] = 100  # 150
config['dec_hidden_dim'] = 180  # 300
BUT it is supposed to be able to handle the original settings, because my environment has 8 GB of memory. I don't know what's wrong.
Sorry, I'm not quite sure what the reason is. You mean you reduced the size of the model, but it causes a memory problem?
Hi, thanks for answering the CopyNet questions in the other issue~ O(∩_∩)O~
I followed the README to train the model and got this:
ValueError: too many values to unpack.
The error is located at line 197:
train_set, validation_set, test_sets, idx2word, word2idx = deserialize_from_file(config['dataset'])
BUT when I set config['copynet'] = False, it gets past this error and hits a MemoryError instead; the details are as follows.
Traceback (most recent call last):
  File "theano/scan_module/scan_perform.pyx", line 397, in theano.scan_module.scan_perform.perform (/home/cc/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.6.2-64/scan_perform/mod.cpp:4490)
MemoryError
I don't know whether it is related to my environment, which is a GTX GeForce 1070 with Nvidia-375, Anaconda2-4.4, Cuda-8, Cudnn-5.1, and Python 2.7; my theano_flags are "device=gpu, floatX=float32, nvcc.fastmath=True, nvcc.flags=-D_FORCE_INLINES". I would be grateful if you could share your environment settings.
BTW, how much time and memory does it take to train the model with all_600k_dataset.pkl? Can I try the training process on a smaller dataset? How can I create a smaller dataset?