memray / seq2seq-keyphrase


training problem #5

Closed CQRachel closed 7 years ago

CQRachel commented 7 years ago

Hi, thanks for answering the CopyNet questions in the other issue~ O(∩_∩)O~

I followed the README to train the model and hit this ValueError: too many values to unpack. The error points to line 197:

train_set, validation_set, test_sets, idx2word, word2idx = deserialize_from_file(config['dataset'])

BUT when I set config['copynet']=False, it gets past this error and runs into a MemoryError instead. The details are as follows:

Traceback (most recent call last):
  File "theano/scan_module/scan_perform.pyx", line 397, in theano.scan_module.scan_perform.perform (/home/cc/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-3.6.2-64/scan_perform/mod.cpp:4490)
MemoryError

I don't know whether it is related to my environment, which is "GTX GeForce 1070 with NVIDIA driver 375, Anaconda2-4.4, CUDA 8, cuDNN 5.1, Python 2.7"; my THEANO_FLAGS are "device=gpu, floatX=float32, nvcc.fastmath=True, nvcc.flags=-D_FORCE_INLINES". I would be grateful if you could share your environment settings.

BTW, how much time and memory will it take to train the model with all_600k_dataset.pkl? Can I try the training process with another smaller dataset? How can I create a smaller dataset?

memray commented 7 years ago

This error means the number of variables doesn't match. I tested line 197 on my side and it worked fine, which is pretty weird. It should load this file (dataset/keyphrase/punctuation-20000validation-20000testing/all_600k_dataset.pkl) and unpack 5 variables. Could you set a breakpoint there to see how many variables are returned by deserialize_from_file(config['dataset'])?
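For example, a quick check could look like this (just a sketch; it assumes you run it where config and deserialize_from_file are already in scope, e.g. right before line 197 in keyphrase_copynet.py):

loaded = deserialize_from_file(config['dataset'])
print(type(loaded), len(loaded))  # expect a tuple/list of length 5
if len(loaded) == 5:
    train_set, validation_set, test_sets, idx2word, word2idx = loaded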

If it consumes too much memory and crashes, I recommend setting a smaller config['mini_mini_batch_length']. I resize each batch according to the text length. It's a stupid trick, but it works pretty well for me.

My graphics card is a GTX GeForce 980 Ti with 6 GB of memory. Your other settings look similar to mine.

CQRachel commented 7 years ago

deserialize_from_file(config['dataset']) does return 5 variables, BUT the run still shuts down partway through. It's really weird.

193/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/23/2017 17:56:27 [INFO] generic_utils:  193/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3240s - ETA: 85380s - loss_reg: 12.9687 - ppl.: 1834.6243
 194/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/23/2017 17:56:42 [INFO] generic_utils:  194/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3255s - ETA: 85315s - loss_reg: 12.9582 - ppl.: 1826.3660
 195/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/23/2017 17:57:03 [INFO] generic_utils:  195/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3276s - ETA: 85413s - loss_reg: 12.9530 - ppl.: 1819.6269
 196/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/23/2017 17:57:17 [INFO] generic_utils:  196/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3290s - ETA: 85329s - loss_reg: 12.9422 - ppl.: 1811.8850
 197/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/23/2017 17:57:35 [INFO] generic_utils:  197/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3309s - ETA: 85346s - loss_reg: 12.9285 - ppl.: 1803.5442
 198/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/23/2017 17:57:47 [INFO] generic_utils:  198/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3320s - ETA: 85201s - loss_reg: 12.9171 - ppl.: 1795.5123
 199/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/23/2017 17:58:03 [INFO] generic_utils:  199/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3336s - ETA: 85163s - loss_reg: 12.9053 - ppl.: 1787.5648
 200/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/23/2017 17:58:18 [INFO] generic_utils:  200/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 3351s - ETA: 85092s - loss_reg: 12.8951 - ppl.: 1779.8260
09/23/2017 17:58:18 [INFO] keyphrase_copynet: Echo=200 Evaluation Sampling.
09/23/2017 17:58:18 [INFO] keyphrase_copynet: generating [training set] samples
09/23/2017 17:58:18 [INFO] covc_encdec:      Depth=0, get 1 outputs
09/23/2017 17:58:19 [INFO] covc_encdec:      Depth=1, get 54 outputs
09/23/2017 17:58:20 [INFO] covc_encdec:      Depth=2, get 201 outputs
09/23/2017 17:58:21 [INFO] covc_encdec:      Depth=3, get 254 outputs
09/23/2017 17:58:22 [INFO] covc_encdec:      Depth=4, get 401 outputs
09/23/2017 17:58:22 [INFO] covc_encdec:      Depth=5, get 454 outputs
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 212, in 
    logger.info('#(training paper)=%d' % len(train_set['source']))
ValueError: too many values to unpack

I tried config['mini_mini_batch_length']=3000, but it seems like it can't be this small. I get this:

09/24/2017 11:54:25 [INFO] keyphrase_copynet: compile ok.
09/24/2017 11:54:25 [INFO] keyphrase_copynet: 
Epoch = 1 -> Training Set Learning...
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 355, in 
    data_c = cc_martix(mini_data_s, mini_data_t)
  File "keyphrase_copynet.py", line 111, in cc_martix
    cc = np.zeros((source.shape[0], target.shape[1], source.shape[1]), dtype='float32')
IndexError: tuple index out of range

Still, I want to simplify the enc-dec model. I know I can try setting config['enc_embedd_dim'] etc. to a small number. Is 10 OK? Right now I just want a quick training run.

CQRachel commented 7 years ago

Maybe the training process didn't stop at deserialize_from_file(config['dataset']) but at another line? Below is the latest attempt.

] - Run-time: 5957s - ETA: 158664s - loss_reg: 11.2822 - ppl.: 1071.5933
 192/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 13:57:08 [INFO] generic_utils:  192/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 5993s - ETA: 158761s - loss_reg: 11.2724 - ppl.: 1067.2454
 193/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 13:57:23 [INFO] generic_utils:  193/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6008s - ETA: 158307s - loss_reg: 11.2619 - ppl.: 1062.7955
 194/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 13:57:42 [INFO] generic_utils:  194/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6028s - ETA: 157977s - loss_reg: 11.2542 - ppl.: 1058.3417
 195/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 13:58:33 [INFO] generic_utils:  195/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6079s - ETA: 158460s - loss_reg: 11.2447 - ppl.: 1054.0008
 196/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 13:59:49 [INFO] generic_utils:  196/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6154s - ETA: 159575s - loss_reg: 11.2369 - ppl.: 1049.6457
 197/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 14:00:07 [INFO] generic_utils:  197/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6172s - ETA: 159205s - loss_reg: 11.2290 - ppl.: 1045.4208
 198/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 14:00:24 [INFO] generic_utils:  198/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6189s - ETA: 158801s - loss_reg: 11.2233 - ppl.: 1041.6077
 199/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 14:00:52 [INFO] generic_utils:  199/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6218s - ETA: 158704s - loss_reg: 11.2162 - ppl.: 1037.3962
 200/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 14:01:12 [INFO] generic_utils:  200/5278 [(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 6237s - ETA: 158368s - loss_reg: 11.2138 - ppl.: 1035.6492
09/24/2017 14:01:12 [INFO] keyphrase_copynet: Echo=200 Evaluation Sampling.
09/24/2017 14:01:12 [INFO] keyphrase_copynet: generating [training set] samples
09/24/2017 14:01:12 [INFO] covc_encdec:      Depth=0, get 0 outputs
09/24/2017 14:01:13 [INFO] covc_encdec:      Depth=1, get 79 outputs
09/24/2017 14:01:14 [INFO] covc_encdec:      Depth=2, get 192 outputs
09/24/2017 14:01:14 [INFO] covc_encdec:      Depth=3, get 279 outputs
09/24/2017 14:01:15 [INFO] covc_encdec:      Depth=4, get 392 outputs
09/24/2017 14:01:16 [INFO] covc_encdec:      Depth=5, get 479 outputs
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 392, in 
    prediction, score = agent.generate_multiple(inputs_unk[None, :], return_all=True)
ValueError: too many values to unpack

The settings are:

config['do_train'] = True
config['mini_mini_batch_length'] = 50000
config['trained_model'] = ''
config['enc_embedd_dim'] = 100
config['enc_hidden_dim'] = 150
config['dec_embedd_dim'] = 100
config['dec_hidden_dim'] = 180

The others are kept the same as in the original config file.

memray commented 7 years ago

Oops! Line 392 is a bug, and there are several similar errors.

Simply changing return_encoding=True to return_encoding=False in generate_multiple should solve it. return_encoding=True means the function also returns the vectors of the output sequence, which I used for visualization.
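To illustrate the unpacking issue (this is a made-up stand-in, not the real generate_multiple):

def generate_multiple_sketch(inputs, return_all=True, return_encoding=False):
    prediction, score, encoding = ['keyphrase'], [0.1], [0.0]
    if return_encoding:
        return prediction, score, encoding  # 3 values -> "too many values to unpack" at the caller
    return prediction, score                # 2 values -> matches "prediction, score = ..."

prediction, score = generate_multiple_sketch('some input')  # works once return_encoding is False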

Setting config['mini_mini_batch_length']=3000 is too small and could cause an infinite loop. I recommend setting it to at least 50000. A value that is too small will also slow down training drastically.

Sorry for my bad code in the training part. Theano often crashes when there are some abnormally long documents, so I have to resize each batch according to the document length. The current process is: first split the data into mini-batches by config['batch_size'], then split each mini-batch into mini-mini-batches according to config['mini_mini_batch_length'], and finally feed the mini-mini-batches into the optimizer one by one. The batch size only matters for the progress bar now; I should have removed it.
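Roughly, the logic looks like this (a simplified sketch of what I described, not the exact code; the real loop also builds the padded arrays and the copy matrix):

def iterate_mini_mini_batches(sources, targets, batch_size, mini_mini_batch_length):
    # level 1: fixed-count mini-batches (only affects the progress bar now)
    for start in range(0, len(sources), batch_size):
        batch_s = sources[start:start + batch_size]
        batch_t = targets[start:start + batch_size]
        # level 2: regroup by length so no chunk exceeds mini_mini_batch_length
        cur_s, cur_t, cur_size = [], [], 0
        for s, t in zip(batch_s, batch_t):
            if cur_s and cur_size + len(s) * len(t) > mini_mini_batch_length:
                yield cur_s, cur_t          # one mini-mini-batch goes to the optimizer
                cur_s, cur_t, cur_size = [], [], 0
            cur_s.append(s)
            cur_t.append(t)
            cur_size += len(s) * len(t)
        if cur_s:
            yield cur_s, cur_t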

I have updated my code. Please check out the latest version. Let me know if you find other problems.

Thanks

CQRachel commented 7 years ago

Thanks, I am going to try the new version. Here is another result that may need looking into. Yesterday I figured the error might be in the quick_testing part, so I set config['do_quick_testing'] = False, yet I got this:

 987/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:33:44 [INFO] generic_utils:  987/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29429s - ETA: 127943s - loss_reg: 9.7971 - ppl.: 396.1523
 988/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:34:01 [INFO] generic_utils:  988/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29445s - ETA: 127856s - loss_reg: 9.7964 - ppl.: 395.9536
 989/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:34:43 [INFO] generic_utils:  989/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29487s - ETA: 127878s - loss_reg: 9.7956 - ppl.: 395.7189
 990/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:35:23 [INFO] generic_utils:  990/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29527s - ETA: 127893s - loss_reg: 9.7954 - ppl.: 398.0612
 991/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:35:44 [INFO] generic_utils:  991/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29548s - ETA: 127825s - loss_reg: 9.7940 - ppl.: 397.7827
 992/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:36:25 [INFO] generic_utils:  992/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29589s - ETA: 127845s - loss_reg: 9.7936 - ppl.: 397.5611
 993/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:36:50 [INFO] generic_utils:  993/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29615s - ETA: 127796s - loss_reg: 9.7929 - ppl.: 397.4056
 994/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:37:19 [INFO] generic_utils:  994/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29644s - ETA: 127762s - loss_reg: 9.7926 - ppl.: 397.2352
 995/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:38:57 [INFO] generic_utils:  995/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29741s - ETA: 128022s - loss_reg: 9.7919 - ppl.: 397.0614
 996/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:39:40 [INFO] generic_utils:  996/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29785s - ETA: 128052s - loss_reg: 9.7913 - ppl.: 396.8329
 997/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:40:11 [INFO] generic_utils:  997/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29815s - ETA: 128023s - loss_reg: 9.7902 - ppl.: 396.5831
 998/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:40:29 [INFO] generic_utils:  998/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29834s - ETA: 127946s - loss_reg: 9.7889 - ppl.: 396.3332
 999/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:41:01 [INFO] generic_utils:  999/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29865s - ETA: 127922s - loss_reg: 9.7884 - ppl.: 396.1288
1000/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~]09/24/2017 23:41:15 [INFO] generic_utils: 1000/5278 [....(-w-)~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 29879s - ETA: 127824s - loss_reg: 9.7875 - ppl.: 395.9283
09/24/2017 23:41:15 [INFO] keyphrase_copynet: Validate @ epoch=1, batch=1000
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 454, in 
    data_c = cc_martix(mini_data_s, mini_data_t)
  File "keyphrase_copynet.py", line 111, in cc_martix
    cc = np.zeros((source.shape[0], target.shape[1], source.shape[1]), dtype='float32')
IndexError: tuple index out of range

memray commented 7 years ago

That's the consequence of a small mini_mini_batch_length. If it's smaller than len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]), the program never enters the while body, so mini_data_s and mini_data_t stay empty.
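You can reproduce the resulting IndexError with empty arrays (a synthetic example, not the project's data):

import numpy as np

mini_data_s = np.asarray([])  # empty because the while body never ran
mini_data_t = np.asarray([])
try:
    # same shape access as in cc_martix: a 1-d empty array has no shape[1]
    cc = np.zeros((mini_data_s.shape[0], mini_data_t.shape[1], mini_data_s.shape[1]), dtype='float32')
except IndexError as e:
    print(e)  # tuple index out of range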

CQRachel commented 7 years ago

But the above is the result of setting config['mini_mini_batch_length'] = 50000. I thought mini_mini_batch_length should be at least larger than voc_size, so I set it to 50000 and ran the test.

CQRachel commented 7 years ago

I tried the latest code a few times. First, I changed the config file to set config['do_train']=True and, to simplify the model (trying to save time), I set config['bidirectional']=False and smaller enc/dec embedding/hidden dims, with no other changes. I got this:

09/25/2017 10:57:14 [INFO] covc_encdec: Precision=0.1000, Recall=1.0000, F1=0.1818

**************************************************
09/25/2017 10:57:14 [INFO] keyphrase_copynet: Validate @ epoch=1, batch=1000
     100 / 106657
     200 / 106657
...
     106200 / 106657
     106300 / 106657
     106400 / 106657
     106500 / 106657
     106600 / 106657
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 476, in 
    mean_ll = np.average([l[0] for l in loss_valid])
  File "/home/cc/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1110, in average
    avg = a.mean(axis)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py", line 70, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
ValueError: operands could not be broadcast together with shapes (5,) (2,) 

Then I changed only config['do_train']=True in the config file and kept everything else the same as the original. I got this:

09/25/2017 14:32:06 [INFO] keyphrase_copynet: Training minibatch 99/256
09/25/2017 14:32:08 [INFO] keyphrase_copynet: Training minibatch 198/256
09/25/2017 14:32:10 [INFO] keyphrase_copynet: Training minibatch 256/256
   37/10556 [~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/25/2017 14:32:11 [INFO] generic_utils:    37/10556 [~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 231s - ETA: 65818s - loss_reg: 24.7431 - ppl.: 22108.8071
09/25/2017 14:32:11 [INFO] keyphrase_copynet: Training minibatch 206/331
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 363, in 
    loss_batch += [agent.train_(unk_filter(mini_data_s), unk_filter(mini_data_t), data_c)]
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 871, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/link.py", line 314, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_op.py", line 951, in rval
    r = p(n, [x[0] for x in i], o)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_op.py", line 940, in 
    self, node)
  File "theano/scan_module/scan_perform.pyx", line 524, in theano.scan_module.scan_perform.perform (/home/cc/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/scan_perform/mod.cpp:5853)
RuntimeError: CudaNdarray_ZEROS: allocation failed.
...
...
 TotalSize: 3577892305.0 Byte(s) 3.332 GB
 TotalSize inputs: 948240772.0 Byte(s) 0.883 GB

I don't know the real reason behind the result above. Guessing it might be because I was using Theano 0.8.2, I switched to Theano 0.9.0 and got this:

09/25/2017 16:42:37 [INFO] covc_encdec: compiling the compuational graph ::training function::
ERROR (theano.gof.opt): SeqOptimizer apply 
09/25/2017 16:43:39 [ERROR] opt: SeqOptimizer apply 
ERROR (theano.gof.opt): Traceback:
09/25/2017 16:43:39 [ERROR] opt: Traceback:
ERROR (theano.gof.opt): Traceback (most recent call last):
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 235, in apply
    sub_prof = optimizer.optimize(fgraph)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 87, in optimize
    ret = self.apply(fgraph, *args, **kwargs)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_opt.py", line 685, in apply
    node = self.process_node(fgraph, node)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_opt.py", line 745, in process_node
    node, args)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_opt.py", line 854, in push_out_inner_vars
    add_as_nitsots)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_opt.py", line 906, in add_nitsot_outputs
    reason='scanOp_pushout_output')
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/toolbox.py", line 391, in replace_all_validate_remove
    chk = fgraph.replace_all_validate(replacements, reason)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/toolbox.py", line 365, in replace_all_validate
    fgraph.validate()
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/toolbox.py", line 256, in validate_
    ret = fgraph.execute_callbacks('validate')
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/fg.py", line 589, in execute_callbacks
    fn(self, *args, **kwargs)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/toolbox.py", line 422, in validate
    raise theano.gof.InconsistencyError("Trying to reintroduce a removed node")
InconsistencyError: Trying to reintroduce a removed node

But it kept running, and in the end it hit the same error as the first one: o(╯□╰)o

106100 / 106657
     106200 / 106657
     106300 / 106657
     106400 / 106657
     106500 / 106657
     106600 / 106657
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 476, in 
    mean_ll = np.average([l[0] for l in loss_valid])
  File "/home/cc/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1110, in average
    avg = a.mean(axis)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py", line 70, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
ValueError: operands could not be broadcast together with shapes (5,) (2,) 

Sorry for the long comment.

memray commented 7 years ago

Hi, please replace lines 476 and 477 with:

mean_ll = np.average(np.concatenate([l[0] for l in loss_batch]))
mean_ppl = np.average(np.concatenate([l[1] for l in loss_batch]))

Sorry about that. I rarely run the validation, so I didn't catch this error. Also, you'd better reduce the size of the validation data (perhaps 1,000 is enough; it's 20,000 now), or it will take too much time.

CQRachel commented 7 years ago

I think it should be loss_valid:

mean_ll = np.average(np.concatenate([l[0] for l in loss_valid]))
mean_ppl = np.average(np.concatenate([l[1] for l in loss_valid]))

memray commented 7 years ago

Yes, exactly. Does it help?

CQRachel commented 7 years ago

It runs OK now. I changed the config file like this:

config['bidirectional']   = False
config['enc_embedd_dim']  = 100#150    
config['enc_hidden_dim']  = 150#300
config['dec_embedd_dim']  = 100#150 
config['dec_hidden_dim']  = 180#300

It took 25 hours to train this model; the process was fun though. Start: 09/26/2017 16:56:57, end: 09/27/2017 20:07:18. Thanks.

BUT there are still two problems. Case 1: setting the above back to the original settings. It ends with:

09/28/2017 09:59:16 [INFO] keyphrase_copynet: Training minibatch 256/256
   37/10556 [~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]09/28/2017 09:59:17 [INFO] generic_utils:    37/10556 [~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] - Run-time: 238s - ETA: 67868s - loss_reg: 24.7432 - ppl.: 22110.4056
09/28/2017 09:59:17 [INFO] keyphrase_copynet: Training minibatch 206/331
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 363, in 
    loss_batch += [agent.train_(unk_filter(mini_data_s), unk_filter(mini_data_t), data_c)]
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 871, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/gof/link.py", line 314, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_op.py", line 951, in rval
    r = p(n, [x[0] for x in i], o)
  File "/home/cc/anaconda2/lib/python2.7/site-packages/theano/scan_module/scan_op.py", line 940, in 
    self, node)
  File "theano/scan_module/scan_perform.pyx", line 524, in theano.scan_module.scan_perform.perform (/home/cc/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/scan_perform/mod.cpp:5853)
RuntimeError: CudaNdarray_ZEROS: allocation failed.
Apply node that caused the error: forall_inplace,gpu,grad_of_scan_fn&grad_of_scan_fn}(Elemwise{Composite{minimum(minimum(minimum(i0, i1), i2), i2)}}.0, ...
...
 - TensorConstant{1}, Shape: (), ElemSize: 1 Byte(s), TotalSize: 1.0 Byte(s)
 - TensorConstant{0}, Shape: (), ElemSize: 1 Byte(s), TotalSize: 1.0 Byte(s)
 TotalSize: 3577892305.0 Byte(s) 3.332 GB
 TotalSize inputs: 948240772.0 Byte(s) 0.883 GB

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.

Case 2: setting config['copynet'] = False. It ends with:

09/26/2017 16:54:31 [INFO] encdec: sampling functions compile done.
09/26/2017 16:54:31 [INFO] keyphrase_copynet: compile ok.
09/26/2017 16:54:31 [INFO] keyphrase_copynet: 
Epoch = 1 -> Training Set Learning...
09/26/2017 16:54:31 [INFO] keyphrase_copynet: Training minibatch 114/255
09/26/2017 16:54:32 [INFO] keyphrase_copynet: Training minibatch 228/255
09/26/2017 16:54:33 [INFO] keyphrase_copynet: Training minibatch 255/255
Traceback (most recent call last):
  File "keyphrase_copynet.py", line 373, in 
    mean_ll  = np.average(np.concatenate([l[0] for l in loss_batch]))
ValueError: zero-dimensional arrays cannot be concatenated

memray commented 7 years ago

The first one may be because you are requesting too much memory. What's the difference between your current settings and the original ones? If the only difference is mini_mini_batch_length, then that would be the cause. For the normal RNN (without copying) I fixed the bug in the loss function; please check out the latest code.
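For reference, the "zero-dimensional arrays cannot be concatenated" error comes from concatenating scalar losses; a synthetic example (made-up numbers, not the real training output):

import numpy as np

loss_batch = [(np.float32(9.79), np.float32(395.9))]  # per-batch scalars, i.e. 0-d arrays
try:
    np.concatenate([l[0] for l in loss_batch])
except ValueError as e:
    print(e)  # zero-dimensional arrays cannot be concatenated
# averaging the scalars directly (or wrapping them with np.atleast_1d) avoids it
print(np.average([float(l[0]) for l in loss_batch]))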

CQRachel commented 7 years ago

The first case does look like a memory problem, but that result is the outcome of the original 'config.py' settings; I changed nothing. The code runs OK when training a smaller model (the 25-hour result above), with the changes shown below. In that successful case, mini_mini_batch_length is still 300000, so I don't think mini_mini_batch_length is my real problem.

config['bidirectional']   = False
config['enc_embedd_dim']  = 100#150    
config['enc_hidden_dim']  = 150#300
config['dec_embedd_dim']  = 100#150 
config['dec_hidden_dim']  = 180#300

BUT it is supposed to be able to handle the original settings, because my GTX 1070 has 8 GB of memory. I don't know what's wrong.

memray commented 7 years ago

Sorry, I'm not quite sure what the reason is. You mean you reduced the size of the model but it still causes a memory problem?