lovecambi / qebrain

machine translation and quality estimation
BSD 2-Clause "Simplified" License

ResourceExhaustedError on qe_train #3

Closed dimitarsh1 closed 5 years ago

dimitarsh1 commented 5 years ago

Hello,

I have trained several expert models (with different values for the hyperparameters), and while that succeeds in each case, I keep running into problems when training the qe model.

I have trained models with vocabularies of 120k, 75k, and 50k, and with different batch sizes (the smallest being 20). I always get the following error:


  load pretrained expert weights for infer model, time 3.99s
# External evaluation, global step 0
  done, num sentences 7525, time 63s
  pearson dev: 0.1303
  saving hparams to ./saved_qe_model_en_de/hparams
# External evaluation, global step 0

...

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[28220,50000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node transformerpredictor/decoder/output_projection/Tensordot/MatMul (defined at qe_model.py:1272)  = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformerpredictor/decoder/output_projection/Tensordot/Reshape, transformerpredictor/decoder/output_projection/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[{{node transformerpredictor/estimator/bidirectional_rnn/bw/bw/All/_589}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2315_transformerpredictor/estimator/bidirectional_rnn/bw/bw/All", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

And it doesn't matter whether I am using a TitanX or a 1080Ti.
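
For scale, a rough back-of-the-envelope estimate of that single [28220, 50000] float32 tensor from the traceback (only the MatMul output, ignoring the kernel, activations, and gradients, so real usage is higher):

    # Rough size of the OOM'd tensor: shape [28220, 50000], 4 bytes per float32 element.
    rows, cols = 28220, 50000
    gib = rows * cols * 4 / 1024**3
    print(f"{gib:.2f} GiB")  # ~5.26 GiB for this one tensor alone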

Any idea?

Thanks. Dimitar

dimitarsh1 commented 5 years ago

This happens only if I invoke run_external_eval on the test set. If I don't specify a test set, it works until the averaging point (see the newly reported issue).

Cheers, Dimitar

lovecambi commented 5 years ago

Which test set are you using? Can you check the maximum length of the sentences in your test set?
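
A quick way to check is something like this (a minimal sketch; test.src is a placeholder for your test file, and whitespace tokenization is assumed):

    # Length (in whitespace tokens) of the longest sentence in the test file.
    # "test.src" is a placeholder path, not a file from this repo.
    with open("test.src", encoding="utf-8") as f:
        max_len = max(len(line.split()) for line in f)
    print("max sentence length:", max_len)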

dimitarsh1 commented 5 years ago

Hi.

I managed to run training and inference without a problem on a GTX 1080Ti (11 GB) when I decreased the qe_batch_size and the inference_batch_size to 10 and 1 respectively. The thing is that during inference I could not set the batch size at all, neither via qe_batch_size nor via infer_batch_size, so I retrained the qe model. But once that was done, the inference went smoothly.

Thanks for looking into this. I guess it should be possible to set the batch size during inference, which in my case didn't work.

Cheers, Dimitar

lovecambi commented 5 years ago

hparams has type tf.contrib.training.HParams. If you want to set a smaller infer batch size without retraining the model, you can add

setattr(hparams, "infer_batch_size", 1)

after https://github.com/lovecambi/qebrain/blob/771519de047279ea25d76b056ed64b78da6cc7c3/qe_model.py#L2170
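
Equivalently, since hparams is a tf.contrib.training.HParams object, its own setter can be used for a key that already exists (just a sketch of the same idea):

    # Same effect as the setattr above: override the existing hparam in place.
    hparams.set_hparam("infer_batch_size", 1)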

lovecambi commented 5 years ago

I think the problem has been solved.