**Open** · Tanmay06 opened this issue 5 years ago
Have you checked whether you're using a batch size that is too big to fit in your memory?
No, I haven't changed any batch sizes; by default it should be 1 for inference too, right? And I'm using an Azure Nvidia Tesla M60 GPU with 8 GiB of memory.
I think the model has around 250 million parameters; I doubt 8 GB can handle that along with the data. Please try with 16 GB.
But I was able to train the model on the same GPU without any issues. I'm facing this problem only when I try to run inference on the trained model / last checkpoint.
Please post a link to the inference code you are running. The tensor size `512,10,50,512` seems wrong; it might be a problem with the way you are passing the data. Check the size of the input tensor: it should be 1×512.
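For example, you can see the difference just by printing the shapes of the two reshape calls (a standalone sketch, no model needed):

```python
import numpy as np

ids = np.arange(512)              # a tokenized source sequence of length 512
print(ids.reshape(-1, 1).shape)   # (512, 1) -- 512 "examples" of length 1 (wrong)
print(ids.reshape(1, -1).shape)   # (1, 512) -- one example of length 512 (expected)
```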
This is the inference code that I'm running.
```python
from flask import Flask, request, render_template
import requests, json, os, sys
from collections import OrderedDict
import numpy as np
import tensorflow as tf

app = Flask(__name__)

if 'texar_repo' not in sys.path:
    sys.path += ['texar_repo']

from config import *
from model import *
from preprocess import *

start_tokens = tf.fill([tx.utils.get_batch_size(src_input_ids)], bos_token_id)
predictions = decoder(
    memory=encoder_output, memory_sequence_length=src_input_length,
    decoding_strategy='infer_greedy', beam_width=beam_width, alpha=alpha,
    start_tokens=start_tokens, end_token=eos_token_id,
    max_decoding_length=300, mode=tf.estimator.ModeKeys.PREDICT)
if beam_width <= 1:
    inferred_ids = predictions[0].sample_id
else:
    inferred_ids = predictions['sample_id'][:, :, 0]

tokenizer = tokenization.FullTokenizer(
    vocab_file=os.path.join(bert_pretrain_dir, 'vocab.txt'), do_lower_case=True)
sess = tf.Session()

def infer_single_example(story, actual_summary, tokenizer):
    example = {"src_txt": story, "tgt_txt": actual_summary}
    features = convert_single_example(1, example, max_seq_length_src,
                                      max_seq_length_tgt, tokenizer)
    feed_dict = {
        src_input_ids: np.array(features.src_input_ids).reshape(-1, 1),
        src_segment_ids: np.array(features.src_segment_ids).reshape(-1, 1)
    }
    references, hypotheses = [], []
    fetches = {'inferred_ids': inferred_ids}
    fetches_ = sess.run(fetches, feed_dict=feed_dict)
    labels = np.array(features.tgt_labels).reshape(-1, 1)
    hypotheses.extend(h.tolist() for h in fetches_['inferred_ids'])
    # references.extend(r.tolist() for r in labels)
    hypotheses = utils.list_strip_eos(hypotheses, eos_token_id)
    # references = utils.list_strip_eos(references, eos_token_id)
    hwords = tokenizer.convert_ids_to_tokens(hypotheses[0])
    # rwords = tokenizer.convert_ids_to_tokens(references[0])
    hwords = tx.utils.str_join(hwords).replace(" ##", "")
    # rwords = tx.utils.str_join(rwords).replace(" ##", "")
    # print("Original", rwords)
    print("Generated", hwords)
    return hwords

@app.route("/results", methods=["GET", "POST"])
def results():
    story = request.form['story']
    summary = request.form['summary']
    hwords = infer_single_example(story, summary, tokenizer)
    return hwords

if __name__ == "__main__":
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    sess.run(tf.tables_initializer())
    saver.restore(sess, tf.train.latest_checkpoint(model_dir))
    story = "Story text about 200 tokens"
    summary = "Summary text about 150 tokens"
    # story = input("Enter article:").strip("\n")
    # summary = input("Enter summary:").strip("\n")
    hwords = infer_single_example(story.strip("\n"), summary.strip("\n"), tokenizer)
    print(hwords)
```
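For reference, here is a hypothetical client call for the `/results` route above, assuming the app runs on Flask's default port 5000:

```python
import requests

# Hypothetical smoke test for the /results endpoint defined above.
resp = requests.post("http://localhost:5000/results",
                     data={"story": "Some article text ...",
                           "summary": "A reference summary ..."})
print(resp.text)
```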
And this is the config.py:
```python
import texar as tx

dcoder_config = {
    'dim': 768,
    'num_blocks': 6,
    'multihead_attention': {
        'num_heads': 8,
        'output_dim': 768
    },
    'position_embedder_hparams': {
        'dim': 768
    },
    'initializer': {
        'type': 'variance_scaling_initializer',
        'kwargs': {
            'scale': 1.0,
            'mode': 'fan_avg',
            'distribution': 'uniform',
        },
    },
    'poswise_feedforward': tx.modules.default_transformer_poswise_net_hparams(
        output_dim=768)
}

loss_label_confidence = 0.9

random_seed = 1234
beam_width = 5
alpha = 0.6
hidden_dim = 768

opt = {
    'optimizer': {
        'type': 'AdamOptimizer',
        'kwargs': {'beta1': 0.9, 'beta2': 0.997, 'epsilon': 1e-9}
    }
}

lr = {
    'learning_rate_schedule': 'constant.linear_warmup.rsqrt_decay.rsqrt_depth',
    'lr_constant': 2 * (hidden_dim ** -0.5),
    'static_lr': 1e-3,
    'warmup_steps': 10000,
}

bos_token_id = 101
eos_token_id = 102

model_dir = "./models"
run_mode = "train_and_evaluate"
batch_size = 1
eval_batch_size = 1
test_batch_size = 1

max_train_steps = 100000
display_steps = 1
checkpoint_steps = 500
eval_steps = 50000

max_decoding_length = 400
max_seq_length_src = 512
max_seq_length_tgt = 400

epochs = 10
is_distributed = False

data_dir = r"data/"
train_out_file = r"data/train.tf_record"
eval_out_file = r"data/eval.tf_record"
bert_pretrain_dir = r"./bert_uncased_model"

train_story = r"data/train_story.txt"
train_summ = r"data/train_summ.txt"
eval_story = r"data/eval_story.txt"
eval_summ = r"data/eval_summ.txt"

bert_pretrain_dir = r"../uncased_L-12_H-768_A-12"  # overrides the path set above
```
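As an aside, if I remember Texar's transformer example correctly, the `'constant.linear_warmup.rsqrt_decay.rsqrt_depth'` schedule in this config works out to roughly the following (a sketch, not the library code):

```python
def get_lr(step, lr=lr):
    # Linear warmup to the peak over warmup_steps, then 1/sqrt(step) decay.
    # The rsqrt_depth factor is already folded into
    # lr_constant = 2 * hidden_dim ** -0.5.
    return (lr['lr_constant']
            * min(1.0, step / lr['warmup_steps'])
            * max(step, lr['warmup_steps']) ** -0.5)
```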
Can you change this in `infer_single_example`: `src_input_ids: np.array(features.src_input_ids).reshape(1, -1)`, `src_segment_ids: np.array(features.src_segment_ids).reshape(1, -1)`
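That is, only the reshape direction in the feed dict changes, so each input gets shape `(1, 512)` instead of `(512, 1)`:

```python
feed_dict = {
    src_input_ids: np.array(features.src_input_ids).reshape(1, -1),      # (1, max_seq_length_src)
    src_segment_ids: np.array(features.src_segment_ids).reshape(1, -1),  # (1, max_seq_length_src)
}
```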
Hello, I'm running into the same problem. How did you solve it, @Tanmay06?
Hi, I was actually away working on a different project. @Simons2017, try @santhoshkolloju's suggestion just above your comment; I think it should work.
@santhoshkolloju, hi, where do I make this change? `src_input_ids: np.array(features.src_input_ids).reshape(1, -1)`, `src_segment_ids: np.array(features.src_segment_ids).reshape(1, -1)`
@yuyanzhoufang change these lines in the original code: https://github.com/santhoshkolloju/Abstractive-Summarization-With-Transfer-Learning/blob/97ff2ae3ba9f2d478e174444c4e0f5349f28c319/Inference.py#L56-L57
Hi, in the class `CNNDailymail` there is `if set_type == "test" and i == 0: continue`. Why is the example at `i == 0` filtered out for the test set?
I've partially trained the model, but when I tested it by running Inference.py with a static story and summary in the script, I got an insufficient-memory error from TensorFlow:
```
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[512,10,50,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
```
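For scale, a single float32 tensor of that shape is already about half a gigabyte, which fits the leading dimension being a batch of 512 instead of 1:

```python
# Rough size of one float32 tensor of shape [512, 10, 50, 512]:
n = 512 * 10 * 50 * 512      # 131,072,000 elements
print(n * 4 / 2**20, "MiB")  # -> 500.0 MiB
```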