I have provided the code to generate the TFRecord file.
Do we need to create an empty file gs://bert_summ/train.tf_record first and then call the function 'file_based_convert_examples_to_features'?
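Note that you should not need to create an empty object first: in the standard BERT-style implementation, file_based_convert_examples_to_features opens a tf.python_io.TFRecordWriter on the output path, and TFRecordWriter creates the file (local or gs://) if it does not exist. A minimal sketch of that writing pattern, with illustrative feature names that are not necessarily the ones this repo uses:

import collections
import tensorflow as tf

def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

def write_tf_records(features_list, output_file):
    # TFRecordWriter creates output_file if it does not already exist,
    # including gs:// objects when TF is built with GCS support.
    writer = tf.python_io.TFRecordWriter(output_file)
    for feat in features_list:
        record = collections.OrderedDict()
        record["src_input_ids"] = _int64_feature(feat["src_input_ids"])
        record["tgt_input_ids"] = _int64_feature(feat["tgt_input_ids"])
        example = tf.train.Example(features=tf.train.Features(feature=record))
        writer.write(example.SerializeToString())
    writer.close()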
Is this the right way?
def get_dataset(processor,
                tokenizer,
                data_dir,
                max_seq_length_src,
                max_seq_length_tgt,
                batch_size,
                mode,
                output_dir,
                is_distributed=False):
    """
    Args:
        processor: Data preprocessor; must have get_labels and
            get_train/dev/test_examples methods defined.
        tokenizer: The sentence tokenizer; generally should be a
            SentencePiece model.
        data_dir: The input data directory.
        max_seq_length_src: Max source sequence length.
        max_seq_length_tgt: Max target sequence length.
        batch_size: Mini-batch size.
        mode: `train`, `eval` or `test`.
        output_dir: The directory to save the TFRecords in.
    """
    # label_list = processor.get_labels()
    if mode == 'train':
        train_examples = processor.get_train_examples(data_dir)
        # train_file = os.path.join(output_dir, "train.tf_record")
        train_file = "gs://bert_summarization/train.tf_record"
        file_based_convert_examples_to_features(
            train_examples, max_seq_length_src, max_seq_length_tgt,
            tokenizer, train_file)
        dataset = file_based_input_fn_builder(
            input_file=train_file,
            max_seq_length_src=max_seq_length_src,
            max_seq_length_tgt=max_seq_length_tgt,
            is_training=True,
            drop_remainder=True,
            is_distributed=is_distributed)({'batch_size': batch_size})
    elif mode == 'eval':
        eval_examples = processor.get_dev_examples(data_dir)
        # eval_file = os.path.join(output_dir, "eval.tf_record")
        eval_file = "gs://bert_summarization/eval.tf_record"
        file_based_convert_examples_to_features(
            eval_examples, max_seq_length_src, max_seq_length_tgt,
            tokenizer, eval_file)
        dataset = file_based_input_fn_builder(
            input_file=eval_file,
            max_seq_length_src=max_seq_length_src,
            max_seq_length_tgt=max_seq_length_tgt,
            is_training=False,
            drop_remainder=True,
            is_distributed=is_distributed)({'batch_size': batch_size})
    elif mode == 'test':
        test_examples = processor.get_test_examples(data_dir)
        # test_file = os.path.join(output_dir, "predict.tf_record")
        test_file = "gs://bert_summarization/predict.tf_record"
        file_based_convert_examples_to_features(
            test_examples, max_seq_length_src, max_seq_length_tgt,
            tokenizer, test_file)
        dataset = file_based_input_fn_builder(
            input_file=test_file,
            max_seq_length_src=max_seq_length_src,
            max_seq_length_tgt=max_seq_length_tgt,
            is_training=False,
            drop_remainder=True,
            is_distributed=is_distributed)({'batch_size': batch_size})
    return dataset
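For context, here is roughly how the function above would be invoked. This is only a sketch; CNNDailyMailProcessor and the tokenizer setup are placeholders for whatever processor and vocab your pipeline actually defines:

# Hypothetical usage of get_dataset; the processor and tokenizer
# names below are placeholders, not the repo's actual classes.
processor = CNNDailyMailProcessor()
tokenizer = tokenization.FullTokenizer(
    vocab_file="vocab.txt", do_lower_case=True)

train_dataset = get_dataset(
    processor=processor,
    tokenizer=tokenizer,
    data_dir="data/",
    max_seq_length_src=512,
    max_seq_length_tgt=100,
    batch_size=4,
    mode='train',
    output_dir="gs://bert_summarization/")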
Give me a few days; I will write a detailed post on how you can run this on your own data.
That will be awesome. I have gone ahead and started training with the code modification mentioned above. I would love to read your post and learn more about the details.
Can you provide the train.tf_record file?
NotFoundError: Error executing an HTTP request: HTTP response code 404 with body
'{ "error": { "errors": [ { "domain": "global", "reason": "notFound",
"message": "No such object: bert_summarization/train.tf_record" } ],
"code": 404, "message": "No such object: bert_summarization/train.tf_record" } }'
when reading metadata of gs://bert_summarization/train.tf_record
[[node IteratorGetNext_1 (defined at texar_repo/texar/data/data/data_iterators.py:401) ]]
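The 404 means the train.tf_record object was never written to the bucket, i.e. the iterator tried to read the file before the conversion step produced it. A minimal guard, assuming the same processor, tokenizer, and helper functions as in the snippet above:

import tensorflow as tf

train_file = "gs://bert_summarization/train.tf_record"

# Generate the TFRecord first if it is not already in the bucket; the
# NotFoundError above comes from reading a file that was never written.
if not tf.gfile.Exists(train_file):
    train_examples = processor.get_train_examples(data_dir)
    file_based_convert_examples_to_features(
        train_examples, max_seq_length_src, max_seq_length_tgt,
        tokenizer, train_file)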