santhoshkolloju / Abstractive-Summarization-With-Transfer-Learning

Abstractive summarisation using Bert as encoder and Transformer Decoder

train.tf_record not found #2

Closed · callzhang closed this issue 5 years ago

callzhang commented 5 years ago

Can you provide the train.tf_record file? Training fails with:

NotFoundError: Error executing an HTTP request: HTTP response code 404 with body
'{ "error": { "errors": [ { "domain": "global", "reason": "notFound",
"message": "No such object: bert_summarization/train.tf_record" } ],
"code": 404, "message": "No such object: bert_summarization/train.tf_record" } }'
when reading metadata of gs://bert_summarization/train.tf_record
[[node IteratorGetNext_1 (defined at texar_repo/texar/data/data/data_iterators.py:401)]]
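
A quick sanity check before starting training is to verify that the object actually exists at the configured GCS path. A minimal sketch using the TF 1.x tf.gfile API (which resolves gs:// URLs when GCS support is available); the path is the one from the error above:

import tensorflow as tf

TRAIN_FILE = "gs://bert_summarization/train.tf_record"

# tf.gfile understands gs:// URLs, so this checks the bucket object directly.
if not tf.gfile.Exists(TRAIN_FILE):
    raise FileNotFoundError(
        "No TFRecord at %s; generate it first with "
        "file_based_convert_examples_to_features." % TRAIN_FILE)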

santhoshkolloju commented 5 years ago

I have provided the code to generate the TFRecord file.
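
For reference, a minimal sketch of invoking that generation step (the helper's positional signature is taken from the code pasted below in this thread; the processor, tokenizer, data directory, and sequence lengths are placeholders, not values from the repo):

# Hypothetical invocation of the repo's TFRecord generation helper.
# `processor` and `tokenizer` stand in for the repo's data processor and
# tokenizer objects; 512/72 are placeholder source/target sequence lengths.
train_examples = processor.get_train_examples("./data")
file_based_convert_examples_to_features(
    train_examples, 512, 72, tokenizer, "train.tf_record")

Writing to a local path like this sidesteps the GCS 404 entirely; the gs:// path only makes sense once you have write access to that bucket.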

Vibha111094 commented 5 years ago

Do we need to create an empty file gs://bert_summ/train.tf_record first, and then call the function `file_based_convert_examples_to_features`?
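
Worth noting: no empty placeholder object should be needed. In TF 1.x, tf.python_io.TFRecordWriter creates the destination file (including gs:// objects, given write access to the bucket) when it is opened, which is presumably what the conversion helper uses internally. A minimal sketch; the feature name is hypothetical:

import tensorflow as tf

# The writer creates the target object on open; no pre-created file needed.
writer = tf.python_io.TFRecordWriter("gs://bert_summ/train.tf_record")
example = tf.train.Example(features=tf.train.Features(feature={
    # "src_ids" is an illustrative feature name, not the repo's actual schema.
    "src_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 2, 3])),
}))
writer.write(example.SerializeToString())
writer.close()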

callzhang commented 5 years ago

> I have provided the code to generate the TFRecord file.

Is this the right way?

def get_dataset(processor,
                tokenizer,
                data_dir,
                max_seq_length_src,
                max_seq_length_tgt,
                batch_size,
                mode,
                output_dir,
                is_distributed=False):
    """
    Args:
        processor: Data preprocessor; must have get_labels and
            get_train/dev/test_examples methods defined.
        tokenizer: The sentence tokenizer. Generally should be a
            SentencePiece model.
        data_dir: The input data directory.
        max_seq_length_src: Max source sequence length.
        max_seq_length_tgt: Max target sequence length.
        batch_size: Mini-batch size.
        mode: `train`, `eval` or `test`.
        output_dir: The directory to save the TFRecords in.
    """
    if mode == 'train':
        train_examples = processor.get_train_examples(data_dir)
        # Originally: train_file = os.path.join(output_dir, "train.tf_record")
        train_file = "gs://bert_summarization/train.tf_record"
        # Convert the examples to features and write them as TFRecords;
        # the writer creates the file, so it need not exist beforehand.
        file_based_convert_examples_to_features(
            train_examples, max_seq_length_src, max_seq_length_tgt,
            tokenizer, train_file)
        dataset = file_based_input_fn_builder(
            input_file=train_file,
            max_seq_length_src=max_seq_length_src,
            max_seq_length_tgt=max_seq_length_tgt,
            is_training=True,
            drop_remainder=True,
            is_distributed=is_distributed)({'batch_size': batch_size})
    elif mode == 'eval':
        eval_examples = processor.get_dev_examples(data_dir)
        # Originally: eval_file = os.path.join(output_dir, "eval.tf_record")
        eval_file = "gs://bert_summarization/eval.tf_record"
        file_based_convert_examples_to_features(
            eval_examples, max_seq_length_src, max_seq_length_tgt,
            tokenizer, eval_file)
        dataset = file_based_input_fn_builder(
            input_file=eval_file,
            max_seq_length_src=max_seq_length_src,
            max_seq_length_tgt=max_seq_length_tgt,
            is_training=False,
            drop_remainder=True,
            is_distributed=is_distributed)({'batch_size': batch_size})
    elif mode == 'test':
        test_examples = processor.get_test_examples(data_dir)
        # Originally: test_file = os.path.join(output_dir, "predict.tf_record")
        test_file = "gs://bert_summarization/predict.tf_record"
        file_based_convert_examples_to_features(
            test_examples, max_seq_length_src, max_seq_length_tgt,
            tokenizer, test_file)
        dataset = file_based_input_fn_builder(
            input_file=test_file,
            max_seq_length_src=max_seq_length_src,
            max_seq_length_tgt=max_seq_length_tgt,
            is_training=False,
            drop_remainder=True,
            is_distributed=is_distributed)({'batch_size': batch_size})
    else:
        raise ValueError("mode must be 'train', 'eval' or 'test'; got %r" % mode)
    return dataset
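
For context, a hypothetical call might look like the following (MyProcessor, tokenizer, and all paths and lengths are placeholders, not names from the repo):

# Hypothetical usage; replace the placeholders with the repo's actual objects.
processor = MyProcessor()
train_dataset = get_dataset(
    processor, tokenizer,
    data_dir="./data",
    max_seq_length_src=512,   # placeholder lengths
    max_seq_length_tgt=72,
    batch_size=32,
    mode='train',
    output_dir="./records")
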
santhoshkolloju commented 5 years ago

Give me a few days; I will write a detailed post on how you can run this on your own data.

callzhang commented 5 years ago

> Give me a few days; I will write a detailed post on how you can run this on your own data.

That would be awesome. I have gone ahead and started training with the code modification mentioned above. I would love to read your post and learn more about the details.