patil-suraj / question_generation

Neural question generation using transformers
MIT License

Trained e2e T5 model doesn't quite match your e2e model #47

Open huu4ontocord opened 3 years ago

huu4ontocord commented 3 years ago

Dear @patil-suraj!

This is a wonderful package. Thank you for creating it.

I've tried to train a T5 e2e model using your script, changing a few hyperparameters because of CUDA memory issues on Colab. I'm seeing good results, but not quite the same as your e2e model. I'm wondering if I need to train longer or change other hyperparameters.

Here is a colab for the training: https://colab.research.google.com/drive/1xDltBUhUj-ericq-oyhyLIPajfNflJkU?usp=sharing

Parameters to prepare data:

!python prepare_data.py --task e2e_qg --model_type t5 \
    --valid_for_qg_only \
    --dataset_path data/squad_multitask/ \
    --train_file_name /content/drive/MyDrive/question_generation/data/train.pt \
    --valid_file_name /content/drive/MyDrive/question_generation/data/valid.pt \
    --qg_format highlight_qg_format \
    --max_source_length 512 \
    --max_target_length 32

Parameters for training:

args_dict = {
    "model_name_or_path": "t5-small",
    "model_type": "t5",
    "tokenizer_name_or_path": "t5_qg_tokenizer",
    "output_dir": "/content/drive/MyDrive/question_generation/model/t5-e2e-qg",
    "train_file_path": "/content/drive/MyDrive/question_generation/data/train.pt",
    "valid_file_path": "/content/drive/MyDrive/question_generation/data/valid.pt",
    "do_train": True,
    "do_eval": True,
    "evaluate_during_training": True,
    "logging_steps": 1000,
    "learning_rate": 1e-4,
    "num_train_epochs": 10,
}
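
For reference, the colab passes this dict to the repo's training entry point. A minimal sketch of launching it from a notebook cell, assuming the `run_qg` helper exposed by the repo's `run_qg.py` (the exact invocation in my colab may differ slightly):

# Sketch only: assumes question_generation's run_qg.py is importable from the
# working directory and exposes run_qg(args_dict) as described in the repo README.
from run_qg import run_qg

run_qg(args_dict)  # fine-tunes t5-small on the e2e QG data prepared above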

Testing:

%cd /content/question_generation
from pipelines import pipeline

# my fine-tuned model vs. the original pretrained e2e-qg model
nlp = pipeline("e2e-qg", model="/content/drive/MyDrive/question_generation/model/t5-e2e-qg")
nlp_orig = pipeline("e2e-qg")

# text, text2, text3, text4 are example passages defined earlier in the colab
print(nlp(text))
print(nlp_orig(text))
print('******')
print(nlp(text2))
print(nlp_orig(text2))
print('******')
print(nlp(text3))
print(nlp_orig(text3))
print('******')
print(nlp(text4))
print(nlp_orig(text4))

And the results (for each text: my model's output first, then your e2e model's):

/content/question_generation
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_t5.py:184: UserWarning: This sequence already has </s>. In future versions this behavior may lead to duplicated eos tokens being added.
  f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated eos tokens being added."

['Who created Python?', 'When was Python first released?', "What does Python's design philosophy emphasize with its notable use of whitespace?"]
['Who created Python?', 'When was Python first released?', "What is Python's design philosophy?"]

['Gravity is a natural phenomenon by which all things with mass or energy are brought toward one another?']
['What is the Latin word for gravitas?', 'What does gravity give weight to on Earth?', "The Moon's gravity causes what?", 'Gravity has an infinite range, but its effects become weaker as objects get further away?']

['What is the answer to life, universe and everything?']
['What is the answer to life, universe and everything?']

['Forrest Gump is a slow-witted but kind-hearted man from what state?']
['What is the story about forrest Gump?', 'Who is the slow-witted but kind-hearted man from Alabama?', 'When did Gump witness and influence several historical events in the 20th century?']


Any help is much appreciated!

sabhi27 commented 2 years ago

Hi @ontocord, can you please share your training repo with me? I am trying to train a multitask model, and your training pipeline might help me here.

viru-12 commented 2 years ago

Hi @ontocord, can you please share your training repo with me? It would be very helpful.