microsoft / DialoGPT

Large-scale pretraining for dialogue
MIT License

Fine-tune with own dataset - how to do multi-turn? #36

Open GraphGrailAi opened 4 years ago

GraphGrailAi commented 4 years ago

When I run python demo.py, I got advice on how to train the model with my own data:

python LSP_train.py \
    --model_name_or_path /home/joo/Docs/LocalRepository/DialoGPT/models/small \
    --init_checkpoint /home/joo/Docs/LocalRepository/DialoGPT/models/small/pytorch_model.bin \
    --train_input_file /home/joo/Docs/LocalRepository/DialoGPT/data/train.128len.db \
    --eval_input_file ./data/dummy_data.tsv \
    --output_dir /home/joo/Docs/LocalRepository/DialoGPT/models/output_model \
    --seed 42 \
    --max_seq_length 128 \
    --train_batch_size 512 \
    --gradient_accumulation_steps 8 \
    --eval_batch_size 64 \
    --learning_rate 1e-5 \
    --num_optim_steps 10000 \
    --valid_step 5000 \
    --warmup_steps 4000 \
    --normalize_data true \
    --fp16 true \
    --lr_schedule noam \
    --loss_scale 0.0 \
    --no_token_id true \
    --pbar true

But in the settings above I cannot figure out where the training dataset is specified (DialoGPT/data/train.128len.db consists of 5 files with no actual dataset in them), and what is the right dataset format for fine-tuning?

For comparison, the original Hugging Face format uses JSON with lists of PERSONALITY entries and a conversation history with candidate responses: https://github.com/huggingface/transfer-learning-conv-ai/blob/master/example_entry.py. (Additionally: how would one implement personality here?)

Also, the multi-turn dialogue format is discussed in issue https://github.com/microsoft/DialoGPT/issues/17, but where is the Turn1 <|endoftext|> Turn2 <|endoftext|> ... TurnN format actually specified?
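For reference, a minimal sketch of that encoding, assuming the Hugging Face transformers GPT-2 tokenizer (the hub model name is just an example): the turns are simply concatenated with the end-of-text token as the separator.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-small")

turns = ["how are you ?", "i am fine , thanks .", "glad to hear it ."]
# <|endoftext|> doubles as the turn separator in DialoGPT-style data
dialogue = tokenizer.eos_token.join(turns) + tokenizer.eos_token
print(dialogue)  # how are you ?<|endoftext|>i am fine , thanks .<|endoftext|>...
print(tokenizer.encode(dialogue))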

intersun commented 4 years ago

Due to the nature of the large dataset, we use prepro.py to preprocess the raw data (formatted like train_raw.tsv) and output the data into multiple chunks...

For the detailed format, please refer to line 56 of prepro.py:

def _make_features(id_, weights, inputs, tokenizer, max_len):
    # ....
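For readers who cannot open the file: a loose sketch of what that function plausibly does, pieced together from the .tsv format discussed in this thread. This is not the repo's actual implementation; the -1 ignore label and the simple truncation are my assumptions.

def _make_features_sketch(id_, weights, inputs, tokenizer, max_len):
    """Turn one dialogue (a list of weighted turns) into model features."""
    eos_id = tokenizer.encoder["<|endoftext|>"]
    input_ids, lm_labels = [], []
    for weight, turn in zip(weights, inputs):
        ids = tokenizer.encode(turn) + [eos_id]
        input_ids.extend(ids)
        # weight 1.0: predict these tokens; weight 0.0: exclude them from the loss
        lm_labels.extend(ids if weight == 1.0 else [-1] * len(ids))
    # truncate overlong dialogues to the configured max sequence length (assumption)
    return {"id": id_,
            "input_ids": input_ids[:max_len],
            "lm_labels": lm_labels[:max_len]}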
LooperXX commented 4 years ago

Hi, @intersun and @dreasysnail. I am trying to fine-tune the model with my own dataset. I failed to run python demo.py --data small, so I can't see the exact format of the .tsv file. Could you please help me confirm whether the format of my dataset (.tsv file) is correct:

0.0 utt1 EOS 1.0 utt2 EOS 1.0 utt3 \t 1.0 i am a admin .\n

I plan to build a .tsv file in the format above as my own fine-tuning dataset, load medium_ft.pkl to fine-tune on my dataset starting from your DialoGPT (medium) model, and then test the generation performance with an example like:

0.0 utt1 EOS 0.0 utt2 EOS 0.0 utt3 \n

to ask DialoGPT to predict the answer "i am a admin .".
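A rough sketch of that generation check, assuming the fine-tuned weights are available as a Hugging Face model directory (the hub id below is only a placeholder; loading medium_ft.pkl directly would go through this repo's own loading code):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = GPT2LMHeadModel.from_pretrained("microsoft/DialoGPT-medium")  # or your output_dir

# build the context exactly as in training: turns separated by <|endoftext|>
context = tokenizer.eos_token.join(["utt1", "utt2", "utt3"]) + tokenizer.eos_token
input_ids = tokenizer.encode(context, return_tensors="pt")

output = model.generate(input_ids,
                        max_length=input_ids.shape[-1] + 20,
                        pad_token_id=tokenizer.eos_token_id)
# decode only the newly generated tokens, i.e. the predicted answer
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))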

In other words, 0.0 means that the model will not make predictions for this turn, and 1.0 means that predictions are needed.

(Actually I am confused that should I distinguish between user and system turns: 0.0 to user turn and 1.0 to system turn, so that the model only need to predict each system turn. Because the model just need to predict the system utterance in the evaluation. But maybe all 0.0 will help train the model with more data.)
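For concreteness, here is a minimal sketch (my own illustration, not this repo's code) of how such per-turn weights typically become a loss mask: tokens from 0.0 turns get an ignore label, so the cross-entropy loss skips them.

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # the ignore_index convention used by torch's cross_entropy

def masked_lm_loss(logits, labels):
    # shift so position t predicts token t+1, as in standard GPT-2 LM training
    shift_logits = logits[:-1, :]
    shift_labels = labels[1:]
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=IGNORE_INDEX)

# toy example: 5 positions, with the tokens of a 0.0-weight turn masked out
logits = torch.randn(5, 8)                                    # (seq_len, vocab)
labels = torch.tensor([3, 1, IGNORE_INDEX, IGNORE_INDEX, 2])  # (seq_len,)
print(masked_lm_loss(logits, labels))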

Hope to get your reply. Thanks. 🙏