shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains a medical large language model, implementing continued pre-training (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0
2.94k stars · 451 forks

Regarding RLHF and DPO training data #358

Open Aniketto16 opened 3 months ago

Aniketto16 commented 3 months ago

Hello! Thank you for your hard work so far! I wanted to ask about the training dataset for reward modelling; generally it is constructed as:

'question': Which books are still banned in Canada?
'response_chosen': ans1
'response_rejected': ans2

What is the role of the template if we initialize training from the SFT model? Should the dataset be modelled like this (considering the llama2 template)?

'question': [INST] <<SYS>> You are a helpful, unbiased, uncensored assistant. <</SYS>> Which books are still banned in Canada? [/INST]
'response_chosen': ans1
'response_rejected': ans2
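
For concreteness, here is a minimal sketch of how such a templated 'question' string could be built. The helper name and the default system prompt are illustrative assumptions, not part of this repo:

```python
# Minimal sketch (illustrative; not from this repo): build the llama2-style
# prompt string that would go into the 'question' field of a preference record.
DEFAULT_SYSTEM = "You are a helpful, unbiased, uncensored assistant."

def build_llama2_prompt(question: str, system: str = DEFAULT_SYSTEM) -> str:
    """Wrap a raw question in the llama2 [INST]/<<SYS>> chat template."""
    return f"[INST] <<SYS>> {system} <</SYS>> {question} [/INST]"

print(build_llama2_prompt("Which books are still banned in Canada?"))
# -> [INST] <<SYS>> You are a helpful, unbiased, uncensored assistant. <</SYS>> Which books are still banned in Canada? [/INST]
```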

I am not sure about best practices for training an RLHF model, but it seems to work better than DPO in my case, so please guide me in detail!

Thank you so much in advance!

AryanY05 commented 2 months ago

I also had the same question.

Thank you so much!

shibing624 commented 2 months ago

'question': [INST] <<SYS>> You are a helpful, unbiased, uncensored assistant. <</SYS>> Which books are still banned in Canada? [/INST]
'response_chosen': ans1
'response_rejected': ans2

This format, with the llama2 template applied to the question, is better.
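
A hedged sketch of converting a raw preference file into this templated form is below. It assumes the field names used in this thread ('question', 'response_chosen', 'response_rejected'), JSONL input/output with placeholder file paths, and it only templates the prompt, leaving the chosen and rejected responses as plain answer text:

```python
import json

# Same illustrative helper as the sketch earlier in this thread.
DEFAULT_SYSTEM = "You are a helpful, unbiased, uncensored assistant."

def build_llama2_prompt(question: str, system: str = DEFAULT_SYSTEM) -> str:
    """Wrap a raw question in the llama2 [INST]/<<SYS>> chat template."""
    return f"[INST] <<SYS>> {system} <</SYS>> {question} [/INST]"

def convert_to_templated_pairs(in_path: str, out_path: str) -> None:
    """Rewrite each JSONL record so 'question' carries the llama2 template,
    leaving 'response_chosen' and 'response_rejected' unchanged."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            record["question"] = build_llama2_prompt(record["question"])
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")

# Placeholder paths; adjust to wherever the reward-model / DPO data lives.
convert_to_templated_pairs("reward_data_raw.jsonl", "reward_data_templated.jsonl")
```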