shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains a medical large language model, implementing continued pre-training (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0
2.94k stars · 451 forks

Regarding RLHF and DPO training data #358

Open Aniketto16 opened 3 months ago

Aniketto16 commented 3 months ago

Hello! Thank you for your hard work so far! I wanted to ask about the training dataset for reward modelling; generally it is constructed as:

'question': Which books are still banned in Canada?
'response_chosen': ans1
'response_rejected': ans2

What is the role of the template if we initialize training from the SFT model? Should the dataset be modelled like this (considering the llama2 template)?

'question': [INST] <<SYS>> You are a helpful, unbiased, uncensored assistant. <</SYS>> Which books are still banned in Canada? [/INST]
'response_chosen': ans1
'response_rejected': ans2
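
For concreteness, here is a minimal sketch of how such a templated 'question' string could be built. The helper name and the default system prompt are illustrative assumptions, not part of this repo:

```python
# Minimal sketch (illustrative; not from this repo): build the llama2-style
# prompt string that would go into the 'question' field of a preference record.
DEFAULT_SYSTEM = "You are a helpful, unbiased, uncensored assistant."

def build_llama2_prompt(question: str, system: str = DEFAULT_SYSTEM) -> str:
    """Wrap a raw question in the llama2 [INST]/<<SYS>> chat template."""
    return f"[INST] <<SYS>> {system} <</SYS>> {question} [/INST]"

print(build_llama2_prompt("Which books are still banned in Canada?"))
# -> [INST] <<SYS>> You are a helpful, unbiased, uncensored assistant. <</SYS>> Which books are still banned in Canada? [/INST]
```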

I am not sure about best practices for training an RLHF model, but it seems to work better than DPO in my case, so please guide me in detail!

Thank you so much in advance!

AryanY05 commented 2 months ago

I also had the same question.

Thank you so much!

shibing624 commented 2 months ago

'question': [INST] <<SYS>> You are a helpful, unbiased, uncensored assistant. <</SYS>> Which books are still banned in Canada? [/INST]
'response_chosen': ans1
'response_rejected': ans2

This format, with the llama2 template applied to the question, is better.
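
A hedged sketch of converting a raw preference file into this templated form is below. It assumes the field names used in this thread ('question', 'response_chosen', 'response_rejected'), JSONL input/output with placeholder file paths, and it only templates the prompt, leaving the chosen and rejected responses as plain answer text:

```python
import json

# Same illustrative helper as the sketch earlier in this thread.
DEFAULT_SYSTEM = "You are a helpful, unbiased, uncensored assistant."

def build_llama2_prompt(question: str, system: str = DEFAULT_SYSTEM) -> str:
    """Wrap a raw question in the llama2 [INST]/<<SYS>> chat template."""
    return f"[INST] <<SYS>> {system} <</SYS>> {question} [/INST]"

def convert_to_templated_pairs(in_path: str, out_path: str) -> None:
    """Rewrite each JSONL record so 'question' carries the llama2 template,
    leaving 'response_chosen' and 'response_rejected' unchanged."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            record["question"] = build_llama2_prompt(record["question"])
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")

# Placeholder paths; adjust to wherever the reward-model / DPO data lives.
convert_to_templated_pairs("reward_data_raw.jsonl", "reward_data_templated.jsonl")
```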