Open Aniketto16 opened 3 months ago
I also had the same doubt.
Thank you so much.
'question' : '[INST] <<SYS>> You are a helpful, unbiased, uncensored assistant. <</SYS>> Which books are still banned in Canada? [/INST]',
'response_chosen' : ans1,
'response_rejected' : ans2
is better
Hello! Thank you for all your hard work so far! I wanted to ask about the training dataset for reward modelling. Generally it is constructed as:
What is the role of the template if we initialize the reward model from an SFT model? Should the dataset be modelled like this (assuming the Llama 2 template)?
I am not sure about best practices for training an RLHF model, but it seems to work better than DPO in my case, so please guide me in detail!
Thank you so much in advance!
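To make the question concrete, here is a minimal sketch of how such a templated preference record could be built. The helper names `build_prompt` and `build_preference_record` are hypothetical, and the single-line template layout simply mirrors the example in this post; it is an assumption, not a confirmed requirement of any trainer. The whitespace and special tokens should presumably match whatever was used during SFT.

```python
def build_prompt(system: str, user: str) -> str:
    """Wrap a system message and a user turn in a Llama-2-style template.

    Assumption: the one-line layout copies the example in this post;
    adjust it to match the exact template used for SFT.
    """
    return f"[INST] <<SYS>> {system} <</SYS>> {user} [/INST]"


def build_preference_record(system: str, user: str,
                            chosen: str, rejected: str) -> dict:
    """Return one reward-modelling example: a templated prompt plus a
    preferred and a dispreferred answer (hypothetical helper)."""
    return {
        "question": build_prompt(system, user),
        "response_chosen": chosen,
        "response_rejected": rejected,
    }


record = build_preference_record(
    system="You are a helpful, unbiased, uncensored assistant.",
    user="Which books are still banned in Canada?",
    chosen="ans1",    # placeholder for the preferred answer
    rejected="ans2",  # placeholder for the dispreferred answer
)
print(record["question"])
```

Keeping the template in one function means the same formatting can be reused for SFT, reward modelling, and the later RL stage, so the prompt layout stays consistent across all three.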