princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward
MIT License

For the Instruct setup, why do different models require different training datasets? Can the same dataset be used? #19

Closed: qiuwenbogdut closed this issue 1 month ago

qiuwenbogdut commented 2 months ago

For the Instruct setup, why do different models require different training datasets? Can the same dataset be used?

yumeng5 commented 2 months ago

Hi,

The Instruct setups are meant to represent "on-policy" settings, where the responses in the preference dataset are generated by the policy model that will be trained. Therefore, the mistral-instruct and llama3-instruct setups use the UltraFeedback prompts but regenerate the winning/losing responses with mistral-instruct and llama3-instruct, respectively.

There is nothing wrong with using a different preference training dataset, but it would turn the "on-policy" setting into an "off-policy" one (similar to the base setups in our paper), and the performance of the final model is generally expected to be worse.
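For reference, here is a minimal sketch of what regenerating on-policy preference pairs could look like. The model name, sampling parameters, and the `score_with_reward_model` scorer below are illustrative placeholders, not our exact pipeline:

```python
# Illustrative sketch only -- not the exact SimPO data pipeline.
# Sample several responses per UltraFeedback prompt from the policy model
# itself, then score them and keep the best/worst as chosen/rejected.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

policy_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # the to-be-trained policy (example)
tokenizer = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(
    policy_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Reuse the UltraFeedback prompts; only the responses are regenerated.
prompts = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")["prompt"][:8]

def sample_responses(prompt, n=4, max_new_tokens=512):
    """Sample n candidate responses from the policy model for one prompt."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(policy.device)
    outputs = policy.generate(
        input_ids,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
    )
    return [
        tokenizer.decode(o[input_ids.shape[-1]:], skip_special_tokens=True)
        for o in outputs
    ]

def build_preference_pair(prompt, score_with_reward_model):
    """score_with_reward_model is a placeholder for whatever reward model or
    judge you use to rank candidates; it should return a scalar score."""
    candidates = sample_responses(prompt)
    scores = [score_with_reward_model(prompt, c) for c in candidates]
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0])
    return {"prompt": prompt, "chosen": ranked[-1][1], "rejected": ranked[0][1]}
```

Swapping in a different policy model (e.g., a mistral-instruct checkpoint) while keeping the same prompts is what distinguishes the mistral-instruct and llama3-instruct setups from each other.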

Best, Yu

qiuwenbogdut commented 2 months ago

For the base setups, would an "on-policy" setting (regenerating the winning/losing responses with mistral-base and llama3-base) yield better performance?

yumeng5 commented 2 months ago

In general, on-policy settings yield better performance, though we haven't tried this under the base setups -- the goal of the base setups was to facilitate transparent and direct comparisons with models (e.g., Zephyr) trained on the same off-the-shelf preference datasets (e.g., UltraFeedback).