openchat3.5 training data formatting

imoneoi / openchat

OpenChat: Advancing Open-source Language Models with Imperfect Data

https://openchat.team

Apache License 2.0


openchat3.5 training data formatting #59

Open bpucla opened 10 months ago

bpucla commented 10 months ago

Congrats to the authors on the great achievement!

I'm trying to understand your work a bit more. In the inference examples, there are prompts like GPT4 Correct User and Code User. What other conditional prompts were used in training? What does Correct mean here? Thanks!

imoneoi commented 10 months ago

Correct means the answers were verified as correct. GPT4 and Human were also used, indicating data with unknown correctness.

bpucla commented 10 months ago

@imoneoi thank you for the prompt response! I have two follow-up questions.

Are only the following prompts used in training? a) GPT4 Correct User and GPT4 Correct Assistant, b) GPT4 User and GPT4 Assistant, c) Code User and Code Assistant
If the answer to the previous question is yes, does that mean you only used the GPT-4 data and did not use the GPT-3.5 data?

Really appreciate your answers!

imoneoi commented 10 months ago

@bpucla

Yes, and Human User Human Assistant
Yes. GPT-3.5 data is discarded in the 3.5 version
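
As an illustration of the prefixes confirmed above, here is a minimal Python sketch of how a conversation might be rendered under one of the four conditions. The prefix names come from this thread; the `<|end_of_turn|>` separator, the `Prefix: text` layout, and the helper name `format_conversation` are assumptions made for the example and are not confirmed here.

```python
# Condition prefixes named in this thread. Each one is used as a
# "<prefix> User" / "<prefix> Assistant" pair.
CONDITIONS = ("GPT4 Correct", "GPT4", "Code", "Human")

# Assumption: turn separator token. It is not stated in this thread.
END_OF_TURN = "<|end_of_turn|>"


def format_conversation(condition, turns, add_generation_prompt=True):
    """Render (role, text) turns under one condition prefix.

    Hypothetical helper for illustration only. `turns` is a list of
    (role, text) pairs where role is "user" or "assistant".
    """
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    role_names = {
        "user": f"{condition} User",
        "assistant": f"{condition} Assistant",
    }
    parts = []
    for role, text in turns:
        if role not in role_names:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"{role_names[role]}: {text}{END_OF_TURN}")
    prompt = "".join(parts)
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        prompt += f"{role_names['assistant']}:"
    return prompt


if __name__ == "__main__":
    print(format_conversation("GPT4 Correct", [("user", "Hello")]))
```

Swapping the first argument for "GPT4", "Code", or "Human" changes only the role prefixes, which is the sense in which the prompt is conditional.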

yucc-leon commented 10 months ago

@bpucla

Yes, and Human User Human Assistant

Yes. GPT-3.5 data is discarded in the 3.5 version

Could you share more details? In that case V3.5 surpassed ChatGPT using some undisclosed datasets, which makes it hard for us to know what actually accounted for the biggest gains...

ryusaeba commented 8 months ago

Are you using a weight of 0.1 for the data with unknown correctness and a weight of 1.0 for the verified-correct data? If not, could you please share more details?
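
For readers unfamiliar with the scheme this question describes, here is a minimal Python sketch of per-example loss weighting. The 1.0 and 0.1 values are taken from the question itself and are not confirmed anywhere in this thread; the function name and the normalization by total weight are assumptions for illustration.

```python
def weighted_mean_loss(losses, is_verified, w_verified=1.0, w_unknown=0.1):
    """Weighted average of per-example losses.

    Hypothetical helper: verified-correct examples get weight
    `w_verified`, examples of unknown correctness get `w_unknown`.
    The defaults mirror the values asked about, not a confirmed setup.
    """
    if len(losses) != len(is_verified):
        raise ValueError("losses and is_verified must have the same length")
    weights = [w_verified if v else w_unknown for v in is_verified]
    total = sum(weights)
    if total == 0:
        raise ValueError("total weight is zero")
    # Assumption: normalize by the sum of weights rather than batch size.
    return sum(w * l for w, l in zip(weights, losses)) / total


if __name__ == "__main__":
    # One verified example (loss 2.0) and one unverified example (loss 4.0).
    print(weighted_mean_loss([2.0, 4.0], [True, False]))
```

With these weights the unverified example still contributes to the objective, but roughly ten times less than the verified one.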

houghtonweihu commented 6 months ago

@imoneoi You said: GPT-3.5 data is discarded in the 3.5 version

Isn't the presence of GPT-3.5 data what gives the rewards and the class-conditioned policy their meaning? Table 2 in your paper shows the value of the GPT-3.5 data; isn't that what allows for rewards and a class-conditioned policy?