Closed. philschmid closed this issue 10 months ago.

First of all, thank you, and huge congrats on the paper release! I really enjoyed reading it.
I wanted to ask whether you can share any details on how you trained your scorer. Was it simple next-token prediction on the collected data samples? 2k each?
Hi! Thanks again for your interest! 😄
Training of Scorers: Indeed, we trained our scorers using a straightforward next-token prediction task on collected data samples. For quality and complexity, we use the following prompts:
Quality:
"You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"
Complexity:
"You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: {score}"
During inference, to determine the score for each sample, refer to our code, where we extract the probabilities for the six scores (ranging from 1 to 6).
Training Data Volume: Each scorer was trained on a total of 6,000 samples: 1,000 initial seed samples plus 5,000 evolved samples, produced over five evolution iterations from the seeds (one evolved variant per seed per iteration). We will release both datasets soon.
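For illustration, here is a minimal sketch of how a training example could be assembled from these two templates. The template strings are copied verbatim from above; the helper function and the field names `instruction`, `output`, and `score` are hypothetical, not code from the deita repository:

```python
# Illustrative only: templates copied verbatim from the reply above;
# the helper itself is hypothetical, not code from the deita repo.

QUALITY_TEMPLATE = (
    "You are a helpful assistant. Please identify the quality score of the "
    "Response corresponding to the Question. \n "
    "#Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"
)

COMPLEXITY_TEMPLATE = (
    "You are a helpful assistant. Please identify the complexity score of "
    "the following user query. \n##Query: {instruction} \n##Complexity: {score}"
)

def build_training_example(sample: dict, kind: str = "quality") -> str:
    """Render one next-token-prediction training string for a scorer."""
    if kind == "quality":
        # expects "instruction", "output", and "score" keys
        return QUALITY_TEMPLATE.format(**sample)
    return COMPLEXITY_TEMPLATE.format(
        instruction=sample["instruction"], score=sample["score"]
    )

print(build_training_example(
    {"instruction": "Explain overfitting.", "output": "Overfitting is ...", "score": 4}
))
```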
Hi @VPeterV, thanks for your valuable work and the response! I have a follow-up question: could you clarify whether the loss is optimized only on the final {score}, or on the whole prompt, i.e., "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"?
Hi! Following most SFT works, we optimized the loss only on the final {score}.
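For concreteness, here is a minimal sketch of this kind of label masking, assuming a Hugging Face-style tokenizer and PyTorch's convention that label positions set to -100 are ignored by the cross-entropy loss. It illustrates the standard SFT recipe described above, not the authors' actual training code:

```python
import torch

def build_masked_example(tokenizer, prompt_prefix: str, score: str):
    """Supervise only the trailing score token(s), ignoring the prompt.

    `prompt_prefix` is the full template up to and including "##Quality: ";
    positions labeled -100 are skipped by torch's cross-entropy loss, so the
    model is only trained to predict the score. Illustrative sketch.
    """
    prefix_ids = tokenizer(prompt_prefix, add_special_tokens=False)["input_ids"]
    score_ids = tokenizer(score, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor(prefix_ids + score_ids)
    labels = input_ids.clone()
    labels[: len(prefix_ids)] = -100  # mask out everything except the score
    return input_ids, labels
```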
@VPeterV Thanks very much for your clarification! May I ask two more follow-up questions?
- Since the {score} may be a single integer or a float spanning only one or two tokens, the loss may be unstable in this scenario. Do you have insights on how to measure the reliability of the scorer model? And do I need to retrain the model to generalize to evaluations on other datasets, such as UltraChat?
- I totally agree with the finding that direct scoring tends to produce inflated rating scores. When I use your scorer model, do I need to put 6 examples into the prompt (Table 14) to obtain the rank & complexity score, as noted in the sentence:
We emphasize that, distinct from direct scoring, we give ChatGPT all 6 samples within one prompt – these samples represent different evolution stages of the same original sample and such a scoring scheme helps ChatGPT capture the small complexity differences among them,
If so, that means I need to generate 6 variations for each example; otherwise, we may run into the same issues as with the direct scoring prompt. Looking forward to hearing your insights!
Thanks for your interest again, @jianguoz!
```python
# token ids of the digits "1" to "6" in LLaMA-1's vocabulary
id2score = {
    29896: "1",
    29906: "2",
    29941: "3",
    29946: "4",
    29945: "5",
    29953: "6",
}
```
We extract the probabilities of these six tokens directly during inference, as you can see in our code: https://github.com/hkust-nlp/deita/blob/5705b19377cde7b1f008cb29b8a8bcc96d1737c0/src/deita/selection/scorer/base.py#L44C30-L44C30 Note that we do not simply output the most likely score; instead, we compute a weighted sum of the six scores with their probabilities. Therefore, if you keep your prompts consistent between training and inference and adopt a similar score-extraction method to ours (such as the weighted sum), stability should not be a significant concern.
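To make the weighted-sum idea concrete, here is a minimal sketch under the assumption that the six logits are renormalized with a softmax before weighting; see the linked base.py for the actual implementation, which may differ in details:

```python
import torch

# token ids of the digits "1" to "6" in LLaMA-1's vocabulary (see id2score above)
SCORE_TOKEN_IDS = [29896, 29906, 29941, 29946, 29945, 29953]
SCORES = torch.arange(1, 7, dtype=torch.float)  # the six possible scores

def weighted_score(logits: torch.Tensor) -> float:
    """Expected score sum_i p_i * i from the logits at the score position.

    `logits` is the model's (vocab_size,) logit vector at the position where
    the score token is generated. Renormalizing over the six candidate tokens
    keeps the result in the range [1, 6]. Sketch of the weighted-sum idea
    described above, not the repo's exact code.
    """
    probs = torch.softmax(logits[SCORE_TOKEN_IDS], dim=-1)
    return float((probs * SCORES).sum())
```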
Hi @VPeterV, in the EVOL COMPLEXITY section of the paper, I only see 4 prompts. How did you generate 5000 evolved samples from these? Thank you for your interest.
Hi, could you please provide more details about the "4 prompts" or share additional information to help me better understand and locate the question you are referring to?
I have carefully reviewed it and now understand the point I was confused about. Thanks a lot.
Hello! Thanks for the great work. Could you please clarify the intuition behind using the weighted sum of probabilities as the sample's score? From the discussion, this seems to be motivated by more stable scoring, but more details would be very helpful. In particular, since the value is softmaxed, I don't see how any variation beyond 1 is possible (and I have empirically confirmed this by computing the quality and complexity scores with your LLaMA scorer). Thanks!