How did you train the complexity & quality scorer

First of all, thank you, and huge congrats on the paper release! Really enjoyed reading it.

I wanted to ask if you can share any details on how you trained your scorer. Was it simple next token prediciton on the collected data samples? 2k each?

Hi! Thanks again for your interest! 😄

Training of Scorers: Indeed, we trained our scorers using a straightforward next-token prediction task on collected data samples. For quality and complexity, we use the following prompts:

Quality: "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"

Complexity: "You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: {score}"

During inference, to determine the score for each sample, refer to our code where we extract the probabilities for the 6 scores (ranging from 1 to 6)
Training Data Volume: Each scorer was trained on a total of 6,000 data samples. This includes 1,000 initial seed data samples and 5,000 evolved samples, derived over five iterations from each initial sample. Thus, the final dataset comprises 1,000 initial seed data plus 5,000 evolved data. And we will release the two data soon.

Hi @VPeterV , thanks for your valuable work and the response! I have a follow up question that could you clarify is the loss only optimized on the last {score} or the whole prompt, i.e., "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"?

Hi @VPeterV , thanks for your valuable work and the response! I have a follow up question that could you clarify is the loss only optimized on the last {score} or the whole prompt, i.e., "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"?

Hi! Following most SFT works, we only optimized the loss on the last {score}

Hi @VPeterV , thanks for your valuable work and the response! I have a follow up question that could you clarify is the loss only optimized on the last {score} or the whole prompt, i.e., "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"?

Hi! Following most SFT works, we only optimized the loss on the last {score}

@VPeterV Thanks very much for your clarification! May I also have another follow up questions.

Since the {score} may be a single int value or a float value with limited tokens like one or two tokens, the loss maybe not stable in this scenario, do you have insights to measure how reliable is the scorer model? and do I need to retrain the model if I want to generalize to other datasets evaluations such as UltraChat?
I totally agree the findings that direct scoring may raise higher rating scores. When I try to use your designed scorer model, do I need to put 6 examples into the prompt (Table 14) to obtain the Rank & Complexity score? as noted in the sentence We emphasize that, distinct from direct scoring, we give ChatGPT all 6 samples within one prompt – these samples represent different evolution stages of the same original sample and such a scoring scheme helps ChatGPT capture the small complexity differences among them, If so, which means I need to generate 6 variations given an example, otherwise we may have same issues as the direct scoring prompt.

Looking forward to hearing your insights!

Hi @VPeterV , thanks for your valuable work and the response! I have a follow up question that could you clarify is the loss only optimized on the last {score} or the whole prompt, i.e., "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"?

Hi! Following most SFT works, we only optimized the loss on the last {score}

@VPeterV Thanks very much for your clarification! May I also have another follow up questions.

Since the {score} may be a single int value or a float value with limited tokens like one or two tokens, the loss maybe not stable in this scenario, do you have insights to measure how reliable is the scorer model? and do I need to retrain the model if I want to generalize to other datasets evaluations such as UltraChat?

I totally agree the findings that direct scoring may raise higher rating scores. When I try to use your designed scorer model, do I need to put 6 examples into the prompt (Table 14) to obtain the Rank & Complexity score? as noted in the sentence We emphasize that, distinct from direct scoring, we give ChatGPT all 6 samples within one prompt – these samples represent different evolution stages of the same original sample and such a scoring scheme helps ChatGPT capture the small complexity differences among them, If so, which means I need to generate 6 variations given an example, otherwise we may have same issues as the direct scoring prompt.

Looking forward to hearing your insights!

Thanks for ur interest again! @jianguoz

For our scorers, scores tokens [1-6] are all single tokens.

# from llama-1's vocab
id2score = {
            29896: "1",
            29906: "2",
            29941: "3",
            29946: "4",
            29945: "5",
            29953: "6"
            }

We extract the probs of these six tokens directly during inference as you can see on our code: https://github.com/hkust-nlp/deita/blob/5705b19377cde7b1f008cb29b8a8bcc96d1737c0/src/deita/selection/scorer/base.py#L44C30-L44C30 And actually, we do not straightforwardly output the score. We instead compute a weighted sum of these six scores with their probs. Therefore, if you maintain consistency in your prompts during training and inference and adopt a similar method to ours for score extraction (such as using a weighted sum), I believe that stability should not pose a significant concern

Regarding measurement, it is a good question. One intuitive way is to split some scoring data for validation and compare the results from scorers with results from other "stronger-intelligent" like GPT-4. We are considering including this evaluation in the release of our next-version scorers, such as those developed using Mistral-7B
For generalization, I think it relies on the domains of data, which you are willing to score. I think Ultrachat is totally OK, This is because UltraChat's primary focus on instruction-following aligns well with our dataset's main emphasis. Additionally, our full 300K dataset includes a segment of UltraChat data. However, for scoring in specialized domains such as mathematics (e.g., GSM8K, MATH), the efficacy of our current scorers might be less certain. They may not guarantee optimal performance in these areas, but they can still offer valuable basic data quality control, which can be beneficial in many scenarios
No, it is not necessary to provide any examples. You can directly score your samples. The easiest way to implement this is to use the code we provide. The 'Rank & Score' technique was primarily utilized to gather training data for our scorers. One of our goals is to establish a reliable dataset for training a scorer capable of directly evaluating samples without the need for additional examples.

Hi! Thanks again for your interest! 😄

Training of Scorers: Indeed, we trained our scorers using a straightforward next-token prediction task on collected data samples. For quality and complexity, we use the following prompts: Quality: "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}" Complexity: "You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: {score}" During inference, to determine the score for each sample, refer to our code where we extract the probabilities for the 6 scores (ranging from 1 to 6)

Training Data Volume: Each scorer was trained on a total of 6,000 data samples. This includes 1,000 initial seed data samples and 5,000 evolved samples, derived over five iterations from each initial sample. Thus, the final dataset comprises 1,000 initial seed data plus 5,000 evolved data. And we will release the two data soon.

Hi! Thanks again for your interest! 😄

Training of Scorers: Indeed, we trained our scorers using a straightforward next-token prediction task on collected data samples. For quality and complexity, we use the following prompts: Quality: "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}" Complexity: "You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: {score}" During inference, to determine the score for each sample, refer to our code where we extract the probabilities for the 6 scores (ranging from 1 to 6)

Training Data Volume: Each scorer was trained on a total of 6,000 data samples. This includes 1,000 initial seed data samples and 5,000 evolved samples, derived over five iterations from each initial sample. Thus, the final dataset comprises 1,000 initial seed data plus 5,000 evolved data. And we will release the two data soon.

Hi @VPeterV, in the EVOL COMPLEXITY section of the paper, I only see 4 prompts. How did you generate 5000 evolved samples from these? Thank you for your interest.

Hi! Thanks again for your interest! 😄

Training of Scorers: Indeed, we trained our scorers using a straightforward next-token prediction task on collected data samples. For quality and complexity, we use the following prompts: Quality: "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}" Complexity: "You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: {score}" During inference, to determine the score for each sample, refer to our code where we extract the probabilities for the 6 scores (ranging from 1 to 6)

Training Data Volume: Each scorer was trained on a total of 6,000 data samples. This includes 1,000 initial seed data samples and 5,000 evolved samples, derived over five iterations from each initial sample. Thus, the final dataset comprises 1,000 initial seed data plus 5,000 evolved data. And we will release the two data soon.

Hi! Thanks again for your interest! 😄

Training of Scorers: Indeed, we trained our scorers using a straightforward next-token prediction task on collected data samples. For quality and complexity, we use the following prompts: Quality: "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}" Complexity: "You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: {score}" During inference, to determine the score for each sample, refer to our code where we extract the probabilities for the 6 scores (ranging from 1 to 6)

Training Data Volume: Each scorer was trained on a total of 6,000 data samples. This includes 1,000 initial seed data samples and 5,000 evolved samples, derived over five iterations from each initial sample. Thus, the final dataset comprises 1,000 initial seed data plus 5,000 evolved data. And we will release the two data soon.

Hi @VPeterV, in the EVOL COMPLEXITY section of the paper, I only see 4 prompts. How did you generate 5000 evolved samples from these? Thank you for your interest.

Hi, could you please provide more details about the "4 prompts" or share additional information to help me better understand and locate the question you are referring to?

Hi! Thanks again for your interest! 😄

Training of Scorers: Indeed, we trained our scorers using a straightforward next-token prediction task on collected data samples. For quality and complexity, we use the following prompts: Quality: "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}" Complexity: "You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: {score}" During inference, to determine the score for each sample, refer to our code where we extract the probabilities for the 6 scores (ranging from 1 to 6)

Training Data Volume: Each scorer was trained on a total of 6,000 data samples. This includes 1,000 initial seed data samples and 5,000 evolved samples, derived over five iterations from each initial sample. Thus, the final dataset comprises 1,000 initial seed data plus 5,000 evolved data. And we will release the two data soon.

Hi! Thanks again for your interest! 😄

Training of Scorers: Indeed, we trained our scorers using a straightforward next-token prediction task on collected data samples. For quality and complexity, we use the following prompts: Quality: "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}" Complexity: "You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: {score}" During inference, to determine the score for each sample, refer to our code where we extract the probabilities for the 6 scores (ranging from 1 to 6)

Training Data Volume: Each scorer was trained on a total of 6,000 data samples. This includes 1,000 initial seed data samples and 5,000 evolved samples, derived over five iterations from each initial sample. Thus, the final dataset comprises 1,000 initial seed data plus 5,000 evolved data. And we will release the two data soon.

Hi @VPeterV, in the EVOL COMPLEXITY section of the paper, I only see 4 prompts. How did you generate 5000 evolved samples from these? Thank you for your interest.

Hi, could you please provide more details about the "4 prompts" or share additional information to help me better understand and locate the question you are referring to?

I have carefully reviewed and understood the issue that I was confused about. Thanks a lot.

hkust-nlp / deita

How did you train the complexity & quality scorer #3