hkust-nlp / deita

Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
Apache License 2.0
502 stars 27 forks

How did you train the complexity & quality scorer #3

Closed: philschmid closed this issue 10 months ago

philschmid commented 11 months ago

First of all, thank you, and huge congrats on the paper release! Really enjoyed reading it.

I wanted to ask whether you could share any details on how you trained your scorers. Was it simple next-token prediction on the collected data samples? 2k each?

VPeterV commented 11 months ago

Hi! Thanks again for your interest! 😄

  1. Training of Scorers: Indeed, we trained our scorers with a straightforward next-token prediction objective on the collected data samples. For quality and complexity, we use the following prompt templates (a formatting sketch follows this list):

    Quality: "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"

    Complexity: "You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction} \n##Complexity: {score}"

    During inference, to determine the score for each sample, refer to our code, where we extract the probabilities of the six score tokens (ranging from 1 to 6).

  2. Training Data Volume: Each scorer was trained on a total of 6,000 samples: 1,000 initial seed samples plus 5,000 evolved samples, obtained by running five evolution iterations on each seed sample. We will release both datasets soon.
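
For anyone reproducing this setup, here is a minimal sketch of how one training string could be assembled from the templates above. The helper `build_example` and the sample fields are illustrative names, not taken from the deita codebase.

```python
# Templates quoted from the reply above.
QUALITY_TEMPLATE = (
    "You are a helpful assistant. Please identify the quality score of "
    "the Response corresponding to the Question. \n"
    "#Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"
)
COMPLEXITY_TEMPLATE = (
    "You are a helpful assistant. Please identify the complexity score of "
    "the following user query. \n##Query: {instruction} \n##Complexity: {score}"
)

def build_example(sample: dict, kind: str = "quality") -> str:
    """Fill one scorer template with a collected sample (illustrative helper)."""
    if kind == "quality":
        return QUALITY_TEMPLATE.format(**sample)  # needs instruction, output, score
    return COMPLEXITY_TEMPLATE.format(
        instruction=sample["instruction"], score=sample["score"]
    )

# Example: one sample with a complexity label in 1..6.
print(build_example(
    {"instruction": "Sort a list of tuples by the second field.", "score": 3},
    kind="complexity",
))
```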

jianguoz commented 11 months ago

Hi @VPeterV, thanks for your valuable work and the response! I have a follow-up question: could you clarify whether the loss is optimized only on the last {score} or on the whole prompt, i.e., "You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: {score}"?

VPeterV commented 11 months ago

Hi! Following most SFT works, we optimized the loss only on the last {score}.
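
To make "loss only on the score" concrete, here is a minimal sketch of the usual label-masking trick, assuming a Hugging Face tokenizer and the convention that label -100 is ignored by the cross-entropy loss. The tokenizer name is a stand-in, not the deita scorer's actual base model.

```python
import torch
from transformers import AutoTokenizer

# Stand-in tokenizer; deita's scorers are LLaMA-based, but any causal-LM
# tokenizer illustrates the masking.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = ("You are a helpful assistant. Please identify the complexity score of "
          "the following user query. \n##Query: Sort a list in Python. \n##Complexity: ")
score = "3"

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
score_ids = tokenizer(score, add_special_tokens=False)["input_ids"]

input_ids = torch.tensor([prompt_ids + score_ids])
# -100 masks the prompt positions, so cross-entropy is computed only on {score}.
labels = torch.tensor([[-100] * len(prompt_ids) + score_ids])
```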

jianguoz commented 11 months ago

@VPeterV Thanks very much for your clarification! May I ask two more follow-up questions?

  • Since {score} may be a single integer or a float spanning only one or two tokens, the loss may be unstable in this scenario. Do you have insights on how to measure how reliable the scorer model is? And do I need to retrain the model if I want to generalize to evaluations on other datasets, such as UltraChat?
  • I fully agree with the finding that direct scoring tends to produce inflated ratings. When using your scorer model, do I need to put all 6 examples into the prompt (Table 14) to obtain the Rank & Complexity score, as noted in the sentence "We emphasize that, distinct from direct scoring, we give ChatGPT all 6 samples within one prompt – these samples represent different evolution stages of the same original sample and such a scoring scheme helps ChatGPT capture the small complexity differences among them"? If so, that means I need to generate 6 variations of each example; otherwise we may run into the same issues as the direct scoring prompt.

Looking forward to hearing your insights!

VPeterV commented 11 months ago

Thanks for your interest again, @jianguoz!

We extract the probabilities of these six tokens directly during inference, as you can see in our code: https://github.com/hkust-nlp/deita/blob/5705b19377cde7b1f008cb29b8a8bcc96d1737c0/src/deita/selection/scorer/base.py#L44C30-L44C30 Rather than outputting a score token directly, we compute a weighted sum of the six scores using their probabilities (a sketch of this extraction follows). Therefore, if you keep your prompts consistent between training and inference and adopt a similar score-extraction method, such as the weighted sum, I believe stability should not be a significant concern.
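
As an illustration of that weighted-sum extraction, here is a self-contained sketch. It mirrors the idea in the linked base.py but is not the deita implementation; the model name is a placeholder, and it assumes each score string "1".."6" tokenizes to a single token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; deita's scorers are LLaMA-based
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("You are a helpful assistant. Please identify the complexity score of "
          "the following user query. \n##Query: Sort a list in Python. \n##Complexity: ")

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Ids of the six score tokens "1".."6" (assumed to be single tokens each).
score_token_ids = [tokenizer(str(i), add_special_tokens=False)["input_ids"][0]
                   for i in range(1, 7)]

# Renormalize over just these six tokens, then take the expected score.
probs = torch.softmax(logits[score_token_ids], dim=0)
score = (probs * torch.arange(1, 7, dtype=probs.dtype)).sum()
print(float(score))  # weighted sum lies in [1, 6]
```

Because the six probabilities are renormalized before the weighted sum, the result is an expected score between 1 and 6 rather than a probability.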

michaelnparis commented 9 months ago

Hi @VPeterV, in the EVOL COMPLEXITY section of the paper, I only see 4 prompts. How did you generate 5,000 evolved samples from these? Thank you!

VPeterV commented 9 months ago

Hi, could you please provide more details about the "4 prompts", or share additional context to help me locate the part of the paper you are referring to?

michaelnparis commented 9 months ago

I have reviewed it carefully and now understand the point I was confused about. Thanks a lot.

harshitadd commented 2 months ago

Hello! Thanks for the great work. Can you please clarify the intuition behind using the weighted sum of probabilities as the score for a sample? I can see from the discussion that this might be motivated by ensuring more stable scores, but more details would be very helpful. In particular, since this value is softmaxed, I don't see how any value beyond 1 would be possible (and I have empirically confirmed this by computing the quality and complexity scores using your LLaMA scorer). Thanks!