magpie-align / magpie

Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
https://magpie-align.github.io/
MIT License
374 stars, 38 forks

Question about Step 1 and Step 2 being separate steps. #1

Closed: xinyazha92 closed this issue 2 months ago

xinyazha92 commented 2 months ago

Hi, thank you for this interesting work! The paper mentions that MAGPIE is a 2-stage pipeline (Step 1 is instruction generation and Step 2 is response generation). How does this differ from a strategy where Steps 1 and 2 are unified in one pass? Since an LLM is autoregressive, wouldn't the model generate the instruction and the response together?

yuchenlin commented 2 months ago

Hey Xinya, thanks for the question! Yeah, you're right: technically you can generate query 1, response 1, query 2, response 2 all in the same inference pass (by setting specific stopping criteria). We chose to collect them in multiple stages, step by step, so that we have more control:

  1. We need to remove duplicate queries and queries that are noisy or too easy (see our paper for details). Noisy queries may not end properly with the correct <|eot_id|> token, so we need to post-process them.
  2. We need to adjust the sampling temperature separately for queries and responses. We find that a high temperature works better for sampling queries and a lower temperature works better for sampling responses.
  3. We also use an API-like inference engine for some of the inference, and splitting data collection into stages makes it easier for us to monitor progress.
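
The three points above can be sketched roughly as follows. This is a minimal illustration of the staged idea, not the repository's actual code: `generate` is a hypothetical callable standing in for whatever inference engine is used, the names `SamplingConfig`, `clean_queries`, and `run_pipeline` are made up for this example, and the temperature and token values are placeholders (the thread only says "high" for queries and "lower" for responses). The "too easy" filter from the paper is not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable

EOT = "<|eot_id|>"  # end-of-turn marker mentioned in the thread


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    max_tokens: int


# Placeholder values (assumptions): higher temperature for queries,
# lower temperature for responses.
QUERY_SAMPLING = SamplingConfig(temperature=1.0, max_tokens=256)
RESPONSE_SAMPLING = SamplingConfig(temperature=0.0, max_tokens=1024)


def clean_queries(raw_queries: Iterable[str]) -> list[str]:
    """Keep queries that end with EOT, strip the marker, drop exact duplicates."""
    seen: set[str] = set()
    kept: list[str] = []
    for raw in raw_queries:
        text = raw.strip()
        if not text.endswith(EOT):
            continue  # noisy: generation did not stop at a proper end of turn
        query = text[: -len(EOT)].strip()
        key = query.lower()  # case-insensitive exact-match deduplication
        if not query or key in seen:
            continue
        seen.add(key)
        kept.append(query)
    return kept


def run_pipeline(
    generate: Callable[[str, SamplingConfig], str],
    pre_query_prompt: str,
    n_queries: int,
) -> list[tuple[str, str]]:
    """Stage 1: sample queries. Filter them. Stage 2: sample one response each."""
    raw = [generate(pre_query_prompt, QUERY_SAMPLING) for _ in range(n_queries)]
    queries = clean_queries(raw)
    # A real setup would wrap each query in the model's chat template here.
    return [(q, generate(q, RESPONSE_SAMPLING)) for q in queries]
```

Because filtering sits between the two stages, no response tokens are spent on queries that get discarded, and each stage can be monitored or rerun independently.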

xinyazha92 commented 2 months ago

That makes sense. Thank you for such a quick reply!