LLMs are typically trained in phases: self-supervised pre-training, followed by supervised alignment tuning.
Alignment tuning typically happens in two stages: instruction tuning, followed by preference tuning.
Instruction tuning is closer to the traditional supervised training approach in machine learning, where the model is trained directly on tasks of interest. In this stage, the model is given a task description in the form of a natural language instruction (e.g., "Summarize the following news article in 2 lines: {News article}") and is trained to maximize the likelihood of the provided ground-truth summary.
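As a concrete illustration, here is a minimal instruction-tuning sketch using Hugging Face transformers. The model name is a small placeholder and the example strings stand in for real data; prompt tokens are masked with `-100` so the cross-entropy loss covers only the ground-truth summary, and minimizing that loss maximizes the summary's likelihood.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder; in practice a much larger base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

instruction = "Summarize the following news article in 2 lines: {News article}\n"
target = "{Ground truth 2-line summary}"

# Tokenize the instruction prompt and the ground-truth response separately.
prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids

# Train on the concatenation, but compute the loss only on the target tokens.
input_ids = torch.cat([prompt_ids, target_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss

# The forward pass returns the negative log-likelihood of the summary tokens.
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # one step of supervised fine-tuning
```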
Preference tuning, on the other hand, uses techniques such as RLHF (reinforcement learning from human feedback) and DPO (direct preference optimization), where responses from an instruction-tuned model are labeled as preferred or rejected using human feedback, and the model is then optimized to favor the preferred responses.
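Below is a minimal sketch of the DPO objective, assuming the summed token log-probabilities of each response under the trained policy and under a frozen reference (the instruction-tuned model) have already been computed; the function name and the toy numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss over a batch of (preferred, rejected) response pairs.

    Each argument is the summed token log-probability of a response under
    either the policy being trained or the frozen reference model.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Push the implicit reward of the preferred response above the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
```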
A related technique for scaling alignment tuning is to use a larger teacher model to generate synthetic data for training a smaller student model, incorporating principles into the generation prompt to promote diversity in the generated instruction data. This is the approach used for models such as IBM's labradorite-13b:
https://huggingface.co/ibm/labradorite-13b
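As an illustrative sketch only (not the actual labradorite-13b recipe), teacher-driven generation can look like the following; the teacher model, the principles, and the prompt template are all assumptions here.

```python
from transformers import pipeline

# Assumed teacher model; any sufficiently capable instruct model could serve.
teacher = pipeline("text-generation",
                   model="mistralai/Mixtral-8x7B-Instruct-v0.1")

# Hypothetical principles embedded in the prompt to push for diverse outputs.
principles = [
    "Cover a different domain than previous examples.",
    "Vary the task type (summarization, Q&A, rewriting, reasoning).",
    "Vary the difficulty and length of the instruction.",
]

prompt = (
    "Generate one new instruction and a high-quality response.\n"
    "Follow these principles:\n"
    + "\n".join(f"- {p}" for p in principles)
    + "\nFormat:\nInstruction: ...\nResponse: ..."
)

# Sample several candidates; in practice the synthetic pairs are filtered
# before being used to instruction-tune the smaller student model.
samples = teacher(prompt, max_new_tokens=256, num_return_sequences=4,
                  do_sample=True, temperature=1.0)
synthetic_data = [s["generated_text"] for s in samples]
```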
Originally posted by @manisnesan in https://github.com/manisnesan/fastchai/issues/47#issuecomment-1968021707