shm007g opened 1 year ago
- https://openai.com/research/techniques-for-training-large-neural-networks
- https://openai.com/research/sparse-transformer
- https://openai.com/research/measuring-goodharts-law
- https://openai.com/research/webgpt
- https://openai.com/research
- [LLMs are zero-shot rankers for recommender systems]
- [Amazon, Text is all you need: learning language representations for sequential recommendation]
- A new alternative to RLHF just dropped! https://twitter.com/rasbt/status/1663883300522295296
  - [Direct Preference Optimization: Your Language Model is Secretly a Reward Model, https://arxiv.org/abs/2305.18290]
  - https://github.com/eric-mitchell/direct-preference-optimization
  - https://github.com/LAION-AI/Open-Assistant/discussions/3347
- [Distilling step-by-step: outperforming LLMs with less training data and smaller model sizes]
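For context on why DPO is an alternative to RLHF: it skips the separate reward model and PPO loop, directly optimizing the policy on preference pairs. A minimal per-example sketch of the DPO objective in plain Python (function name, argument names, and the `beta=0.1` default are illustrative, not from the linked repo):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a summed token log-probability of the chosen/rejected
    response under the trained policy or the frozen reference model.
    (Illustrative sketch of the objective in the DPO paper, not the repo's API.)
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) written stably as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; as the policy raises the chosen response's likelihood relative to the rejected one (versus the reference), the loss falls toward zero.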
- [Instruction tuning with GPT-4, Microsoft, 2023.04]