Title: "Yi Model Family: Powerful Multi-Dimensional Language and Multimodal Models"
Description:
"We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rate on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations to the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models."
Contents
Introduction
Pretraining
Finetuning
Infrastructure
Safety
Evaluations
Capability Extension
Final Discussions
1 Introduction
Recent breakthroughs in large language models have revolutionized the field of artificial intelligence, and their impact potentially radiates across the whole of human society. Our vision for large language models is to make them the next-generation computational platform and empower the whole community with significantly amplified intelligence. As a step towards this mission, we present the Yi model series: 6B and 34B language models pretrained from scratch on 3.1T tokens of highly engineered data and finetuned on a small but meticulously polished alignment dataset. Thanks to the data quality resulting from our substantial engineering efforts, which we detail in the upcoming sections, Yi achieves benchmark scores and human preference ratings close to those of GPT-3.5.
In designing the Yi model series, we are mostly concerned with the following dimensions of model scale, data scale, and data quality: (1) when choosing the model scale, the desideratum is a model small enough for inference on consumer-grade hardware like the RTX 4090, where the bounding factor is its limited 24GB memory, yet large enough to exhibit complex reasoning and emergent abilities; we find that 34B gives a good performance-cost balance; (2) since 34B is smaller than the conventional 70B used by Chinchilla [30] and LLaMA [77], we increase the pretraining data scale to 3.1T tokens to compensate for the reduced compute (FLOPs). This places the model-data scale combination in the post-Chinchilla-optimal regime [64], i.e., we overtrain the model on more tokens (3T) than is compute-optimal (around 1T). The benefit comes on the inference side: we achieve stronger performance at reduced serving cost, since after int4 [81] quantization one can serve the 34B chat model within 24GB of GPU memory with almost no performance drop (see the back-of-the-envelope estimate below); (3) our data engineering principle is to promote quality over quantity for both pretraining and finetuning. Pretraining data quality is ensured by a sophisticated data cleaning pipeline with cascaded filtering methods and intentionally increased deduplication strength; (4) for finetuning data, we heavily emphasize quality by handcrafting fewer than 10K instructions over multiple iterations based on user feedback. This approach deviates significantly from quantity-scaling instruction tuning works like FLAN [9] and UltraChat [19], and aligns more closely with handcrafting-style works like LIMA [94].
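The claim that an int4-quantized 34B model fits within 24GB of GPU memory can be sanity-checked with simple arithmetic. The sketch below is only illustrative: the layer count, GQA head count, and head dimension are assumed values that approximate Yi-34B's public configuration and are not taken from this section.

```python
# Back-of-the-envelope memory estimate for serving a ~34B-parameter model
# with int4 weights and an 8-bit KV cache on a 24 GB GPU.
# The architecture numbers below are assumptions for illustration only.

N_PARAMS = 34e9          # total parameters (approximate)
N_LAYERS = 60            # assumed number of transformer layers
N_KV_HEADS = 8           # assumed GQA key/value heads
HEAD_DIM = 128           # assumed per-head dimension


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for model weights alone, ignoring quantization overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_memory_gb(seq_len: int, batch: int, bytes_per_value: int = 1) -> float:
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim values per token."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * seq_len * batch / 1e9


if __name__ == "__main__":
    print(f"bf16 weights           : {weight_memory_gb(N_PARAMS, 16):.1f} GB")  # ~68 GB, too large
    print(f"int4 weights           : {weight_memory_gb(N_PARAMS, 4):.1f} GB")   # ~17 GB, fits in 24 GB
    print(f"8-bit KV cache, 4K ctx : {kv_cache_memory_gb(4096, 1):.2f} GB")     # ~0.5 GB
```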
Our pretraining data cleaning system features a sophisticated filtering pipeline based on language, heuristic textual features, perplexity, semantics, topic, and safety, as well as a cascaded deduplication process based on paragraph-level, MinHash, and exact matching. This thorough pipeline leads to a much higher removal ratio than existing pipelines like CCNet [80], RefinedWeb [56], and RedPajama [13], which we believe is key to the success of data engineering. The underlying principle is that although pretraining requires data scaling, one should ensure the data used are of high quality rather than training the model on large amounts of raw data; i.e., we prefer 3T tokens produced with sophisticated engineering over 10T tokens without extensive filtering. Regarding the model architecture, we use a standard implementation of the Transformer with Grouped-Query Attention (GQA) [1], SwiGLU [68] activation, and RoPE with an adjusted base frequency (RoPE ABF) [82]. This design choice is the standard approach rooted in the original Transformer paper [78], later modified by GPT-3 and Chinchilla [30], and then followed by LLaMA [77], Baichuan [84], Qwen [3], and many related works.
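To make the MinHash stage of such a cascaded deduplication concrete, here is a minimal, self-contained sketch. It is not the authors' actual pipeline: the shingle size, signature length, and similarity threshold are arbitrary choices for illustration, and production systems typically add locality-sensitive hashing to avoid pairwise comparison.

```python
# Minimal MinHash near-deduplication sketch (illustrative, not the paper's pipeline).
# Documents whose estimated Jaccard similarity of word 5-grams exceeds a threshold
# are treated as near-duplicates.
import hashlib
from itertools import combinations

NUM_HASHES = 128
SHINGLE_SIZE = 5


def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Split a document into overlapping word k-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(doc_shingles: set[str]) -> list[int]:
    """One salted hash family; keep the minimum hash per slot over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        slot_min = min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        )
        sig.append(slot_min)
    return sig


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES


docs = {
    "d1": "the quick brown fox jumps over the lazy dog near the river bank",
    "d2": "the quick brown fox jumps over the lazy dog near the river bend",
    "d3": "large language models are trained on trillions of filtered tokens",
}
sigs = {k: minhash_signature(shingles(v)) for k, v in docs.items()}
for a, b in combinations(docs, 2):
    sim = estimated_jaccard(sigs[a], sigs[b])
    flag = "-> near-duplicate candidate" if sim > 0.8 else ""
    print(f"{a} vs {b}: est. Jaccard {sim:.2f} {flag}")
```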
To approach GPT-3.5-matching human preferences, our finetuning dataset is curated from carefully selected multi-turn instruction-response pairs, annotated directly by our team of machine learning engineers and then polished over multiple iterations of user feedback. As mentioned above, our finetuning dataset contains fewer than 10K instances, but it is improved repeatedly across the model development timeline. Benefiting from the dataset's manageable size, we employ an extensive grid search to identify the optimal data composition, promote diversity, and discover effective hyperparameters. After 8-bit and 4-bit quantization, the final chat model can be deployed on consumer-grade GPUs with nearly no performance degradation compared to the bf16 format.
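A grid search over data composition and hyperparameters could be organized as in the toy sketch below. The mixture sources, value grids, and the `evaluate` stub are placeholders; the paper does not specify the exact search space or scoring procedure used.

```python
# Toy grid search over finetuning data mixtures and hyperparameters.
# Sources, value grids, and evaluate() are illustrative placeholders.
from itertools import product

data_mixtures = [                      # fractions of hypothetical instruction sources
    {"general": 0.6, "code": 0.2, "math": 0.2},
    {"general": 0.4, "code": 0.3, "math": 0.3},
]
learning_rates = [1e-5, 2e-5]
epochs = [2, 3]


def evaluate(mixture: dict, lr: float, n_epochs: int) -> float:
    """Stub: in practice this would finetune on the mixture and score the
    resulting checkpoint on a held-out preference or benchmark set."""
    return 0.0  # placeholder score


best = None
for mixture, lr, n_epochs in product(data_mixtures, learning_rates, epochs):
    score = evaluate(mixture, lr, n_epochs)
    if best is None or score > best[0]:
        best = (score, mixture, lr, n_epochs)

print("best configuration:", best)
```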
We further extend the Yi model's capability along three dimensions: context scaling, vision-language adaptation, and depth upscaling. To achieve a 200K context length, we continually pretrain the model on about 5B tokens of length-upsampled data, similar to the concurrent work of Fu et al. [22]. To adapt the model to vision-language tasks, we integrate a vision encoder and develop a multi-stage training method, following and improving on the practice of Liu et al. [47]. We also study depth upscaling [38], i.e., making the model deeper through continual pretraining, and confirm its effectiveness in further improving model performance.
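The depth-upscaling idea cited as [38] is commonly realized by duplicating a contiguous block of layers from the pretrained checkpoint and then continuing pretraining. The snippet below is a schematic sketch using a tiny randomly initialized LLaMA-style model from Hugging Face Transformers so it runs on CPU; the layer indices and model sizes are illustrative assumptions, not Yi's actual recipe.

```python
# Schematic depth-upscaling sketch: duplicate a contiguous block of decoder
# layers, after which the deeper model would be continually pretrained.
# A tiny random-init LLaMA-style model is used so the example runs on CPU;
# the layer range below is illustrative, not the paper's configuration.
import copy

import torch.nn as nn
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(hidden_size=256, intermediate_size=512,
                     num_hidden_layers=8, num_attention_heads=4,
                     num_key_value_heads=2, vocab_size=1000)
model = LlamaForCausalLM(config)


def depth_upscale(model: LlamaForCausalLM, dup_start: int, dup_end: int) -> LlamaForCausalLM:
    """Duplicate layers [dup_start, dup_end) and append the copies after them."""
    layers = model.model.layers                      # nn.ModuleList of decoder layers
    duplicated = [copy.deepcopy(layers[i]) for i in range(dup_start, dup_end)]
    new_layers = list(layers[:dup_end]) + duplicated + list(layers[dup_end:])
    model.model.layers = nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)
    # Note: recent transformers versions store a per-layer index used for KV
    # caching; those indices would need re-assignment before cached generation.
    return model


model = depth_upscale(model, dup_start=2, dup_end=6)   # 8 layers -> 12 layers
print(model.config.num_hidden_layers)                   # 12
```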
Our infrastructure provides strong support for full-stack development of the Yi model series, from pretraining to finetuning to serving. To support pretraining, we develop cross-cloud elastic task scheduling, automatic failure recovery, and topology-aware resource allocation, which collectively enable us to run tasks according to the real-time availability of GPU nodes across clusters with limited switching overhead. To support finetuning, we build a hierarchical scheduling framework supporting different distributed backends for different models (e.g., Megatron [70] for the policy model and DeepSpeed [60] for the reward model). For efficient inference, we use 4-bit model quantization and 8-bit KV cache quantization, combined with PagedAttention [41] and dynamic batching.
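For readers who want to reproduce a similar serving setup with open-source tooling, vLLM combines PagedAttention, continuous (dynamic) batching, and weight quantization. The sketch below is not the authors' internal inference stack; the model ID and the quantization/KV-cache flags are assumptions that depend on the released checkpoints and the installed vLLM version.

```python
# Illustrative serving sketch with vLLM (PagedAttention + continuous batching).
# NOT the authors' internal stack; model ID and flags are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="01-ai/Yi-34B-Chat-4bits",   # assumed 4-bit (AWQ) community checkpoint
    quantization="awq",                # 4-bit weight quantization
    kv_cache_dtype="fp8",              # 8-bit KV cache, if supported by the vLLM build
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain grouped-query attention in two sentences."], params)
print(outputs[0].outputs[0].text)
```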
Extensive experiments demonstrate that Yi-34B can match GPT-3.5 in both performance and efficiency. On most standard benchmarks, such as MMLU [27] (for the base model) and the LMSYS ELO rating [93] (for the chat model), Yi-34B generally achieves scores on par with GPT-3.5. After model parameter and KV cache quantization, the inference cost is also kept low enough that a wide range of the community can deploy the model on cost-effective devices. We further report a detailed performance comparison between Yi and major LLMs on commonsense reasoning, college exams, math, coding, reading comprehension, and human preference win rate across multiple evaluation benchmarks.
Since its release, the Yi model series has benefited the community in the following ways: (1) it provides researchers with cost-effective models of GPT-3.5-matching quality and enables developers to build AI-native applications such as language-model-based agents; (2) it empowers end users with locally runnable chatbots, which in turn helps protect user data privacy; (3) it sheds light on the direction of further data and model scaling to achieve even stronger frontier models, for both research and commercial use.
URL
https://arxiv.org/html/2403.04652v1
Suggested labels
{'label-name': 'large-language-models', 'label-description': 'Models like the Yi series with significant scale and capabilities in language and multimodal tasks.', 'confidence': 57.09}