Tracking instruction-tuned LLM openness. Paper: Liesenfeld, Andreas, Alianda Lopez, and Mark Dingemanse. 2023. “Opening up ChatGPT: Tracking Openness, Transparency, and Accountability in Instruction-Tuned Text Generators.” In Proceedings of the 5th International Conference on Conversational User Interfaces. doi:10.1145/3571884.3604316.
The training for the Stable Beluga models was directly inspired by the methodology pioneered by Microsoft in its paper: "Orca: Progressive Learning from Complex Explanation Traces of GPT-4." While our data generation process is similar, we differ in our data sources.
Our variant of the dataset contains 600,000 data points (roughly 10% of the size of the dataset used in the original Orca paper) and was created synthetically using high-quality instructions from datasets created by Enrico Shippole (listed in the blog post cited below).
"Meet Stable Beluga 1 and Stable Beluga 2, Our Large and Mighty Instruction Fine-Tuned Language Models" https://stability.ai/blog/stable-beluga-large-instruction-fine-tuned-models