Feature Overview (mandatory - Complete while in New status)
The various InstructLab experiences use different SDG pipelines, and, even when fine-tuning the same full-resolution model, the performance and quality of the resulting fine-tuned models differ depending on which pipeline produced the training data.
This card covers creating an evaluation flow that takes SDG data generated by the three default pipelines, fine-tunes a full-resolution model with each, and evaluates the performance of the resulting fine-tuned models.
Goals (mandatory - Complete while in New status)
Provide quantitative evidence of how the choice of SDG pipeline used for fine-tuning affects model performance.
Requirements (mandatory - Complete while in Refinement status):
Generate SDG data using the three default pipelines (a generation sketch follows this list):
laptop (a simplified self-instruct) (pipeline=simple)
upstream (SDG 1.0) (pipeline=full)
downstream RHEL AI (SDG 1.5) (pipeline=agentic)
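A minimal sketch of the generation step, driving the `ilab` CLI from Python. The three pipeline names are taken from this card; the `--output-dir` flag and the directory layout are assumptions to verify against the InstructLab release in use, not a definitive invocation:

```python
# sdg_generate.py -- sketch: run `ilab data generate` once per default pipeline.
# Assumptions: `ilab` is on PATH and a taxonomy is already set up; the
# "agentic" pipeline name comes from this card, not from a verified release.
import subprocess
from pathlib import Path

# laptop / upstream (SDG 1.0) / downstream RHEL AI (SDG 1.5)
PIPELINES = ["simple", "full", "agentic"]

def generate_all(base_dir: str = "sdg-runs") -> None:
    for pipeline in PIPELINES:
        out_dir = Path(base_dir) / pipeline
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ilab", "data", "generate",
             "--pipeline", pipeline,
             "--output-dir", str(out_dir)],
            check=True,  # fail fast if a pipeline run errors out
        )

if __name__ == "__main__":
    generate_all()
```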
Use the multi-stage agentic fine-tuning pipeline to produce a fine-tuned model from each SDG pipeline's output (see the train-and-evaluate sketch below)
Evaluate each of the resulting models on domain-specific knowledge (e.g. MMLU_branch)
Open question: should the task-dir produced by the agentic pipeline be used?
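A companion sketch, under the same assumptions, for the fine-tune and evaluate steps. `ilab model train` and `ilab model evaluate` are real subcommands, but the exact flags for multi-phase training and for the MMLU_branch tasks directory vary by release; the file names (`train.jsonl`, `node_datasets`) and checkpoint paths below are hypothetical placeholders to adapt:

```python
# train_and_eval.py -- sketch: fine-tune one model per SDG output, then score
# it with the MMLU_branch benchmark. Flags and paths are assumptions.
import subprocess
from pathlib import Path

def train(data_path: str, ckpt_dir: str) -> None:
    # Multi-stage fine-tuning; the exact strategy flag depends on the release.
    subprocess.run(
        ["ilab", "model", "train",
         "--data-path", data_path,
         "--ckpt-output-dir", ckpt_dir],
        check=True,
    )

def evaluate(model_dir: str, tasks_dir: str) -> None:
    # MMLU_branch needs a tasks directory emitted by the SDG run (the card
    # leaves open whether to reuse the agentic pipeline's task-dir).
    subprocess.run(
        ["ilab", "model", "evaluate",
         "--model", model_dir,
         "--benchmark", "mmlu_branch",
         "--tasks-dir", tasks_dir],
        check=True,
    )

if __name__ == "__main__":
    for pipeline in ["simple", "full", "agentic"]:
        run_dir = Path("sdg-runs") / pipeline          # matches the sketch above
        train(str(run_dir / "train.jsonl"), f"ckpts/{pipeline}")
        evaluate(f"ckpts/{pipeline}", str(run_dir / "node_datasets"))
```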
Note: Consider repeating the experiment with several distinct runs of each SDG pipeline to establish the expected range or average of the performance differences (see the aggregation sketch below).
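For the note above, a small sketch of how per-run MMLU_branch scores could be aggregated into an average and a range per pipeline; the scores dict is an empty placeholder to fill in from real evaluation output:

```python
# aggregate.py -- sketch: summarize repeated evaluation runs per pipeline.
from statistics import mean

scores: dict[str, list[float]] = {  # per-run MMLU_branch scores, one list per pipeline
    "simple": [],
    "full": [],
    "agentic": [],
}

for pipeline, runs in scores.items():
    if not runs:
        print(f"{pipeline}: no runs recorded yet")
        continue
    print(f"{pipeline}: mean={mean(runs):.3f} "
          f"range=[{min(runs):.3f}, {max(runs):.3f}] over {len(runs)} runs")
```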
Done - Acceptance Criteria (mandatory - Complete while in Refinement status):
Provide a report on the model performance differences across the three default pipelines
Provide a pipeline or scripts that users can execute on-premise should they want to replicate the evaluation for their own use cases
Tasks/Epics Tracker: