instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0
7 stars 17 forks source link

Consider updating API design for clarity #61

Open markmc opened 4 days ago

markmc commented 4 days ago

Consider the following:

ds = Dataset.from_list(samples)

skills_flow = SynthGroundedSkillsFlow(client, "mixtral", teacher_model).get_flow()
skills_pipe = Pipeline(skills_flow)

sdg = SDG([skills_pipe])
gen_data = sdg.generate(ds)

or:

ds = Dataset.from_list(samples)

mmlu_flow = MMLUBenchFlow(client, teacher_model).get_flow()
mmlu_pipe = Pipeline(mmlu_flow)
knowledge_flow = SynthKnowledgeFlow(client, teacher_model).get_flow()
knowledge_pipe = Pipeline(knowledge_flow)

sdg = SDG([mmlu_pipe, knowledge_pipe])
gen_data = sdg.generate(ds)

Consider the nouns:

Proposals:

  1. Remove SDG - we don't need both SDG and Pipeline since Pipeline can already do everything SDG can do
  2. Model Flow as a block config template - it would be more clear if we reinforced the idea that a "flow" is a template of a block config sequence - a render() method make sense to me, and an extensible params object for the common case of instantiating multiple flows
  3. Create a pipeline from a sequence of flows - add a Pipeline.from_flows() convenience class method to Pipeline that knows how to render block configs from a sequence of flows

So we could have e.g.

ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = SynthGroundedSkillsFlow(flow_params).render(params)

skills_pipe = Pipeline(block_configs)

gen_data = skills_pipe.generate(ds)

or:

ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())

knowledge_pipe = Pipeline(block_configs)

gen_data = knowledge_pipe.generate(ds)

or:

ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

knowledge_pipe = Pipeline.from_flows([MMLUBenchFlow, SynthKnowledgeFlow], flow_params)

gen_data = knowledge_pipe.generate(ds)

Resolving this issue would require an update to the design doc and a code change

It would definitely be better to do this before users of the API proliferate

shivchander commented 3 days ago

Remove SDG - we don't need both SDG and Pipeline since Pipeline can already do everything SDG can do

While I understand the intent to simplify the system, I believe removing the SDG might not be the best approach for the following reasons:

  1. Separation of Concerns: The SDG serves as a higher-level abstraction that can manage and coordinate multiple pipelines. This separation is important for maintaining clarity in the system design, especially when dealing with complex workflows that involve multiple pipelines. Mixing these responsibilities into a single Pipeline class could reduce the readability and maintainability of the code.

  2. Complex Workflows: In scenarios where we need to coordinate multiple pipelines, the SDG provides a clear and concise way to manage these interactions. It can handle the orchestration of pipelines, ensuring that each one is executed in the correct order and with the correct dependencies. Removing SDG could make it harder to manage these types of workflows effectively.

Currently we only support linear pipeline chaining (and each pipeline only supports linear chaining of the block). But in future we want to move towards supporting composition of complex ways to compose (any DAGs)

shivchander commented 3 days ago
ds = Dataset.from_list(samples)

block_params = BlockConfigParams(client, "mixtral", teacher_model)
flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())

I like this proposal to enhance the clarity and convenience

markmc commented 2 days ago

Remove SDG - we don't need both SDG and Pipeline since Pipeline can already do everything SDG can do ... Currently we only support linear pipeline chaining (and each pipeline only supports linear chaining of the block). But in future we want to move towards supporting composition of complex ways to compose (any DAGs)

I can totally see that, but consider the YAGNI principle

"Always implement things when you actually need them, never when you just foresee that you [will] need them."

It's very easy to add this concept later when we actually add support for more complex compositions. And at that point, we can consider an API design in the context of a more concrete proposal.

Right now, this concept adds nothing but confusion - e.g. should you chain together MMLUBenchFlow and SynthKnowledgeFlow in a single pipeline, or make each a pipeline and chain them together with SDG? Eiher!

Also, I don't think SDG is a great name. It's not a noun. It doesn't convey its role in terms of composition/orchestration of pipelines. This is the sort of thing we can get right later when we need to add it.

markmc commented 2 days ago
ds = Dataset.from_list(samples)

block_params = BlockConfigParams(client, "mixtral", teacher_model)
flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())

I like this proposal to enhance the clarity and convenience

Thank you.

Just a note - the BlockConfigParams line should not be there, that's just a leftover from my editing.

markmc commented 1 hour ago
  1. Model Flow as a block config template - it would be more clear if we reinforced the idea that a "flow" is a template of a block config sequence - a render() method make sense to me, and an extensible params object for the common case of instantiating multiple flows

In #86, flows are now YAML files with block_configs

  1. Create a pipeline from a sequence of flows - add a Pipeline.from_flows() convenience class method to Pipeline that knows how to render block configs from a sequence of flows

In #86, this is now PipelineContext

So it now looks like:

ds = Dataset.from_list(samples)

ctx = PipelineContext(client, "mixtral", teacher_model)

knowledge_pipe = Pipeline.from_flows(ctx, [MMLU_BENCH_FLOW, SYNTH_KNOWLEDGE_FLOW])

gen_data = knowledge_pipe.generate(ds)