markmc opened 4 days ago
Remove `SDG` - we don't need both `SDG` and `Pipeline` since `Pipeline` can already do everything `SDG` can do
While I understand the intent to simplify the system, I believe removing the SDG might not be the best approach for the following reasons:
1. **Separation of Concerns**: The SDG serves as a higher-level abstraction that can manage and coordinate multiple pipelines. This separation is important for maintaining clarity in the system design, especially when dealing with complex workflows that involve multiple pipelines. Mixing these responsibilities into a single Pipeline class could reduce the readability and maintainability of the code.
2. **Complex Workflows**: In scenarios where we need to coordinate multiple pipelines, the SDG provides a clear and concise way to manage these interactions. It can handle the orchestration of pipelines, ensuring that each one is executed in the correct order and with the correct dependencies. Removing SDG could make it harder to manage these types of workflows effectively.
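To make the orchestration role concrete, here is a minimal sketch of an SDG-like object that runs several pipelines in order, feeding each one the previous pipeline's output. The classes and their behavior here are illustrative stand-ins, not the project's actual implementation:

```python
# Hypothetical minimal classes to illustrate SDG-style orchestration:
# each Pipeline transforms a dataset, and SDG runs pipelines in order.

class Pipeline:
    def __init__(self, suffix):
        self.suffix = suffix

    def generate(self, dataset):
        # Stand-in transformation: tag each sample with this pipeline's suffix.
        return [s + self.suffix for s in dataset]

class SDG:
    def __init__(self, pipelines):
        self.pipelines = pipelines

    def generate(self, dataset):
        # Orchestration: run pipelines in order, passing each one's
        # output as the next one's input.
        for pipeline in self.pipelines:
            dataset = pipeline.generate(dataset)
        return dataset

sdg = SDG([Pipeline("-p1"), Pipeline("-p2")])
print(sdg.generate(["x"]))  # -> ['x-p1-p2']
```

The counter-argument below is that this is exactly what a single `Pipeline` already does with its blocks, so the extra layer adds nothing until non-linear composition is actually needed.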
Currently we only support linear pipeline chaining (and each pipeline only supports linear chaining of blocks). But in the future we want to move towards supporting more complex forms of composition (arbitrary DAGs).
```python
ds = Dataset.from_list(samples)
block_params = BlockConfigParams(client, "mixtral", teacher_model)
flow_params = FlowParams(client, "mixtral", teacher_model)
block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())
```
I like this proposal to enhance the clarity and convenience
> Remove SDG - we don't need both SDG and Pipeline since Pipeline can already do everything SDG can do ... Currently we only support linear pipeline chaining (and each pipeline only supports linear chaining of the block). But in future we want to move towards supporting composition of complex ways to compose (any DAGs)
I can totally see that, but consider the YAGNI principle:

> "Always implement things when you actually need them, never when you just foresee that you [will] need them."
It's very easy to add this concept later when we actually add support for more complex compositions. And at that point, we can consider an API design in the context of a more concrete proposal.
Right now, this concept adds nothing but confusion - e.g. should you chain together `MMLUBenchFlow` and `SynthKnowledgeFlow` in a single pipeline, or make each a pipeline and chain them together with `SDG`? Either!

Also, I don't think `SDG` is a great name. It's not a noun. It doesn't convey its role in terms of composition/orchestration of pipelines. This is the sort of thing we can get right later when we need to add it.
> ```python
> ds = Dataset.from_list(samples)
> block_params = BlockConfigParams(client, "mixtral", teacher_model)
> flow_params = FlowParams(client, "mixtral", teacher_model)
> block_configs = MMLUBenchFlow(flow_params).render()
> block_configs.extend(SynthKnowledgeFlow(flow_params).render())
> ```
>
> I like this proposal to enhance the clarity and convenience
Thank you.

Just a note - the `BlockConfigParams` line should not be there, that's just a leftover from my editing.
- Model `Flow` as a block config template - it would be more clear if we reinforced the idea that a "flow" is a template of a block config sequence - a `render()` method makes sense to me, and an extensible `params` object for the common case of instantiating multiple flows

  In #86, flows are now YAML files with `block_configs`

- Create a pipeline from a sequence of flows - add a `Pipeline.from_flows()` convenience class method to `Pipeline` that knows how to render block configs from a sequence of flows

  In #86, this is now `PipelineContext`

So it now looks like:

```python
ds = Dataset.from_list(samples)
ctx = PipelineContext(client, "mixtral", teacher_model)
knowledge_pipe = Pipeline.from_flows(ctx, [MMLU_BENCH_FLOW, SYNTH_KNOWLEDGE_FLOW])
gen_data = knowledge_pipe.generate(ds)
```
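For illustration, a `from_flows()` class method along these lines could simply render each flow and concatenate the resulting block configs into one linear pipeline. This is a hypothetical sketch, not the implementation from #86 - the `DummyFlow` helper and the shape of the configs are made up for the example:

```python
class Pipeline:
    def __init__(self, ctx, block_configs):
        self.ctx = ctx
        self.block_configs = block_configs

    @classmethod
    def from_flows(cls, ctx, flows):
        # Render each flow's block configs and concatenate them
        # into a single linear sequence.
        configs = []
        for flow in flows:
            configs.extend(flow.render(ctx))
        return cls(ctx, configs)

# Hypothetical flow object with a render() method, for demonstration only.
class DummyFlow:
    def __init__(self, names):
        self.names = names

    def render(self, ctx):
        return [{"name": n} for n in self.names]

ctx = object()  # stand-in for a real PipelineContext
pipe = Pipeline.from_flows(ctx, [DummyFlow(["a"]), DummyFlow(["b", "c"])])
print([c["name"] for c in pipe.block_configs])  # -> ['a', 'b', 'c']
```

The key design point is that composition happens at config-rendering time, so at runtime there is only ever one flat `Pipeline` and no second orchestration layer.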
Consider the nouns:

- `Block` - its `generate()` method transforms an input dataset and returns an output dataset
- `Pipeline` - has a `generate()` method in which it instantiates and invokes blocks in turn, passing the input dataset and collecting the output
- `SDG` - its `generate()` method calls pipelines in turn

Proposals:

- Remove `SDG` - we don't need both `SDG` and `Pipeline` since `Pipeline` can already do everything `SDG` can do
- Model `Flow` as a block config template - it would be more clear if we reinforced the idea that a "flow" is a template of a block config sequence - a `render()` method makes sense to me, and an extensible `params` object for the common case of instantiating multiple flows
- Create a pipeline from a sequence of flows - add a `Pipeline.from_flows()` convenience class method to `Pipeline` that knows how to render block configs from a sequence of flows
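The "flow as a template of a block config sequence" idea can be sketched as follows. This is an illustrative mock-up with made-up block names and a simplified `params` dict, assuming a `render()` that fills shared parameters into each block config:

```python
# Sketch of a flow as a block-config template: render() substitutes
# shared params into each config template. Block names and the params
# shape are hypothetical, for illustration only.

class Flow:
    # Each subclass declares a sequence of block config templates.
    templates = []

    def __init__(self, params):
        self.params = params  # shared settings, e.g. which model to use

    def render(self):
        # Merge the shared params into every template.
        return [{**tmpl, **self.params} for tmpl in self.templates]

class MMLUBenchFlow(Flow):
    templates = [{"block": "gen_mmlubench"}]

class SynthKnowledgeFlow(Flow):
    templates = [{"block": "gen_knowledge"}, {"block": "filter_faithfulness"}]

params = {"model": "mixtral"}
configs = MMLUBenchFlow(params).render() + SynthKnowledgeFlow(params).render()
print([c["block"] for c in configs])
# -> ['gen_mmlubench', 'gen_knowledge', 'filter_faithfulness']
```

Because a flow is pure configuration data, this also explains why #86 could move flows out of code entirely and into YAML files.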
Resolving this issue would require an update to the design doc and a code change.

It would definitely be better to do this before users of the API proliferate.