Open satchelbaldwin opened 4 months ago
@satchelbaldwin this looks good to me, thanks for writing it up.
For the second bullet: I think experimenting with the run_code prompt could make a big difference. Based on that prompt, I'm not sure why run_code was used to answer my request about updating a parameter value, since that isn't in the list of things the tool can do. The prompt implies that the tool should be used to:
a) answer questions about the kernel state
b) double-check whether code will work before returning it as a final answer
We could be more explicit and say "it can only be used for these things", though I feel like that's too strict. Maybe we could add something about "use this for simpler tasks" versus generate_code for more complex ones (?). We can tune this later.
I think we should consider a flag on the run_code tool to enable/disable rollback so the agent can in some cases decide what to do. For example, if it wants to debug its own code before generating code, it would want to roll things back on its own. That said, in my testing I was never able to get it to do the "double check if code will work" thing.
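To make the flag idea concrete, here is a rough sketch of how a rollback switch could behave. It is not the actual kernel code: the run_code function below, its rollback parameter, and the plain dict standing in for kernel state are all assumptions made for illustration.

```python
import copy


def run_code(code: str, state: dict, rollback: bool = True) -> dict:
    """Hypothetical stand-in for the run_code tool (not the real kernel API).

    Executes `code` against a working copy of `state`. With rollback=True
    the caller's state is left untouched; with rollback=False the changes
    are committed back into `state`.
    """
    # Work on a deep copy so a failure or a rollback never touches `state`.
    working = copy.deepcopy(state)
    # Names assigned by the code land in `working` (used as the locals dict).
    exec(code, {}, working)
    if not rollback:
        # Commit: replace the caller's state with the post-execution state.
        state.clear()
        state.update(working)
    # Always return the post-execution namespace so the agent can inspect it.
    return working


state = {"beta": 0.1}

# Default: the code runs, the result is visible, but state is rolled back.
preview = run_code("beta = beta * 2", state)

# Opt out of rollback when the modification is intended to persist.
run_code("beta = 0.5", state, rollback=False)
```

Defaulting rollback to True would keep today's behavior for read-only and "double check" runs, and the agent would have to opt in explicitly to persist a change.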
@satchelbaldwin Looks good.
Note that "generate_code" is not currently a "base" tool at the kernel level, but instead is redefined in each context that wants to generate code, while run_code IS a base tool. As part of this we should have freedom to make fundamental changes to the tools, including possibly moving them all to base tools, combining, separating, "subtooling" etc.
Let's sync up before you jump into the Vue/UI code. There are some as-yet undeployed changes that we should be aware of so we don't paint ourselves into a corner.
Whenever the agent decides to run code, to avoid polluting state or causing accidental or unwanted modifications, the state is rolled back shortly after. This is desirable in many cases, including when code is shown to the user and not executed, or for read-only code execution.
For code execution that is intended to modify state, this behavior needs to support modifications that are not rolled back. To enable this feature, work must be done in a couple of places.