LLM Code Execution Improvements: Side Effects and State Modifications

satchelbaldwin commented 4 months ago

Whenever the agent decides to run code, to avoid polluting state or causing accidental or unwanted modifications, the state is rolled back shortly after. This is desirable in many cases, including when code is shown to the user and not executed, or for read-only code execution.

For code execution that is intended to modify state, this behavior needs to support modifications that are not rolled back. To enable this feature, work must be done in a couple of places.

- Cells need to support showing the code that was run when it happens.
  - UI work: here
- Distinguishing which cases run code vs. generate code.
- run_code without rollback: here, which will show the code in the cell as above.
- A shallow undo function with checkpointing, to roll back on demand, with editing capabilities for code that was run: here
- Later, a deeper undo that covers cells getting moved around and other corner cases.
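
To make the shallow undo idea concrete, here is a minimal sketch, assuming the kernel state can be represented as a plain dict that is safe to deep-copy. The `CheckpointStack` class and its methods are hypothetical names for illustration and are not part of the existing code.

```python
import copy


class CheckpointStack:
    """Hypothetical shallow-undo helper: keeps deep copies of a state dict."""

    def __init__(self, state):
        self.state = state
        self._checkpoints = []  # list of (code, snapshot-before-run) pairs

    def run(self, code):
        # Snapshot first, so that this run can be undone later.
        self._checkpoints.append((code, copy.deepcopy(self.state)))
        # Execute with the state dict as the local namespace (assumption:
        # the real kernel would dispatch to a subkernel instead of exec).
        exec(code, {}, self.state)

    def undo(self):
        # Restore the state from before the most recent run and hand back
        # its code, so that it can be edited and run again.
        code, snapshot = self._checkpoints.pop()
        self.state.clear()
        self.state.update(snapshot)
        return code
```

Under these assumptions, `undo()` returns the code of the run it reverted, which is what would let a cell offer "edit and re-run" after rolling back.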

brandomr commented 4 months ago

@satchelbaldwin this looks good to me, thanks for writing it up.

For the second bullet--I feel like experimenting with editing the run_code prompt could make a big difference. Based on that prompt, I'm not sure why run_code was used to answer my request about updating a parameter value since it isn't in the list of things the tool can do: it's implied that the tool should be used to:

a) answer questions about the kernel state
b) double check if code will work before returning it as a final answer

We could be more explicit and say "it can only be used for these things", though I feel like that's too strict. Maybe add something about "use this for simpler tasks" versus generate_code for more complex ones (?). We can tune this later.

I think we should consider a flag on the run_code tool to enable/disable rollback so the agent can in some cases decide what to do. For example, if it wants to debug its own code before generating code, it would want to roll things back on its own. That said, I was never able to get it in my testing to do the "double check if code will work" thing.
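
A rough sketch of what such a flag might look like, assuming the state is a plain dict and that code is executed in-process. The signature below, including the `rollback` parameter name and its default, is an assumption for illustration and not the current run_code definition.

```python
import copy


def run_code(code, state, rollback=True):
    """Hypothetical run_code: execute `code` against `state` (a dict).

    With rollback=True the state is restored afterwards, even if the code
    raises. Returns a shallow copy of the namespace right after execution.
    """
    snapshot = copy.deepcopy(state) if rollback else None
    try:
        exec(code, {}, state)
        result = dict(state)
    finally:
        if rollback:
            # Put the state back exactly as it was before the call.
            state.clear()
            state.update(snapshot)
    return result
```

With the default, the agent can inspect the result of a trial run without leaving changes behind; passing `rollback=False` keeps the modification, which matches the "update a parameter value" case.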

mattprintz commented 4 months ago

@satchelbaldwin Looks good.

Note that "generate_code" is not currently a "base" tool at the kernel level, but instead is redefined in each context that wants to generate code, while run_code IS a base tool. As part of this we should have freedom to make fundamental changes to the tools, including possibly moving them all to base tools, combining, separating, "subtooling" etc.
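
For illustration only, the base-tool versus per-context split could be pictured as below. The class names, the `tools()` method, and the return values are all hypothetical and do not reflect how the kernel actually registers tools.

```python
class BaseToolset:
    """Hypothetical kernel-level tools available in every context."""

    def tools(self):
        return {"run_code": self.run_code}

    def run_code(self, code):
        # Placeholder body; a real tool would execute the code.
        return f"ran: {code}"


class ExampleContext(BaseToolset):
    """Hypothetical context that defines its own generate_code."""

    def tools(self):
        # Inherit the base tools and add the context-specific one.
        return {**super().tools(), "generate_code": self.generate_code}

    def generate_code(self, request):
        # Placeholder body; a real tool would ask the LLM for code.
        return f"# code for: {request}"
```

In this picture, moving generate_code to a base tool would mean defining it once on the base class instead of in every context.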

Let's sync up before you jump into the Vue/UI code. There are some as-yet undeployed changes that we should be aware of so we don't paint ourselves into a corner.

jataware / beaker-kernel

LLM Code Execution Improvements: Side Effects and State Modifications #53