Vision Agent v3 - GitHub Issues

The pros and cons of Data Interpreter and Agent Coder listed below led to this design.

Data Interpreter Cons:

The subtasks are too small: it takes tasks that GPT-4o can handle fine and subdivides them into 5-7 unnecessarily small subtasks
When the subtasks are small, it tends to overwrite correct code. For example, it will write correct code for subtask 1 and then remove it by subtask 5

Data Interpreter Pros:

Tool recommendation helps scale to more tools
Long-term memory helps the agent framework "learn" to some degree
Planning helps with tool choice and task decomposition when it is executed correctly

Agent Coder Pros:

Does very well on coding tasks (first place on HumanEval)

Agent Coder Cons:

Can't plan long term
No tool recommendation, needs all tools at once
No long-term memory to help it "learn"

This version of Vision Agent keeps the planning step but carries out the entire plan in one shot using the Agent Coder framework. It uses the plan for tool recommendation and also allows long-term memory lookup. To plan beyond the initial plan, it reflects on the result and decides whether it needs to execute an additional plan.
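
The control flow described above could be sketched roughly as follows. This is only an illustration of the idea, not the actual vision-agent implementation: `run_agent`, `AgentState`, and every `stub_*` function are hypothetical names, and the stubs stand in for what would be LLM calls, a tool index, and a memory store.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AgentState:
    """Hypothetical container for what the agent accumulates across rounds."""
    task: str
    plan: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)
    code: str = ""
    rounds: int = 0


def run_agent(
    task: str,
    planner: Callable[[str], List[str]],
    recommend_tools: Callable[[str], List[str]],
    recall: Callable[[str], List[str]],
    coder: Callable[[List[str], List[str], List[str], str], str],
    reflect: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> AgentState:
    state = AgentState(task=task)
    request = task
    for _ in range(max_rounds):
        # 1. Plan once for the current request.
        steps = planner(request)
        # 2. Use the plan to pick only the tools that are relevant.
        tools = sorted({tool for step in steps for tool in recommend_tools(step)})
        # 3. Look up anything useful in long-term memory.
        memories = recall(request)
        # 4. Write code for the whole plan in one shot (no per-subtask coding).
        state.code = coder(steps, tools, memories, state.code)
        state.plan.extend(steps)
        state.tools = sorted(set(state.tools) | set(tools))
        state.rounds += 1
        # 5. Reflect: either stop, or get a follow-up request to plan for.
        follow_up = reflect(task, state.code)
        if follow_up is None:
            break
        request = follow_up
    return state


# Stubs below are placeholders (assumptions) so the sketch runs without an LLM.
def stub_planner(request: str) -> List[str]:
    return [f"load image for: {request}", f"detect objects for: {request}"]


def stub_recommend_tools(step: str) -> List[str]:
    return ["load_image"] if step.startswith("load") else ["object_detector"]


def stub_recall(request: str) -> List[str]:
    return []  # empty memory


def stub_coder(steps, tools, memories, previous_code: str) -> str:
    # Appends one comment line per plan step to the existing code.
    return previous_code + "".join(f"# {step}\n" for step in steps)


def stub_reflect(task: str, code: str) -> Optional[str]:
    # Asks for one additional plan until four steps have been coded.
    return None if code.count("#") >= 4 else "count the detections"


result = run_agent(
    "count cars in the image",
    stub_planner,
    stub_recommend_tools,
    stub_recall,
    stub_coder,
    stub_reflect,
)
print(result.rounds, result.tools)
```

The point of the sketch is that the coder receives the full list of steps in a single call, which avoids the Data Interpreter problem of later subtasks overwriting earlier correct code, while the planner still drives tool recommendation and the reflection step decides whether another plan is needed.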

landing-ai / vision-agent

Vision Agent v3 #89