microsoft / autogen

A programming framework for agentic AI 🤖
https://microsoft.github.io/autogen/
Creative Commons Attribution 4.0 International

[Roadmap]: Complex Tasks Work Items (GAIA) #1369

Closed afourney closed 1 month ago

afourney commented 8 months ago

General tasks, work items, and issues related to solving complex tasks, as defined by the GAIA benchmark and executed by AutoGenBench (#973).

### GAIA Refinement (things to fix from the March 1st run)
- [x] Check whether files are too large before putting them into the initial question prompt. If a file is too large, instruct the agents to open it with web_surfer using a file:/// URI.
- [x] Prevent or greatly discourage parallel tool use in WebSurfer. Through prompting or code, ensure only one function is called at a time.
- [x] Ensure that, when code is written by Assistant, it gets a chance to run in Computer_Terminal (hardcode this! It has been stubbornly hard to fix with prompting.)
- [ ] Ensure that, when Computer_Terminal throws a ModuleNotFoundError, the Assistant responds with instructions to do a pip install (hardcode this! It has been stubbornly hard to fix with prompting.)
- [x] Update Computer_Terminal's default response to indicate that the proper way to address it is by providing it code to run in markdown code blocks.
- [x] Create a 50- or 60-question subset of GAIA, across levels, for rapid iteration (e.g., 20, 20, and 10 problems from levels 1, 2, and 3, respectively)
- [ ] Complete and merge #1682
- [x] Navigational search doesn't work when links have parentheses in them. Fix regex here: https://github.com/microsoft/autogen/blob/63a01753114696ed51f9f7740d0f0fcc33c0c8b8/autogen/agentchat/contrib/web_surfer.py#L157-L159
- [x] Filter data URIs from Markdown
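
For illustration only, here is a minimal sketch of how two of the items above (the hardcoded ModuleNotFoundError reply and the data URI filter) might look. It assumes the terminal output and page Markdown are available as plain strings. The helper names `pip_install_hint` and `strip_data_uris` are hypothetical and are not part of AutoGen.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; these names do not exist in AutoGen.

# Matches e.g. "ModuleNotFoundError: No module named 'pandas'"
_MISSING_MODULE = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")

# Matches Markdown links/images whose target is a data: URI,
# e.g. ![alt](data:image/png;base64,AAAA)
_DATA_URI_LINK = re.compile(r"(!?\[[^\]]*\])\(data:[^)]*\)")


def pip_install_hint(terminal_output: str) -> Optional[str]:
    """Return a canned pip-install instruction if the output reports a missing module."""
    match = _MISSING_MODULE.search(terminal_output)
    if match is None:
        return None
    # Use the top-level package; the pip name can differ from the import name.
    package = match.group(1).split(".")[0]
    fence = "`" * 3  # build the Markdown fence without nesting literal backticks
    return (
        f"The module '{package}' is not installed. "
        f"Run the following, then re-run the code:\n"
        f"{fence}sh\npip install {package}\n{fence}"
    )


def strip_data_uris(markdown: str) -> str:
    """Replace data: URI targets in Markdown links and images with a short placeholder."""
    return _DATA_URI_LINK.sub(r"\1(data:...)", markdown)
```

The reply is wrapped in a Markdown code block because Computer_Terminal only runs code provided that way. A real fix would also need a mapping for packages whose pip name differs from the import name.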
### Immediate Priority
- [ ] Update the SocietyOfMind template to use the latest AutoGen features, such as `initiate_chats` and IPython executors
- [ ] https://github.com/microsoft/autogen/issues/1630
- [ ] https://github.com/microsoft/autogen/issues/1671
### High Priority
- [ ] https://github.com/microsoft/autogen/issues/1481
- [ ] https://github.com/microsoft/autogen/issues/1765
### Longer-Term Priority
- [x] Re-add "find-in-page" functionality to WebSurferAgent
- [ ] https://github.com/microsoft/autogen/issues/1670
- [ ] Develop a (possibly static, possibly internal) dashboard for tracking progress.
- [ ] https://github.com/microsoft/autogen/issues/1550
- [ ] Evaluate and track SLMs (7B, 13B models) and report results
- [ ] Support workflows from AutoGen Studio in AutoGenBench (users create and tweak their agent workflows and can automatically run them on AutoGenBench)
### Other Tasks
- [ ] Onboard other complex task benchmarks, such as [WebArena](https://webarena.dev/), [MiniWoB++](https://github.com/Farama-Foundation/miniwob-plusplus), and [Mind2Web](https://osu-nlp-group.github.io/Mind2Web/)

### Completed tasks
- [x] Create a script for collating GAIA runs into the JSON format needed to submit to the leaderboard.
- [x] Create a shared folder, and establish a convention, for sharing latest run logs etc.
- [x] Create a protected branch of AutoGen (e.g., `complex_tasks`) for coordinating experiential work on complex tasks.
- [ ] https://github.com/microsoft/autogen/issues/1368
- [ ] https://github.com/microsoft/autogen/issues/1477
- [ ] https://github.com/microsoft/autogen/issues/1488
- [ ] https://github.com/microsoft/autogen/issues/1489
- [x] Produce a full set of validation benchmark numbers. Not just level 1.
- [ ] https://github.com/microsoft/autogen/issues/1562
- [ ] https://github.com/microsoft/autogen/issues/1563
- [ ] https://github.com/microsoft/autogen/issues/1549
- [ ] https://github.com/microsoft/autogen/issues/1478
- [x] Experiment with alternate GroupChat prompts (like GroupChat moderator, or perhaps graph-based GroupChat)
- [x] Get WebSurfer to make more use of `answer_from_page` or `summarize_page` rather than repeated calls to `page_down`
- [x] Get the main GAIA agent to be more willing to output an educated "guess" of the final answer rather than just saying "unable to determine"

afourney commented 8 months ago

When items are converted into issues, it means that work is in progress.