General tasks, work items, and issues related to solving complex tasks, as defined by the GAIA benchmark and executed with AutoGenBench (#973).
### GAIA Refinement (things to fix from the March 1st run)
- [x] Check whether files are too large before putting them into the initial question prompt. If a file is too large, instruct the agent to open it with web_surfer using a file:/// URI.
- [x] Prevent or greatly discourage parallel tool use in WebSurfer. Through prompting or code, ensure only one function is called at a time.
- [x] Ensure that, when code is written by Assistant, it gets a chance to run in Computer_Terminal (hardcode this; it has been stubbornly hard to fix with prompting)
- [ ] Ensure that, when Computer_Terminal throws a ModuleNotFoundError, the Assistant responds with instructions to do a pip install (hardcode this; it has been stubbornly hard to fix with prompting)
- [x] Update Computer_Terminal's default response to indicate that the proper way to address it is by providing it code to run in markdown code blocks.
- [x] Create a 50- or 60-question subset of GAIA, across levels, for rapid iteration (e.g., 20, 20, and 10 problems from levels 1, 2, and 3, respectively)
- [ ] Complete and merge #1682
- [x] Navigational search doesn't work when links have parentheses in them. Fix regex here: https://github.com/microsoft/autogen/blob/63a01753114696ed51f9f7740d0f0fcc33c0c8b8/autogen/agentchat/contrib/web_surfer.py#L157-L159
- [x] Filter data URIs from Markdown
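
The still-open ModuleNotFoundError item above asks for a hardcoded response rather than a prompt fix. A minimal sketch of what that check could look like follows. The function name `pip_install_hint` and the wording of the reply are hypothetical and not part of AutoGen; only the error message format is standard CPython behavior.

```python
import re
from typing import Optional

# CPython reports missing imports as:
#   ModuleNotFoundError: No module named 'pandas'
_MISSING_MODULE = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")

# Built at runtime so this example contains no literal nested code fence.
_FENCE = "`" * 3


def pip_install_hint(terminal_output: str) -> Optional[str]:
    """Return a canned reply containing a pip install command.

    Hypothetical helper: returns None when the output shows no missing module.
    """
    match = _MISSING_MODULE.search(terminal_output)
    if match is None:
        return None
    # For dotted names such as 'yaml.nodes', keep only the top-level package.
    top_level = match.group(1).split(".")[0]
    return (
        f"The module '{top_level}' is not installed. "
        "Run the following, then re-run the previous code:\n\n"
        f"{_FENCE}sh\npip install {top_level}\n{_FENCE}"
    )
```

One known limitation of this sketch: the import name is not always the name of the package on PyPI (for example, `yaml` is provided by `PyYAML`), so a real implementation would likely need a small mapping for common mismatches.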
### Immediate Priority
- [ ] Update the SocietyOfMind template to use the latest AutoGen features, such as `initiate_chats` and IPython executors
- [ ] https://github.com/microsoft/autogen/issues/1630
- [ ] https://github.com/microsoft/autogen/issues/1671
### Longer-Term Priority
- [x] Re-add "find-in-page" functionality to WebSurferAgent
- [ ] https://github.com/microsoft/autogen/issues/1670
- [ ] Develop a (possibly static, possibly internal) dashboard for tracking progress.
- [ ] https://github.com/microsoft/autogen/issues/1550
- [ ] Evaluate and track SLMs (7B and 13B models) and report results
- [ ] Support workflows from AutoGen Studio in AutoGenBench (users create and tweak their agent workflows and can automatically run them on AutoGenBench)
### Other Tasks
- [ ] Onboard other complex task benchmarks, such as [WebArena](https://webarena.dev/), [MiniWoB++](https://github.com/Farama-Foundation/miniwob-plusplus), and [Mind2Web](https://osu-nlp-group.github.io/Mind2Web/)
```[tasklist]
### Completed tasks
- [x] Create a script for collating GAIA runs into the JSON format needed to submit to the leaderboard.
- [x] Create a shared folder, and establish a convention, for sharing latest run logs etc.
- [x] Create a protected branch of AutoGen (e.g., `complex_tasks`) for coordinating experimental work on complex tasks.
- [ ] https://github.com/microsoft/autogen/issues/1368
- [ ] https://github.com/microsoft/autogen/issues/1477
- [ ] https://github.com/microsoft/autogen/issues/1488
- [ ] https://github.com/microsoft/autogen/issues/1489
- [x] Produce a full set of validation benchmark numbers, not just level 1.
- [ ] https://github.com/microsoft/autogen/issues/1562
- [ ] https://github.com/microsoft/autogen/issues/1563
- [ ] https://github.com/microsoft/autogen/issues/1549
- [ ] https://github.com/microsoft/autogen/issues/1478
- [x] Experiment with alternate GroupChat prompts (like GroupChat moderator, or perhaps graph-based GroupChat)
- [x] Get WebSurfer to make more use of `answer_from_page` or `summarize_page` rather than repeated calls to `page_down`
- [x] Get the main GAIA agent to be more willing to output an educated "guess" of the final answer rather than just saying "unable to determine"
```
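
For reference, the collation step in the first completed task can be sketched as below. This is not the actual script. It assumes the leaderboard accepts JSON Lines with one record per task, and the field names `task_id` and `model_answer` are assumptions that should be checked against the leaderboard's submission instructions. The helper name `collate_answers` is hypothetical.

```python
import json
from typing import Mapping


def collate_answers(answers: Mapping[str, str]) -> str:
    """Serialize {task_id: final_answer} as JSON Lines, one record per task.

    Hypothetical helper; the field names below are assumptions.
    """
    lines = []
    # Sort by task id so repeated runs produce byte-identical files.
    for task_id in sorted(answers):
        record = {"task_id": task_id, "model_answer": answers[task_id]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)
```

A caller would first extract each task's final answer from the run logs, then write the returned string to a file for submission.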