mnotgod96 / AppAgent

AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.
https://appagent-official.github.io/
MIT License

Comparison with AutoDroid (a similar tool)? #8

Open yuanchun-li opened 8 months ago

yuanchun-li commented 8 months ago

Hello, good work!

I'm one of the authors of AutoDroid, an LLM-based Android task automation approach released several months before AppAgent. We did not advertise our work, so it didn't get as much attention as yours. However, I'm glad to see that people are excited about this direction. Thanks for your contribution to the community :)

I noticed that the exploration-augmented method of AppAgent is quite similar to AutoDroid's, although AutoDroid is purely text-based (using Vicuna, GPT-3.5, and GPT-4) while AppAgent is based on GPT-4V. I'm curious about the benefits and challenges of using multimodal models for such UI automation tasks. Have you tested this? Or could you briefly comment on it?
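
For readers unfamiliar with the distinction, here is a minimal, hypothetical sketch of the two prompting styles; it is not taken from either codebase, and the prompts, helper parameters, and model IDs are illustrative assumptions only:

```python
# Hypothetical sketch contrasting text-only vs. multimodal UI-automation prompting.
# Prompts, model names, and parameters are illustrative; neither function reproduces
# the actual AutoDroid or AppAgent implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def text_only_step(ui_elements: str, task: str) -> str:
    """AutoDroid-style: the current UI is serialized to text for a text-only LLM."""
    prompt = (
        f"Task: {task}\n"
        f"Current UI elements (one per line):\n{ui_elements}\n"
        "Which element should be acted on next, and with what action?"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def multimodal_step(screenshot_b64: str, task: str) -> str:
    """AppAgent-style: a labeled screenshot is sent to a multimodal model (GPT-4V)."""
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task}\nDecide the next action on this screen."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```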

Since I didn't find any comparison with AutoDroid in your paper or GitHub repo, maybe we can have an open discussion here.

AutoDroid website: https://autodroid-sys.github.io/
AutoDroid paper: https://arxiv.org/abs/2308.15272
AutoDroid code: https://github.com/MobileLLM/AutoDroid

Best, Yuanchun

yuanchun-li commented 8 months ago

BTW, comparisons with several important prior works (AitW, Auto-UI, CogAgent, etc.) are also missing. They address the same task automation problem and are also based on multimodal models. I think it would be good to give them credit.