microsoft / UFO

A UI-Focused Agent for Windows OS Interaction.
https://arxiv.org/abs/2402.07939
MIT License
7.17k stars 871 forks

Train or fine-tune models for computer automation agents #11

Open James4Ever0 opened 5 months ago

James4Ever0 commented 5 months ago

Hello, Microsoft UFO team! Excellent work on bringing AI closer to the Windows system. I am doing similar work, training custom GPT-2 models on computer automation datasets.

I have created two comprehensive datasets, one for terminal environments and one for GUI environments. My strategy is to generate data from random keyboard and mouse actions, collect the resulting observations, and mix them with other textual datasets.

This naive attempt reflects my strong interest in computer agents. I like the idea of GUI agent benchmark systems such as WindowsBench, and I have thought about building a reward system based on program exit codes or VimGolf.
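To make the exit-code idea concrete, here is a minimal sketch of such a reward, assuming a binary scheme (1.0 for exit code 0, 0.0 otherwise). The function name `exit_code_reward` and the timeout value are hypothetical choices for illustration, not part of any existing system.

```python
import subprocess
import sys
from typing import Sequence


def exit_code_reward(command: Sequence[str], timeout: float = 10.0) -> float:
    """Return 1.0 if the command exits with code 0, otherwise 0.0.

    Hypothetical helper: the binary reward scheme is an assumption.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A program that hangs or cannot be launched earns no reward.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    # Two throwaway child processes that exit with fixed codes.
    print(exit_code_reward([sys.executable, "-c", "raise SystemExit(0)"]))
    print(exit_code_reward([sys.executable, "-c", "raise SystemExit(3)"]))
```

A binary signal like this is sparse, so in practice it would probably need to be combined with something denser, such as a VimGolf-style keystroke count.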

If you find my suggestion useful, I would like to hear from you! Furthermore, if cooperation is possible, I would be thrilled to join your team in building better computer agents!


Update: Google has published an unsupervised action-space training method called Genie. It looks highly applicable to computer agents.

vyokky commented 5 months ago

Hi @James4Ever0, thanks for getting in touch. We are definitely interested in training a local model to enable faster inference. Would you mind sharing more context and perhaps a snippet of the dataset you created? We welcome cooperation and contributions if this is a good fit.

James4Ever0 commented 4 months ago

The terminal dataset consists of a unique trajectory identifier, observations of the terminal, and the actions taken by the agent.

The observation can either be the full view of the terminal or only the updated lines, with line numbers surrounded by square brackets.
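As a rough illustration of the updated-lines format, the sketch below parses lines such as `[0 ] / #` into a mapping from line number to content. It assumes the layout seen in the preview in this comment (a bracketed, space-padded line number, then one space, then the text); the name `parse_updated_lines` is hypothetical.

```python
import re
from typing import Dict

# Matches "[0 ] / #" or "[10]": a bracketed, optionally space-padded
# line number, optionally followed by one space and the line content.
# The single-space separator is an assumption based on the preview.
_LINE_RE = re.compile(r"^\[(\d+)\s*\](?: (.*))?$")


def parse_updated_lines(block: str) -> Dict[int, str]:
    """Map terminal line numbers to their updated content."""
    updated: Dict[int, str] = {}
    for raw in block.splitlines():
        match = _LINE_RE.match(raw)
        if match:
            # Lines with no content after the bracket become "".
            updated[int(match.group(1))] = match.group(2) or ""
    return updated


if __name__ == "__main__":
    print(parse_updated_lines("Updated content:\n[0 ] / #\n[1 ]"))
```

Lines that do not start with a bracketed number, such as the cursor position or the `Updated lines:` summary, are ignored.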

The actions taken by the agent are written in Godlang, a language that lets an LLM interface with TUIs and GUIs.

Preview of the terminal dataset:

====================JSON RESPONSES====================
identifier received from websocket 77bf0b60-056d-4a15-afa4-62431d6ba773
====================JSON RESPONSES====================
Cursur at: (0, 0)
Updated content:
[0 ]
[1 ]
[2 ]
[3 ]
[4 ]
[5 ]
[6 ]
[7 ]
[8 ]
[9 ]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
Updated lines: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Fullscreen:
====================JSON RESPONSES====================
Cursur at: (0, 4)
Updated content:
[0 ] / #
Updated lines: 0
Fullscreen:
/ #
VIEW
SPECIAL CTRL+C
SPECIAL TAB
VIEW
SPECIAL CTRL+6
Command list: ['VIEW', 'SPECIAL CTRL+C', 'SPECIAL TAB', 'VIEW', 'SPECIAL CTRL+6']
Regular sleep for 0.200000 seconds
Exiting reading action list because of 'VIEW' command
WAIT 0.548
TYPE n
REM Random actions
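The command lines in the preview above could be split into a verb and an optional argument as sketched below. This assumes a "VERB [argument]" shape inferred only from the commands shown (`VIEW`, `SPECIAL CTRL+C`, `WAIT 0.548`, `TYPE n`, `REM Random actions`); it is not a Godlang specification, and `Command` and `parse_commands` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Command:
    verb: str                       # e.g. "VIEW", "SPECIAL", "WAIT"
    argument: Optional[str] = None  # e.g. "CTRL+C", "0.548"; None if absent


def parse_commands(script: str) -> List[Command]:
    """Split each non-blank line into a verb and an optional argument."""
    commands: List[Command] = []
    for line in script.splitlines():
        if not line.strip():
            continue
        # Everything after the first space is kept verbatim as the argument.
        verb, _, argument = line.partition(" ")
        commands.append(Command(verb, argument or None))
    return commands


if __name__ == "__main__":
    for command in parse_commands("VIEW\nSPECIAL CTRL+C\nWAIT 0.548\nTYPE n"):
        print(command)
```

Verbs are treated as opaque strings here; interpreting them (for example, converting the `WAIT` argument to a float) would need the actual language definition.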

James4Ever0 commented 4 months ago

After extracting the RAR file, you will find folders named by timestamp, each containing these files:

hid_record.jsonl     video_record.mp4        video_timestamps.json
hid_timestamps.json  video_record_script.sh

video_record.mp4 is a 30 fps video at 1280x768 resolution. Each frame is a screenshot, but the screenshots were not captured at the playback rate, so the real capture times are in video_timestamps.json.

hid_record.jsonl contains one JSON object per line, each listing the HID events recorded at that step:

{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": [["key_press", "Key.ctrl"], ["key_press", "Key.shift"], ["key_press", "Key.page_up"], ["key_release", "Key.page_up"], ["key_release", "Key.shift"], ["key_release", "Key.ctrl"]]}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": [["mouse_move", [782, 682]]]}
{"HIDEvents": []}
{"HIDEvents": []}
{"HIDEvents": [["key_press", "Key.alt"], ["key_press", "'l'"], ["key_release", "'l'"], ["key_release", "Key.alt"]]}

video_timestamps.json contains the corresponding UNIX timestamps for every frame recorded:

[
    1685664003.6361628,
    1685664003.6745877,
    1685664003.6882446,
    1685664003.715868,
    1685664003.7464304,
    1685664003.7711987,
    1685664003.7833188,
    1685664003.8149195,
    ...
]

hid_timestamps.json has the same format as video_timestamps.json and contains one timestamp for every entry in hid_record.jsonl, including the empty ones.
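Since the frames and the HID entries each carry their own timestamps, the two streams can be aligned by time. The sketch below is one possible way to do that, assuming hid_timestamps.json holds exactly one timestamp per line of hid_record.jsonl in the same order, and that both timestamp lists are sorted ascending. The function names are hypothetical.

```python
import bisect
import json
from typing import Dict, List, Tuple

HidEntry = Tuple[float, list]  # (UNIX timestamp, list of HID events)


def load_hid_entries(record_path: str, timestamps_path: str) -> List[HidEntry]:
    """Pair each line of hid_record.jsonl with its timestamp."""
    with open(timestamps_path) as f:
        timestamps = json.load(f)
    with open(record_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    if len(records) != len(timestamps):
        raise ValueError("record and timestamp counts differ")
    return [(t, r["HIDEvents"]) for t, r in zip(timestamps, records)]


def group_events_by_frame(
    hid_entries: List[HidEntry], frame_timestamps: List[float]
) -> Dict[int, list]:
    """Attach each non-empty HID entry to the latest frame captured at or before it."""
    by_frame: Dict[int, list] = {}
    for timestamp, events in hid_entries:
        if not events:
            continue
        # Index of the last frame whose timestamp is <= this entry's timestamp.
        index = bisect.bisect_right(frame_timestamps, timestamp) - 1
        if index < 0:
            continue  # entry precedes the first recorded frame
        by_frame.setdefault(index, []).extend(events)
    return by_frame
```

With the files from one recording folder, `group_events_by_frame(load_hid_entries("hid_record.jsonl", "hid_timestamps.json"), json.load(open("video_timestamps.json")))` would give, for each frame index, the actions that followed that screenshot. Attaching actions to the preceding frame is a design choice: it pairs each observation with what was done next, which is the shape a policy model usually trains on.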

James4Ever0 commented 4 months ago

UFO can handle simple UIs such as Microsoft Word and Calculator, but could it handle games like Cyberpunk 2077 or complex professional software like Premiere Pro and Photoshop? I doubt it. I think that would require extensive training datasets, a complex training and evaluation regime, and advanced algorithms.