Proposal: AI Assistant API

polywock commented 4 months ago

This new powerful API will be used by AI assistant extensions. Anything the user can do, the assistant should be able to do. This will be a very permissive API, but the benefits will outweigh the risks.

Virtual eyes and ears: the assistant can see and hear your browser windows. This API will be similar to the DesktopCapture / TabCapture API, but less restrictive.
Virtual keyboard, mouse, and touch: the assistant can interact with any part of the browser window. This API might look something like pyautogui, for example cursor.moveTo(x, y).

By virtual keyboard and mouse, I don't mean controlling the user's mouse or keyboard. The assistant should have its own virtual cursor and keyboard, and be able to collaborate with the user in real time (like Google Docs).

Later features (not necessary for initial API).

The ability to interact with background tabs. This should also be paired with a log system.
The ability to install multiple AI assistants.

You can imagine the accessibility and productivity benefits such an API could bring.

bershanskiy commented 4 months ago

What you are describing is already possible using the current extension system, since extensions can run arbitrary code on pages via content scripts, can inspect DOM, can synthesize events or trigger arbitrary functions, and even capture video feed of pages to perform computer vision analysis on videos.

If you have a specific request about a particular use case not served by the current APIs, please share it.

polywock commented 4 months ago

What you are describing is already possible using the current extension system,

Some of it, but not in a way that's conducive to an AI assistant.

can inspect DOM, ..... or trigger arbitrary functions,

The point is not to parse DOM, but to see and be able to act like a human would.

can synthesize events

Most websites ignore synthesized events by checking isTrusted. In addition, synthesized events aren't equivalent to a virtual cursor, touch, and keyboard. Anything a person can do, an AI should be able to do through this proposed API.
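
For readers unfamiliar with the check mentioned above, here is a minimal sketch of a listener that drops untrusted events. It uses only the standard EventTarget and Event classes (also available as globals in recent Node.js), so it runs outside a browser. The listener logic and the `log` array are illustrative assumptions, not code taken from any particular site.

```javascript
// A handler in the style of a page that rejects script-generated input.
const log = [];
const button = new EventTarget(); // stands in for a DOM element

button.addEventListener("click", (event) => {
  // isTrusted is true only for events generated by the user agent,
  // and false for events created or dispatched by script.
  if (!event.isTrusted) {
    log.push("ignored");
    return;
  }
  log.push("handled");
});

// A script-created event: isTrusted is false, so the handler bails out.
const synthetic = new Event("click");
button.dispatchEvent(synthetic);
```

A page that guards its handlers this way never reacts to events created with new Event() or similar constructors, which is the limitation being described here.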

Theoretically, you should be able to play 2-player games with the AI assistant.

even capture video feed of pages to perform computer vision analysis on videos.

I haven't used the DesktopCapture API much, but the TabCapture API is very restricted: it requires invoking the extension for every tab. If DesktopCapture is similar, it's not conducive to a seamless AI assistant experience.

The AI should be able to navigate between tabs, and pretty much have the flexibility to do anything the user can.

bershanskiy commented 4 months ago

What you are describing is already possible using the current extension system,

Some of it, but not in a way that's conducive to an AI assistant.

Could you please be more specific? What exact issue are you facing?

can inspect DOM, ..... or trigger arbitrary functions,

The point is not to parse DOM, but to see and be able to act like a human would.

The "act like a human would" part is very abstract and sounds like you are proposing some kind of AGI. In practice, most AI is much simpler and has very well-defined inputs and outputs. The DOM tree is a great input for almost any AI.

can synthesize events

Most websites ignore synthesized events by checking isTrusted. In addition, synthesized events aren't equivalent to a virtual cursor, touch, and keyboard. Anything a person can do, an AI should be able to do through this proposed API.

Theoretically, you should be able to play 2-player games with the AI assistant.

Extensions can trivially programmatically forge the isTrusted attribute on the event. In fact, by design, it is rather difficult for a page to reliably distinguish programmatic events from truly user-initiated ones. Some sites are foolish enough to try to distinguish the two and end up with lots of false positives (resulting in inaccessible websites).

If you show me a game, I can trivially construct an API/adapter for AI to use.

even capture video feed of pages to perform computer vision analysis on videos.

I haven't used the DesktopCapture API much, but the TabCapture API is very restricted: it requires invoking the extension for every tab. If DesktopCapture is similar, it's not conducive to a seamless AI assistant experience.

The AI should be able to navigate between tabs, and pretty much have the flexibility to do anything the user can.

Extensions already can do this. If you have a specific question, I might be able to answer it.

polywock commented 4 months ago

Yes, technically you might be able to code around this, and code around that. But I'm proposing a flexible API where you wouldn't need to. The AI will figure it out; the browsers just need to provide the tools.

bershanskiy commented 4 months ago

I'm proposing a flexible API where you wouldn't need to.

Could you be more specific about the API you are proposing?

polywock commented 4 months ago

A non-restrictive DesktopCapture / TabCapture (no invoke-per-tab rule).
For DesktopCapture / TabCapture, more granular control over quality, fps, etc. E.g., it should be fairly straightforward to send a 10 fps stream to a server without doing video processing yourself.
A way to have a virtual cursor, to move it, to click anywhere in the browser, and to touch anywhere. This AI-controlled cursor will be visible to the user alongside the user's own cursor.
A way to create a virtual keyboard and press any key combination.
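
The frame-rate part of the second item can partly be expressed with standard media constraints today. Below is a minimal sketch, assuming the capture goes through the standard navigator.mediaDevices.getDisplayMedia(). The helper names buildCaptureConstraints and startLowFpsCapture and their defaults are made up for illustration, and the browser still shows its own picker prompt on each capture.

```javascript
// Hypothetical helper: build MediaStreamConstraints for a low-frame-rate capture.
function buildCaptureConstraints({ fps = 10, maxWidth = 1280 } = {}) {
  return {
    audio: false,
    video: {
      frameRate: { ideal: fps, max: fps }, // cap the stream at `fps`
      width: { max: maxWidth },            // limit resolution to save bandwidth
    },
  };
}

// Start a capture with those constraints. `mediaDevices` is passed in so the
// function can be exercised without a browser; in a page you would pass
// navigator.mediaDevices.
async function startLowFpsCapture(mediaDevices, options) {
  const constraints = buildCaptureConstraints(options);
  return mediaDevices.getDisplayMedia(constraints);
}
```

In a page this would be called as `await startLowFpsCapture(navigator.mediaDevices)`. The per-capture prompt remains, which is the restriction the first item asks to relax.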

patrickkettner commented 4 months ago

@polywock all of your suggestions are neat! None of them are actual proposals though. They are rough use cases for APIs that don’t exist and that no one has researched or developed. They are all built on a vague concept of “AI”, when none of the vendors in this group ship a browser with any sort of AI APIs today. Your suggestions, while they are neat ideas, are missing the entire fundamental building block of how they would even work. What AI? How are the models provided?

To illustrate my point, I would find it cool if there was a simple API I could use to interact with my home automation. But that isn’t a proposal. A proposal would be something like “create browser.mqtt to expose simplified IoT device pairing”, in which I would go over the suggested structure of the API, the interfaces, methods, and events.

bershanskiy commented 4 months ago

when none of the vendors in this group ship a browser with any sort of AI APIs today.

Well, there is the Web Neural Network API, which has decent polyfills and is fairly close to shipping... but it's a bit off-topic. It is a generic web API to execute (and train) ML models on device via platform-specific native ML APIs. Should be really cool once it actually ships.

The larger point that "None of them are actual proposals" is spot-on.

polywock commented 4 months ago

@patrickkettner

None of them are actual proposals though.

I'm confused about this response. A proposal is a suggestion; what makes the first one not a proposal?

A non-restrictive DesktopCapture / TabCapture (no invoke per tab rule). What more can I say about this?

What AI? How are the models provided?

The more powerful AI assistants or proprietary ones will capture the browser's screen and send that over to a server using WebRTC. Mouse and keyboard instructions will be sent back the same way.

Some weaker models like 7B LLMs will be shipped to the client and run locally.

patrickkettner commented 4 months ago

A proposal is a suggestion, what makes the first one not a suggestion?

No, not in standards parlance. A proposal would look like this - https://github.com/w3c/webextensions/blob/main/proposals/secure-storage.md

MasterKia commented 4 months ago

Extensions can trivially programmatically forge the isTrusted attribute on the event.

@bershanskiy This is not true, could you show some examples?

patrickkettner commented 4 months ago

Input events sent via the debugger API are trusted

bershanskiy commented 4 months ago

Extensions can trivially programmatically forge the isTrusted attribute on the event.

@bershanskiy This is not true, could you show some examples?

There are a few ways:

As stated above, the debugger API creates trusted events by design. As a mitigation for security/privacy concerns, it requires a debugging session (debugging sidebar or tab) and produces an inconvenient warning at the top of the page.
Without any extra warnings besides host permissions, an extension can run a script at document_start (before any other script runs) and shim every relevant call to .addEventListener so that the listener receives a synthetic object that is just similar enough to a real trusted Event to pass a specific set of checks.
Content script can transplant event listeners around and chain them in weird ways to trigger multiple event listeners by the same real trusted Event.
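
The first approach can be sketched as follows. This assumes a Chrome extension with the "debugger" permission and the promise-returning forms of chrome.debugger.attach, sendCommand, and detach, together with the DevTools Protocol method Input.dispatchMouseEvent. The helper name clickAt and the injected `debuggerApi` parameter are illustrative; in an extension you would pass chrome.debugger.

```javascript
// Sketch: click at viewport coordinates (x, y) in a tab via the debugger API.
// `debuggerApi` is expected to look like chrome.debugger (promise-based).
async function clickAt(debuggerApi, tabId, x, y) {
  const target = { tabId };
  await debuggerApi.attach(target, "1.3"); // DevTools Protocol version
  try {
    const base = { x, y, button: "left", clickCount: 1 };
    // A click is a press followed by a release at the same point.
    await debuggerApi.sendCommand(target, "Input.dispatchMouseEvent", {
      type: "mousePressed",
      ...base,
    });
    await debuggerApi.sendCommand(target, "Input.dispatchMouseEvent", {
      type: "mouseReleased",
      ...base,
    });
  } finally {
    // Always end the debugging session, even if a command failed.
    await debuggerApi.detach(target);
  }
}
```

Usage from an extension's background script would be `await clickAt(chrome.debugger, tabId, 120, 240)`. The warning banner mentioned above is shown while the session is attached.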

If you plan to publish extensions utilizing these techniques, I would recommend being upfront about why you need to use them, since these are odd things to do in a normal extension.

polywock commented 4 months ago

@bershanskiy Although nice to know, they're hardly valid substitutes for the API I'm proposing. I would also disagree that they're trivial.

To expand a bit more on my proposal: it doesn't need to offer complex keyboard/mouse capabilities like Selenium's Actions API. More basic actions should be viable. These should get us 95% of the way there.

Cursor API

browser.aiAssistant.createCursor() 
Cursor.moveBy(x, y)
Cursor.moveTo(x, y) // relative to browser window viewport 
Cursor.press(...buttons) 
Cursor.release(...buttons) 
Cursor.scroll(by)
Cursor.getState() // current coordinates, which buttons are pressed down, etc.

Keyboard API

browser.aiAssistant.createKeyboard() 
Keyboard.press(...keys)
Keyboard.release(...keys)
Keyboard.getState() // which keys are pressed down
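
To make the shape of the two lists above concrete, here is a purely illustrative in-memory model. browser.aiAssistant does not exist in any browser; the class names, the string representation of buttons and keys, and the return shape of getState() are assumptions that go beyond what is specified above. The sketch only tracks state and dispatches nothing to a page.

```javascript
// Hypothetical model of the proposed cursor: position, scroll, pressed buttons.
class VirtualCursor {
  constructor() {
    this.x = 0;
    this.y = 0;
    this.scrollY = 0;
    this.buttons = new Set();
  }
  moveTo(x, y) { this.x = x; this.y = y; }        // absolute, viewport-relative
  moveBy(dx, dy) { this.x += dx; this.y += dy; }  // relative to current position
  press(...buttons) { buttons.forEach((b) => this.buttons.add(b)); }
  release(...buttons) { buttons.forEach((b) => this.buttons.delete(b)); }
  scroll(by) { this.scrollY += by; }
  getState() {
    return { x: this.x, y: this.y, scrollY: this.scrollY, buttons: [...this.buttons] };
  }
}

// Hypothetical model of the proposed keyboard: the set of keys held down.
class VirtualKeyboard {
  constructor() { this.keys = new Set(); }
  press(...keys) { keys.forEach((k) => this.keys.add(k)); }
  release(...keys) { keys.forEach((k) => this.keys.delete(k)); }
  getState() { return { keys: [...this.keys] }; }
}

// Usage: drag from (100, 100) to (150, 120) while holding Shift.
const cursor = new VirtualCursor();
const keyboard = new VirtualKeyboard();
keyboard.press("Shift");
cursor.moveTo(100, 100);
cursor.press("left");
cursor.moveBy(50, 20);
cursor.release("left");
keyboard.release("Shift");
```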

Vision and Audio API: Pretty much just TabCapture/DesktopCapture API without restrictions, and more granular options.

polywock commented 4 months ago

@bershanskiy

Content script can transplant event listeners around and chain them in weird ways to trigger multiple event listeners by the same real trusted Event.

I tested this out, and it doesn't seem to be the case. If you dispatch a trusted event manually, its trusted status resets back to false.

dotproto commented 3 months ago

Closing this as the request isn't specific enough for us to meaningfully discuss it.

w3c / webextensions

Proposal: AI Assistant API #541