Closed: polywock closed this 3 months ago
What you are describing is already possible using the current extension system, since extensions can run arbitrary code on pages via content scripts, can inspect DOM, can synthesize events or trigger arbitrary functions, and even capture video feed of pages to perform computer vision analysis on videos.
If you have a specific request about a particular use case not served by the current APIs, please share it.
What you are describing is already possible using the current extension system,
Some of them, but not in a way that's conducive to an AI assistant.
can inspect DOM, ..... or trigger arbitrary functions,
The point is not to parse DOM, but to see and be able to act like a human would.
can synthesize events
Most websites ignore synthesized events by checking isTrusted. In addition, synthesized events aren't equivalent to a virtual cursor, touch, and keyboard. Anything a person can do, an AI should be able to do through this proposed API.
Theoretically, you should be able to play 2-player games with the AI assistant.
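The isTrusted check mentioned above is easy to see with the standard EventTarget/Event interfaces (defined by the WHATWG DOM spec and also built into Node.js): any event constructed from script reports isTrusted as false, so a page can silently drop it. A minimal illustration:

```javascript
// A page that wants to ignore synthetic input can check event.isTrusted.
// EventTarget and Event are standard (WHATWG DOM) and also exist in Node.js.
const target = new EventTarget();

target.addEventListener('click', (event) => {
  if (!event.isTrusted) {
    // Script-constructed events always report isTrusted === false,
    // so the site can distinguish them from real user input.
    console.log('ignoring synthetic click');
    return;
  }
  console.log('handling user-initiated click');
});

// An extension dispatching a synthetic click hits the branch above:
target.dispatchEvent(new Event('click'));
```

Only genuinely user-initiated events (which cannot be constructed from page script) carry isTrusted === true, which is exactly why a plain `dispatchEvent` is not a substitute for real input.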
even capture video feed of pages to perform computer vision analysis on videos.
I haven't used the DesktopCapture API much, but the TabCapture API is heavily restricted, requiring the user to invoke the extension for every tab. If DesktopCapture is similar, it's not conducive to a seamless AI assistant experience.
The AI should be able to navigate between tabs, and pretty much have the flexibility to do anything the user can.
What you are describing is already possible using the current extension system,
Some of them, but not in a way that's conducive to an AI assistant.
Could you please be more specific? What exact issue are you facing?
can inspect DOM, ..... or trigger arbitrary functions,
The point is not to parse DOM, but to see and be able to act like a human would.
The "act like a human would" part is very abstract and sounds like you are proposing some kind of AGI. In practice, most AI is much simpler and actually has very well-defined input and output. The DOM tree is a great input for almost any AI.
can synthesize events
Most websites ignore synthesized events by checking isTrusted. In addition, synthesized events aren't equivalent to a virtual cursor, touch, and keyboard. Anything a person can do, an AI should be able to do through this proposed API.
Theoretically, you should be able to play 2-player games with the AI assistant.
Extensions can trivially programmatically forge the isTrusted attribute on the event. In fact, by design, it is rather difficult for a page to reliably distinguish programmatic events from truly user-initiated ones. Some sites are actually foolish enough to try to distinguish the two and end up with lots of false positives (resulting in inaccessible websites).
If you show me a game, I can trivially construct an API/adapter for AI to use.
even capture video feed of pages to perform computer vision analysis on videos.
I haven't used the DesktopCapture API much, but the TabCapture API is heavily restricted, requiring the user to invoke the extension for every tab. If DesktopCapture is similar, it's not conducive to a seamless AI assistant experience.
The AI should be able to navigate between tabs, and pretty much have the flexibility to do anything the user can.
Extensions already can do this. If you have a specific question, I might be able to answer it.
Yes, technically you might be able to code around this, and code around that. But, I'm proposing a flexible API where you wouldn't need to. The AI will figure it out, the browsers just need to provide the tools.
I'm proposing a flexible API where you wouldn't need to.
Could you be more specific about the API you are proposing?
@polywock all of your suggestions are neat! None of them are actual proposals though. They are rough use cases for APIs that don't exist and that no one has researched or developed. They are all built on a vague concept of "AI", when none of the vendors in this group ship a browser with any sort of AI APIs today. Your suggestions, while they are neat ideas, are missing the entire fundamental building block of how they would even work. What AI? How are the models provided?
To illustrate my point, I would find it cool if there was a simple API I could use to interact with my home automation. But that isn’t a proposal. A proposal would be something like “create browser.mqtt to expose simplified Iot device pairing”, in which I would go over the suggested structure of the API, the interfaces, methods, and events.
when none of the vendors in this group ship a browser with any sort of AI APIs today.
Well, there is the Web Neural Network API, which has decent polyfills and is fairly close to shipping... but it's a bit off-topic. It is a generic web API to execute (and train) ML models on-device via platform-specific native ML APIs. Should be really cool once it actually ships.
The larger point that "None of them are actual proposals" is spot-on.
@patrickkettner
None of them are actual proposals though.
I'm confused about this response. A proposal is a suggestion, what makes the first one not a proposal?
- A non-restrictive DesktopCapture / TabCapture (no invoke per tab rule). What more can I say about this?
What AI? How are the models provided?
The more powerful AI assistants or proprietary ones will capture the browser's screen and send that over to a server using WebRTC. Sending back mouse, keyboard instructions the same way.
Some weaker models like 7B LLMs will be shipped to the client and run locally.
A proposal is a suggestion, what makes the first one not a suggestion?
No, not in standards parlance. A proposal would look like this - https://github.com/w3c/webextensions/blob/main/proposals/secure-storage.md
Extensions can trivially programmatically forge the isTrusted attribute on the event.
@bershanskiy This is not true, could you show some examples?
Input events sent via the debugger API are trusted
Extensions can trivially programmatically forge the isTrusted attribute on the event.
@bershanskiy This is not true, could you show some examples?
There are a few ways:

- A content script can run at document_start (before any other script is run) and shim every relevant call to .addEventListener to construct a synthetic object that is just similar enough to a real trusted Event to pass a specific set of checks.
- A content script can transplant event listeners around and chain them in weird ways to trigger multiple event listeners by the same real trusted Event.

If you plan to publish extensions utilizing these techniques, I would recommend being upfront about why you need to use them, since these are odd things to do in a normal extension.
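The shim technique described above can be sketched as follows. In a real extension this would be a content script running at document_start, before any page script; the Proxy-based spoof here is deliberately simplified (it assumes function listeners and ignores edge cases like handleEvent objects or method brand checks through the proxy):

```javascript
// Sketch: wrap EventTarget.prototype.addEventListener so that every
// registered listener receives a proxy whose isTrusted always reads true,
// regardless of whether the underlying event was user-initiated.
const realAddEventListener = EventTarget.prototype.addEventListener;

EventTarget.prototype.addEventListener = function (type, listener, options) {
  const wrapped = function (event) {
    // Hand the listener a proxy that reports isTrusted as true but
    // forwards every other property read to the real event:
    const spoofed = new Proxy(event, {
      get: (target, prop) =>
        prop === 'isTrusted' ? true : Reflect.get(target, prop),
    });
    return listener.call(this, spoofed);
  };
  return realAddEventListener.call(this, type, wrapped, options);
};
```

With the shim installed, a synthetic event dispatched via dispatchEvent still has isTrusted === false on the real event object, but any listener registered afterwards observes true, which is enough to defeat naive isTrusted checks.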
@bershanskiy Although nice to know, they're hardly valid substitutes for the API I'm proposing. I would also disagree that they're trivial.
To expand a bit more on my proposal: it doesn't need to offer complex keyboard/mouse capabilities like Selenium's Actions API. More basic actions should be viable, and should get us 95% of the way there.
Cursor API
browser.aiAssistant.createCursor()
Cursor.moveBy(x, y)
Cursor.moveTo(x, y) // relative to browser window viewport
Cursor.press(...buttons)
Cursor.release(...buttons)
Cursor.scroll(by)
Cursor.getState() // current coordinates, which buttons are pressed down, etc.
Keyboard API
browser.aiAssistant.createKeyboard()
Keyboard.press(...keys)
Keyboard.release(...keys)
Keyboard.getState() // which keys are pressed down
Vision and Audio API: pretty much just the TabCapture/DesktopCapture API without restrictions, and with more granular options.
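To make the proposed semantics concrete, here is a minimal mock of the cursor portion of the sketch above. None of this exists in any browser: browser.aiAssistant and every method name are hypothetical, and the mock only tracks virtual-cursor state rather than injecting real input into pages:

```javascript
// Mock of the *proposed* browser.aiAssistant cursor API (hypothetical).
// Tracks position, held buttons, and scroll offset; a real implementation
// would render the cursor and deliver trusted input to the page.
class VirtualCursor {
  constructor() {
    this.x = 0;
    this.y = 0;
    this.scrollY = 0;           // accumulated scroll, separate from position
    this.pressed = new Set();   // buttons currently held down
  }
  moveTo(x, y) {                // coordinates relative to the viewport
    this.x = x;
    this.y = y;
  }
  moveBy(dx, dy) {
    this.x += dx;
    this.y += dy;
  }
  press(...buttons) {
    for (const b of buttons) this.pressed.add(b);
  }
  release(...buttons) {
    for (const b of buttons) this.pressed.delete(b);
  }
  scroll(by) {
    this.scrollY += by;
  }
  getState() {                  // current coordinates and held buttons
    return { x: this.x, y: this.y, buttons: [...this.pressed] };
  }
}

// Hypothetical namespace mirroring the proposal:
const aiAssistant = { createCursor: () => new VirtualCursor() };

const cursor = aiAssistant.createCursor();
cursor.moveTo(100, 50);
cursor.moveBy(20, -10);
cursor.press('left');
console.log(cursor.getState()); // x: 120, y: 40, buttons: ['left']
```

A keyboard mock would follow the same press/release/getState shape; the interesting open questions (how virtual input becomes trusted events, how capture is permissioned per tab) are exactly what a full proposal document would need to specify.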
@bershanskiy
Content script can transplant event listeners around and chain them in weird ways to trigger multiple event listeners by the same real trusted Event.
I tested this out, and it doesn't seem to be the case. If you dispatch a trusted event manually, its trusted status resets back to false.
Closing this as the request isn't specific enough for us to meaningfully discuss it.
This new powerful API will be used by AI assistant extensions. Anything the user can do, the assistant should be able to do. This will be a very permissive API, but the benefits will outweigh the risks.
The virtual input API could work similarly to pyautogui. For example, cursor.moveTo(x, y).
By virtual keyboard and mouse, I don't mean controlling the user's mouse/keyboard; the assistant should have its own virtual cursor and keyboard, and be able to collaborate with the user in real time (like Google Docs).
Later features (not necessary for initial API).
You can imagine the accessibility and productivity benefits such an API could bring.