Open jgraham opened 2 years ago
More detail on `clauses`: characters in the IME data can either be user input or can be some kind of IME output (i.e. a converted character). These are typically displayed differently in the editor. So APIs can either provide the semantic information about which kind of text it is, and use that to set the style, or can just provide styling information for each character. In either case the user uses the displayed text style to interpret the composition string.
The documentation for the win32 API has some details here: https://docs.microsoft.com/en-us/windows/win32/intl/composition-string and the proposed values here largely match those in that API.
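To make the clause structure from the proposal more concrete, here is a minimal TypeScript sketch. The type names and the `clausesCoverData` helper are hypothetical and not part of any spec. It also assumes that clause lengths are counted in UTF-16 code units and that a zero-length clause is allowed; the proposal does not settle either point.

```typescript
// Clause types suggested in the proposal; their semantics are still a TODO there.
type ClauseType =
  | "caret"
  | "rawInput"
  | "converted"
  | "notConverted"
  | "targetConverted";

interface Clause {
  length: number;
  type: ClauseType;
  // Optional formatting hints named in the proposal.
  // Representing them as strings is an assumption.
  underlineColor?: string;
  underlineStyle?: string;
  backgroundColor?: string;
  textColor?: string;
}

// Hypothetical helper: checks the proposal's rule that clause lengths
// must add up to the total length of the composition string.
// Assumption: lengths are measured in UTF-16 code units (data.length).
function clausesCoverData(data: string, clauses: Clause[]): boolean {
  const lengthsValid = clauses.every(
    (c) => Number.isInteger(c.length) && c.length >= 0
  );
  const total = clauses.reduce((sum, c) => sum + c.length, 0);
  return lengthsValid && total === data.length;
}
```

For example, a composition string "abc" split into a one-character `rawInput` clause and a two-character `converted` clause passes the check, while a single one-character clause does not.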
The Browser Testing and Tools Working Group just discussed Actions IME support.
A list of the test scenarios this proposal is trying to address would help in understanding the scope. IME testing can be a very large area, but knowing the minimum viable set of scenarios would make it easier to evaluate the effort required to implement this.
Not directly addressing @karlcow's request yet, but some additional context:
This proposal is intended to work at the level of providing a "virtual IME" whose state is entirely under the control of the test author. Therefore the intent is roughly that if we imagine a general data flow of the form physical input device → OS input handling → IME → browser application, this replaces everything to the left of "browser application" with "virtual IME". So the maximum amount of flexibility we could aim for is to be able to replicate any possible sequence of IME messages/events that the browser could get from the operating system. In practice of course this API is not OS-specific, somewhat higher level, and so it's not reasonable to expect to be able to simulate every possible case in the browser IME handling. But one way to judge whether we're likely to meet the testing requirements for webapps is to verify whether there are any codepaths on the browser side that are commonly triggered by real IMEs but could not be triggered by this API.
What is definitively out of scope is being able to invoke specific real IMEs, or simulate input at the OS/hardware level. Although those things do have significant advantages in some cases, they require a very different approach from the current WebDriver virtual input handling.
WebDriver is currently unable to simulate the action of an IME in user input.
IMEs are widely used, particularly when inputting scripts where there are far more available characters than keys on a keyboard. That means it's impossible to use WebDriver to adequately test how web applications behave when inputting these scripts. The behaviour in the face of IME input is also an interoperability problem for web browsers, and fixing this is seen as a high priority area for the web.
Conceptually an IME sits between the physical input layer and the application. Typically the IME is activated with some device input, e.g. pressing a key on the keyboard. Once triggered the IME generates a candidate composed string that may be updated based on further input, and is at some point committed. During this time, the composed string is typically displayed in the application, but styled in a way that makes it distinct from the final input. There is also typically IME-specific UI to suggest different possible completions, but this is quite platform-specific and will be considered out of scope.
In WebDriver the low-level input handling is done through the actions API. This models user input as a set of virtual input devices, which each have an internal state. At each point in time ("tick") an input device can either do nothing ("pause"), or can have an associated action that updates its internal state and causes the relevant events to be emitted to content (e.g. a `keyDown` action on a `key` input device will update the internal WebDriver state to signify that the key is depressed, and emit a `keydown` event to content).

For a given input on a given device the IME can do nothing (i.e. just let the event pass through) or can intercept the event, update its internal state, and cause different events to be emitted instead. For example, consider pressing the "a" key. In the absence of an IME this will cause a `keydown` event with `keyCode` `65`, a `keypress` event, possibly various `input` events, and finally a `keyup` event also with `keyCode` `65`. However if the IME is activated, we get a `keydown` event with `keyCode` `229`, a `compositionstart` event, a `compositionupdate` event with `data` corresponding to the current IME input selection, `input` events, and finally a `keyup` event with `keyCode` `65`. Note in this example that the content never sees a `keydown` event with `keyCode` `65`: the fact that the IME intercepted the event changes the key events visible on the page.

Later an input (or something not visible to the web page) may cause the composition to be committed, which corresponds to a `compositionend` event.

IMEs can apply to non-key input, e.g. handwriting recognition is a form of IME that depends on pointer input. It may also depend on multiple kinds of input.
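As a rough illustration of the key event sequences described above for pressing "a", the following TypeScript sketch lists what content would observe with and without an active IME. The `ObservedEvent` type and `expectedEventsForKeyA` function are hypothetical names made up for this sketch, and the single `input` entry stands in for the "possibly various" `input` events; real browsers may differ in detail.

```typescript
// Simplified view of a DOM event as seen by content.
interface ObservedEvent {
  type: string;
  keyCode?: number; // only set for the key events where the text gives a value
  data?: string;    // only set for composition events
}

// Hypothetical helper, not part of WebDriver or any browser API.
// Returns the sequence described in the text for pressing the "a" key.
function expectedEventsForKeyA(
  imeActive: boolean,
  compositionData: string = "a"
): ObservedEvent[] {
  if (!imeActive) {
    // No IME: the key events carry the real keyCode (65).
    return [
      { type: "keydown", keyCode: 65 },
      { type: "keypress" },
      { type: "input" },
      { type: "keyup", keyCode: 65 },
    ];
  }
  // IME active: keydown is reported with keyCode 229 and
  // composition events replace the keypress.
  return [
    { type: "keydown", keyCode: 229 },
    { type: "compositionstart" },
    { type: "compositionupdate", data: compositionData },
    { type: "input" },
    { type: "keyup", keyCode: 65 },
  ];
}
```

A test harness could compare a recorded event log against one of these lists to check which path the browser took.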
In terms of the implementation inside the WebDriver spec, the obvious thing would be to add IME as a new kind of input source for actions. However, the fact that it's a layer between the "physical" input devices and the application makes this more complex; to handle cases like "key is pressed and intercepted by IME, other events happen, key is released" we need to a) specify which other inputs in a given tick are being intercepted by the IME and b) handle the IME-generated state changes after all other inputs (maybe even right at the end of the tick: for something like `pointerMove` which can be spread out into multiple events over time it's not clear how things should work).

So a possible proposal is as follows:
We add a new input source type `ime`. That has internal state which is the current composition string.

The `ime` input source has two associated actions: `compositionUpdate` and `compositionEnd`.

`compositionUpdate` is the main action for updating the composed string. It has the following properties:

* `data` - A string containing the updated value of the composition string. If this is null (or the empty string?) we end the composition.
* `clauses` (optional) - These represent sub-parts of the composition string. Each clause has a `length` and a `type`. The lengths must add up to the total length of `data`. Suggested values of `type` are "`caret`", "`rawInput`", "`converted`", "`notConverted`", "`targetConverted`" (TODO: clarify the semantics of these). In addition formatting hints may be specified according to how the IME would like the range to be handled. These are `underlineColor`, `underlineStyle`, `backgroundColor`, `textColor`. If `clauses` is omitted it's assumed that there's a single clause (TODO: details).
* `handles` (optional) - The input source id for the input source that caused this change in the IME state. If this is provided the internal state of the referenced input source is updated, but the DOM events emitted are those appropriate to the IME instead (e.g. for a keyboard the `keyCode` property becomes `229`). If this property is omitted the update to the IME state is not connected to any application-visible input source change (this corresponds to the situation where e.g. the user clicks on a composition string option in a window outside their browser window).

A `compositionEnd` action causes the composed string to be emitted. It has the following properties:

* `data` (optional) - The final composed string to insert. If omitted this is given by the `data` property of the previous `compositionUpdate` action.
* (optional) - As for `compositionUpdate`, if committing the composition happens in response to a content-visible input action, this is a reference to the device id for that action.

An example of what it looks like on the wire when we press "a" on the keyboard, the IME generates a composed string "abc", that string gets updated to "ABC" by something outside the browser, and it's committed with the space key: