thefrontside / interactors

Composable page objects for components
https://frontside.com/interactors
MIT License

Keyboard emulation for interactors #12

Closed jnicklas closed 2 years ago

jnicklas commented 3 years ago

I am spinning this off from thefrontside/bigtest#818 as its own issue, since while it is related, this feature also stands on its own.

When writing tests for keyboard interactions, we are often forced by test frameworks to write them in an unnatural way. The keyboard is a physical device which the user can use at any point in time. Pressing keys on the keyboard can have various effects on the document, depending on which element currently has focus, which keys are pressed, and which event handlers are attached.

Under the hood, keyboard events such as keydown and keyup are fired for the currently focused element. This is the level that testing libraries usually attach themselves to, but it is worth considering that this is not how a user sees it. Sometimes the user will deliberately move focus to a certain element before pressing a key; other times, they will be unaware of where focus currently resides.
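
For illustration, "firing a key at the focused element" boils down to dispatching synthetic KeyboardEvents; a minimal sketch, not tied to any particular library:

// Dispatch a keydown/keyup pair at whatever currently has focus.
const target = document.activeElement ?? document.body;
target.dispatchEvent(new KeyboardEvent('keydown', { key: 'a', bubbles: true }));
target.dispatchEvent(new KeyboardEvent('keyup', { key: 'a', bubbles: true }));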

Since the keyboard is a distinct physical device, and that is how the user sees it, and since in BigTest we are trying to model the user's behaviour, why not model the keyboard as such?

How would this work?

The keyboard becomes a global interactor, which can be used both on its own and as part of composed actions. My preferred design is to make keyboard interactions part of the Page interactor. It already exists and models the document in its entirety.

Page.pressKey('a', { meta: true });

This would send the key a to the currently focused element.
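
A sketch of what the proposed pressKey could do under the hood, wrapping the raw event dispatch shown earlier (a hypothetical implementation, not an existing API):

async function pressKey(key, { meta = false } = {}) {
  const target = document.activeElement ?? document.body;
  const init = { key, metaKey: meta, bubbles: true };
  // Surround the keystroke with Meta down/up when the modifier is requested.
  if (meta) target.dispatchEvent(new KeyboardEvent('keydown', { key: 'Meta', metaKey: true, bubbles: true }));
  target.dispatchEvent(new KeyboardEvent('keydown', init));
  target.dispatchEvent(new KeyboardEvent('keyup', init));
  if (meta) target.dispatchEvent(new KeyboardEvent('keyup', { key: 'Meta', bubbles: true }));
}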

We could add a helper for typing longer strings of text, so we don't have to send each individual keystroke:

Page.typeText('Hello world');

This helper could have a simple DSL for including control characters:

Page.typeText('Hello world<enter>');
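
A minimal sketch of how such a typeText helper could handle the <enter> style tokens, assuming a pressKey primitive like the one sketched above (both names are part of the proposal, not an existing API):

async function typeText(text) {
  // Split into control tokens like '<enter>' and plain single characters.
  const tokens = text.match(/<[^>]+>|[\s\S]/g) ?? [];
  for (const token of tokens) {
    const key = token.startsWith('<')
      ? token.slice(1, -1).replace(/^./, (c) => c.toUpperCase()) // '<enter>' -> 'Enter'
      : token;
    await pressKey(key);
  }
}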

We could even have a low level API for holding and releasing keys for even more complex interactions:

Page.holdKey('meta');
Page.holdKey('a');
Page.releaseKey('meta');
Page.releaseKey('a');
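
A sketch of how holdKey/releaseKey might track which keys are down, so that every dispatched event carries the right modifier flags (hypothetical implementation, dispatching directly to the focused element):

const held = new Set();

function modifierFlags() {
  return {
    metaKey: held.has('meta'),
    shiftKey: held.has('shift'),
    ctrlKey: held.has('ctrl'),
    altKey: held.has('alt'),
  };
}

function holdKey(key) {
  // Note: a real implementation would normalize names like 'meta' -> 'Meta'.
  held.add(key);
  const target = document.activeElement ?? document.body;
  target.dispatchEvent(new KeyboardEvent('keydown', { key, ...modifierFlags(), bubbles: true }));
}

function releaseKey(key) {
  held.delete(key);
  const target = document.activeElement ?? document.body;
  target.dispatchEvent(new KeyboardEvent('keyup', { key, ...modifierFlags(), bubbles: true }));
}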

Keyboard layouts

A future extension to this feature could be support for various keyboard layouts. A keyboard layout is really a mapping of physical keys to expected characters, and vice versa. From a user's perspective, they are normally hitting a key with a specific character on it, but there may be rare occasions where they want to hit a key in a specific location, rather than one with an expected value.

For example, imagine a game where the WASD keys are used to control movement. Using a qwerty keyboard layout, this is natural and convenient, but on a dvorak layout it would be massively inconvenient. On dvorak we would want to use the keys labeled <AOE for this purpose, since they sit in the same physical locations as the WASD keys do on qwerty. So in this case, the key location is more important than the key value. It is worth noting that this is a rare case though, and while we should probably support it, it is not the case we should optimize for.
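
To make the distinction concrete, a layout can be modeled as a map from physical key codes to the characters they produce; finding a character's physical location is then the reverse lookup. A sketch covering just the four relevant keys:

// Physical code -> produced character, per layout.
const qwerty = { KeyW: 'w', KeyA: 'a', KeyS: 's', KeyD: 'd' };
const dvorak = { KeyW: ',', KeyA: 'a', KeyS: 'o', KeyD: 'e' };

// Reverse lookup: which physical key produces this character?
function codeFor(layout, character) {
  return Object.keys(layout).find((code) => layout[code] === character);
}

codeFor(dvorak, ','); // 'KeyW' -- the key in qwerty's W position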

taras commented 3 years ago

The reasoning sounds solid. Do you prefer Page interactor to Keyboard interactor? If so, why?

jnicklas commented 3 years ago

Do you prefer Page interactor to Keyboard interactor? If so, why?

I think both are fine. My main reason for preferring to attach this to Page is that it already exists and we wouldn't need to add any additional components, which makes it feel a bit more compact to me, but I don't have a strong opinion on this.

taras commented 3 years ago

Keyboard Interactor reinforces the idea that "the keyboard is a distinct physical device, and that is how the user sees it."

wKich commented 3 years ago

I'd like the 'sendKeys' API from Selenium: https://www.selenium.dev/documentation/en/webdriver/keyboard/

jnicklas commented 3 years ago

@wKich do you prefer it for compatibility reasons, or for some other reason?

I have to admit to not being a huge fan of control characters as an API, since it requires some higher level API to make it readable, but I could see it as an advantage to copy an already existing API.

wKich commented 3 years ago

The reason is that you don't need to write a DSL to enter text with control characters.

But 'Page.pressKey('a', { meta: true });' looks pretty good.

dagda1 commented 3 years ago

Cypress (and others) use the {special chars} syntax

cy.get('input').type('{shift+alt+b}hello')

I wonder if there is some common package to parse them.
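
Whether or not a shared package exists, the parsing itself is small; a sketch of tokenizing that syntax:

function tokenize(input) {
  // '{shift+alt+b}hello' -> one combo token followed by literal characters.
  return (input.match(/\{[^}]+\}|[\s\S]/g) ?? []).map((token) =>
    token.startsWith('{')
      ? { combo: token.slice(1, -1).split('+') }
      : { char: token }
  );
}

tokenize('{shift+alt+b}hello');
// [{ combo: ['shift', 'alt', 'b'] }, { char: 'h' }, { char: 'e' }, ...]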

wKich commented 3 years ago

The issue starts if you need to type precisely '{ctrl+a}' or something similar. So the user needs to learn yet another syntax and how/when to escape such characters/words. And you will not get highlighting in your favorite IDE. Using special syntax instead of a well-designed API is a very complicated solution.

cowboyd commented 3 years ago

I think we should have both a low-level API for sending individual keystrokes, and also a gesture language, because there are certain things that can only be accomplished by relating keystrokes together, such as "type this string with a delay of 500ms between each keystroke".

I've actually done some experimentation with a template language that is different from the Cypress and Testing Library syntaxes, in that it is a fully complete toolkit for building expression languages, geared towards text literals:

https://github.com/cowboyd/keyster

It is a very rough sketch, but one way to think of it is basically as a Lisp-like syntax that allows for both whitespace AND nullspace tokenization, where each character is actually an interpreted node in a syntax tree. This allows us to make arbitrarily nested and complex gestures.

{ctrl {shift a}} ;; ctrl-down, shift-down, a-down, a-up, shift-up, ctrl-up

In my sketch, {} expressions are whitespace tokenized, and @{} expressions are nullspace tokenized. In other words {hi mom} tokenizes as hi, mom, whereas @{hi mom} tokenizes as h, i, (space), m, o, m.

This lets us "scope" keystrokes by saying which keystrokes are their children. E.g. in {ctrl a}{shift @{hi mom}}, the "scope" of the ctrl is just the a, but the scope of the shift is the entire key sequence for "hi mom".

This seems like a crazy amount of complexity surely, right? But it gives us some super-powers in that we can now embed functions in our language that let us add scoped, contextual behavior such as typing delays. Here's the example from the top:

{withDelay 500 @{it was the best of times, it was the worst of times}}

We could use it to switch keymaps:

{dvorak @{it was the best of times}}{qwerty @{it was the worst of times}}

By default, of course, we could wrap @{} around a list to optimize for the 95/5 case where we just want to type a simple string: hello world as the root would actually be implicitly @{hello world}.

We can even mix devices such as keyboard and mouse if we wanted:

cowboyd{tab}password{click "submit"}

I think the strongest reason to do this, though, is that because what is ultimately parsed is an AST, we can interpret that AST on multiple platforms. So if we make this once, then we can use the same gesture expression language on Web, iOS, Android, or whatever.

Anyway, this is a long way of saying that we should have both the capability to raise low-level events, and that we can also really leap-frog the state of the art of convenient typing syntax in a way that will also work for non-JavaScript platforms.
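
The scoping semantics can be shown without any concrete syntax at all; a sketch that interprets a nested gesture tree into the down/up sequence from the {ctrl {shift a}} example (the data shape is hypothetical):

function interpret(node, emit) {
  if (typeof node === 'string') {
    // A leaf key: press and release.
    emit(`${node}-down`);
    emit(`${node}-up`);
  } else {
    // A scope: hold the modifier around all of its children.
    emit(`${node.hold}-down`);
    for (const child of node.children) interpret(child, emit);
    emit(`${node.hold}-up`);
  }
}

const events = [];
interpret({ hold: 'ctrl', children: [{ hold: 'shift', children: ['a'] }] }, (e) => events.push(e));
// events: ['ctrl-down', 'shift-down', 'a-down', 'a-up', 'shift-up', 'ctrl-up']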

wKich commented 3 years ago

For the case {withDelay 500 @{it was the best of times, it was the worst of times}}, I would like to use Page.typeText(withDelay(500, 'Hello world')); or Page.typeText('Hello world', withDelay(500));, because it's a more natural way to write code. That way, the user has syntax highlighting and autocompletion out of the box and doesn't need to learn a new language.
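
For completeness, the function-based variant is tiny; a sketch where withDelay just builds an options value (names taken from the comment above, hypothetical):

function withDelay(ms, text) {
  return { text, delay: ms };
}

// Page.typeText(withDelay(500, 'Hello world'));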

cowboyd{tab}password{click "submit"} is very complicated, because it adds another way to interact with DOM elements alongside Interactor('submit').click(). What's more, in your example, what is submit? Is it a Button or a Link? How should we find the appropriate element?

cowboyd commented 3 years ago

Syntax aside, I think the main thing I'd like to accomplish long-term is the representation of user gestures and gesture sequences as data, or more specifically hierarchical data that can represent scope. I should have led with this idea instead of a concrete syntax.

If we have this baked in, or at least as a goal from the start, then the syntax that we use to generate that data can be swappable, but we get the power of having something that will work not just on browsers, but on iOS, Android, OSX, Windows, and GTK, all of which we hope to get to.
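
Since the point is gestures-as-data rather than any particular syntax, the swappable part is just the front end that produces a serializable tree; a sketch of what such platform-neutral data might look like (hypothetical shape):

// Hierarchical, scope-preserving, and serializable, so a web, iOS, or
// Android driver could each interpret the same tree.
const gesture = {
  type: 'scope',
  hold: 'shift',
  children: [
    { type: 'key', key: 'h' },
    { type: 'key', key: 'i' },
  ],
};

JSON.stringify(gesture); // crosses platform and process boundaries as plain data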

taras commented 3 years ago

Because it's a more natural way to write code. In my case, a user has syntax highlighting and autocompletion out of the box, he doesn't need to learn a new language.

We will always have users with a variety of technical proficiencies. We want the organization or group to be able to control their requirements. For one team, this might mean using TypeScript; for another, it could mean using something like Gherkin. Over time, the same team might start with TypeScript and move to Gherkin. We have to design our software in a way that will allow teams to move up and down the "no-code to code spectrum" without changing all of their tools.

{withDelay 500 @{it was the best of times, it was the worst of times}} and Page.typeText(withDelay(500, 'Hello world')) are not mutually exclusive. This is the beauty of Lisp syntax: it gives you a way to express functional composition in a consistent and parsable manner.

We have some experience with Lisp interpreters. We did a spike of FlutterScript, which is a Lisp interpreter for Flutter. Ember.js's HTMLBars sub-expression syntax is likewise a Lisp interpreter.

Like @cowboyd said, we don't have to have it on day one, but we should keep it in mind so it remains achievable in the future.

wKich commented 3 years ago

So you mean we should support different variations of the API? That makes sense. And yeah, Lisp is an excellent language :)

jnicklas commented 3 years ago

All of the keyboard interactor functions compose via async functions:

someInteractor.actions({
  async someComplexKeyboardInteraction() {
    await Keyboard.holdKey('shift');
    await sleep(10);
    await Keyboard.holdKey('meta');
    await Keyboard.pressKey('a');
    await sleep(10);
    await Keyboard.releaseKey('meta');
    await Keyboard.releaseKey('shift');
  }
});

While this looks fairly long-winded, it's worth considering that this is an absolute edge case! The vast majority of keyboard interactions will not need to look like this, because most likely the user just wants to enter some text. This actually makes the case that even a DSL like the one I suggested is unnecessary. There is a case to be made against over-engineering this.

I can see that there would be a point to introducing some delay between key presses, but it will be exceedingly rare for that delay to need to differ between keys in the same text string, so something like this will work just fine:

await Keyboard.type("hello world", { delay: 50 });
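
A sketch of what that single-option delay could look like, reusing a hypothetical pressKey primitive like the one sketched earlier:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function type(text, { delay = 0 } = {}) {
  for (const key of text) {
    await pressKey(key); // hypothetical single-keystroke primitive
    if (delay > 0) await sleep(delay);
  }
}
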
jnicklas commented 3 years ago

TIL there is an experimental web API for retrieving the current keyboard layout. Unfortunately, only Chrome and Edge currently implement it: https://developer.mozilla.org/en-US/docs/Web/API/KeyboardLayoutMap
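
Usage is straightforward where the API is available, e.g. looking up what character the physical W-position key produces under the user's current layout:

if (navigator.keyboard?.getLayoutMap) {
  navigator.keyboard.getLayoutMap().then((layoutMap) => {
    console.log(layoutMap.get('KeyW')); // 'w' on qwerty, ',' on dvorak
  });
}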

jenweber commented 3 years ago

I have one thought experiment to add here - imagine that the same API is being used to test a desktop app instead of a browser, like Photoshop’s keyboard controls. Would you want the semantics or experience to change? This framing could help with any decisions that need to be made one way or another.

cowboyd commented 3 years ago

I have one thought experiment to add here - imagine that the same API is being used to test a desktop app instead of a browser, like Photoshop’s keyboard controls. Would you want the semantics or experience to change? This framing could help with any decisions that need to be made one way or another.

In this sense it feels like the API should be centered around the actual physical hardware. In other words, since I use the same keyboard to interact with web apps and desktop apps on my laptop, the interactor API should be similar. On the other hand, I use a software keyboard on my phone and tablet, and so the API should have different characteristics.

For example, it's very hard to imagine doing even a simple keystroke like Ctrl-C on a phone, whereas that's the simplest one you'll find on a physical keyboard, and they only get more complex from there.