Open kermitfrog opened 6 months ago
Reserved
This seems so simple and intuitive as a solution. As you mentioned, the only immediate concern is that it might require long chains of virtual devices. Honestly I'd say that any inherent performance issue with that is probably unintentional.
If I were a kernel dev and someone asked me if there were performance issues, I'd probably say, "I don't know, are there performance issues?" :laughing: I guess we'll just have to try it and find out... but opening a simple, static chain of uinput devices and passing events between them should hopefully be fairly simple. I'm actually thinking that Interception might be able to do it out of the box?
I had a look at interception. It seems to do some of those things, but not everything. The biggest difference is that interception starts processes and pipes them together, which has some limitations/problems, e.g.:
But their udevmon code might prove valuable for understanding the udev APIs :) -- I don't find the official docs very helpful.
I had a look at interception. It seems to do some of those things, but not everything.
Sorry, what I meant was that Interception might be useful to test the effect of having many uinput devices open... as in, maybe Interception can help to answer your question:
May create a lot of virtual input devices. Is this bad for performance?
I agree, it would be too limited to reach your intended goal.
I suppose that the biggest problem that needs to be solved is indeed making several input mappers use each other's output, in a way that does not require the end user to manually configure input and output devices for every single mapper they use.
A daemon which a mapper could ask "I want to map keyboard devices to keyboard devices. Give me the input and output devices I should use." would indeed solve that problem. Without any configuration on the user's side, the daemon could ensure that each mapper gets put on a single deterministic part of the chain, and if the user doesn't like the order the daemon automatically chooses, they can reorder it easily in a single GUI written for the UIO daemon, without having to reconfigure each mapper manually.
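To make that concrete, a request/response exchange with such a daemon could look roughly like the sketch below. Everything here is invented for illustration - the socket path, the JSON wire format, and the `request_devices` helper - since no such daemon exists yet:

```python
import json
import socket

UIO_SOCKET = "/run/uio/uio.sock"  # hypothetical path; no such daemon exists yet

def request_devices(kind: str, socket_path: str = UIO_SOCKET):
    """Ask the (hypothetical) UIO daemon for the devices this mapper should use."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        # Invented wire format: one JSON object per line, in both directions.
        request = {"action": "acquire", "input": kind, "output": kind}
        sock.sendall(json.dumps(request).encode() + b"\n")
        reply = json.loads(sock.makefile().readline())
        # e.g. {"input": "/dev/input/event17", "output": "..."}
        return reply["input"], reply["output"]
```

A mapper would call `request_devices("keyboard")` once at startup and then simply open whatever paths the daemon hands back, leaving the chain ordering entirely to the daemon.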
That shifts the task from convincing the Wayland crew to adopt a new protocol to:
Which may potentially be easier, but it really depends on how willing the majority of the input mapper developers are to go along with it.
Does not require big changes to existing mappers (I hope).
I do have several thoughts regarding whether it is possible to create a sufficiently transparent wrapper like UIOInput that does not require big changes to existing mappers, but no coherent conclusion regarding that yet.
Currently my biggest worry is how this is going to affect the event loop: on a low level, mappers would now need to maintain an open communication channel with the UIOInput daemon (whether over D-Bus or a Unix socket) and may occasionally need to change which event devices they have open, and thus change which file descriptors they poll/epoll. I think that abstracting that away would significantly decrease performance, requiring the high-performance oriented mappers to do some nontrivial plumbing around their event loop. But I'm not wholly sure of that yet. There are many options to consider here.
how to handle crashing mappers? can they kill UIO?
I think that UIO should be designed such that a crashing mapper cannot crash UIO.
I think it would be acceptable for a crashing mapper to crash UIO if mappers were written as shared objects (.so) that are dynamically loaded into UIO's memory space, kind of like a kernel module getting loaded into the kernel. That would greatly increase performance at the cost of making mappers harder to write and allowing one of them to bring down the whole house of cards.
As long as we do not make the tradeoff of allowing mappers to enter UIO's memory space, crashing mappers should not crash UIO.
May create a lot of virtual input devices. Is this bad for performance?
I've written a small benchmark with python-evdev to check how fast my program evsieve can grab and mirror an input device 750 times:
```python
#!/usr/bin/env python3
import asyncio
import evdev
import evdev.ecodes as e
import os
import subprocess as sp
import time

ALPHABET = list("abcdefghijklmnopqrstuvwxyz")
NUM_KEYS_TO_SEND = 200
TIME_BETWEEN_KEYS = 0.1

# Create a device that we will send events into.
capabilities = {
    e.EV_KEY: [
        e.ecodes["KEY_" + key.upper()]
        for key in ALPHABET
    ]
}
input_device = evdev.UInput(capabilities, name="virtual-keyboard")

INPUT_DEVICE_SYMLINK = "/dev/input/by-id/benchmark-0"
if os.path.islink(INPUT_DEVICE_SYMLINK):
    os.unlink(INPUT_DEVICE_SYMLINK)
sp.run(["ln", "-s", "--", input_device.device.path, INPUT_DEVICE_SYMLINK])

# Creates one layer that clones the previous layer's input device.
def create_layer(index: int):
    input_path = f"/dev/input/by-id/benchmark-{index}"
    output_path = f"/dev/input/by-id/benchmark-{index+1}"
    args = ["systemd-run", "--service-type=notify", "--collect", "evsieve"]
    args += ["--input", "grab", "persist=exit", input_path]
    args += ["--output", f"create-link={output_path}"]
    sp.run(args)

# Create all layers.
NUM_LAYERS = 750
for i in range(NUM_LAYERS):
    print(f"Creating device {i+1}/{NUM_LAYERS}")
    create_layer(i)

# Then open the device created by the last layer.
output_device = evdev.InputDevice(f"/dev/input/by-id/benchmark-{NUM_LAYERS}")
output_device.grab()

# Sends events to the input device, then closes the input device when done.
async def send_events_then_close(device):
    timestamps_of_sending_events = []
    for event_index in range(NUM_KEYS_TO_SEND):
        keycode = e.ecodes[f"KEY_{ALPHABET[event_index % len(ALPHABET)].upper()}"]
        timestamps_of_sending_events.append(time.time())
        device.write(e.EV_KEY, keycode, 1)
        device.syn()
        await asyncio.sleep(TIME_BETWEEN_KEYS / 2)
        timestamps_of_sending_events.append(time.time())
        device.write(e.EV_KEY, keycode, 0)
        device.syn()
        await asyncio.sleep(TIME_BETWEEN_KEYS / 2)
    # Give the other tasks some time to finish reading events before we exit.
    await asyncio.sleep(1.0)
    device.close()
    return timestamps_of_sending_events

# Record the times at which the events become observable on the output device.
async def read_events(device):
    timestamps_of_reading_events = []
    try:
        async for event in device.async_read_loop():
            if event.type == e.EV_KEY:
                timestamps_of_reading_events.append(time.time())
    except OSError:
        # The input device was closed, which tears down the whole chain.
        pass
    return timestamps_of_reading_events

# Tell the user what the average difference between the input and output events is.
def present_report(timestamps_in, timestamps_out):
    total_delta = 0
    count = 0
    assert len(timestamps_in) == len(timestamps_out)
    # Sum the differences between the time at which we wrote each event to the
    # input device and the time it showed up at the output device after being
    # mapped through NUM_LAYERS layers.
    for time_in, time_out in zip(timestamps_in, timestamps_out):
        total_delta += (time_out - time_in)
        count += 1
    MICROSECONDS_PER_SECOND = 1000000
    print("")
    print(f"Average delay of {round(total_delta/count/NUM_LAYERS * MICROSECONDS_PER_SECOND * 10)/10} microseconds per layer per event over {count} events and {NUM_LAYERS} layers.")

async def main():
    timestamps_in, timestamps_out = await asyncio.gather(
        send_events_then_close(input_device),
        read_events(output_device),
    )
    present_report(timestamps_in, timestamps_out)

asyncio.run(main())
```
On my system, it outputs
Average delay of 41.5 microseconds per layer per event over 400 events and 750 layers.
There does not appear to be any worse-than-linear scaling as the chain of input devices becomes longer, at least as far as event latency is concerned. Maybe some other programs are poorly equipped to handle a large number of input devices; for example, libinput will probably need to open every single input device even if most of them are grabbed. The `epoll` syscall can wait on any number of devices in O(1) time, so an efficient program that uses epoll shouldn't be slowed down by additional devices that never generate events, other than by the one-time cost of opening them all.
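The epoll claim is easy to sketch: register a large number of idle file descriptors (ordinary pipes stand in for event devices here) and observe that a single wait call still returns only the one that is ready. This is an illustration of the scaling argument, not a benchmark; `demo_epoll_scaling` is a made-up helper name:

```python
import os
import selectors

def demo_epoll_scaling(num_devices: int = 200) -> bool:
    sel = selectors.DefaultSelector()        # backed by epoll(7) on Linux
    pipes = [os.pipe() for _ in range(num_devices)]
    for read_fd, _ in pipes:
        sel.register(read_fd, selectors.EVENT_READ)
    os.write(pipes[42][1], b"\x01")          # only one "device" generates an event
    ready = sel.select(timeout=0)            # one wait over all registered fds
    ok = len(ready) == 1 and ready[0][0].fd == pipes[42][0]
    for read_fd, write_fd in pipes:
        os.close(read_fd)
        os.close(write_fd)
    sel.close()
    return ok

if __name__ == "__main__":
    print(demo_epoll_scaling())  # True
```

The per-wakeup cost depends on the number of *ready* descriptors, not the number of registered ones, which is exactly why hundreds of grabbed-but-silent devices should be cheap.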
Also, another thing I ran into: there was a limit to how many layers I could use in the above benchmark. Specifically, 776 layers was the maximum my system could handle; I'm not sure why it is that specific number. It does seem to be possible to create more UInput devices beyond that arbitrary limit, but those devices do not show up under /dev/input/event*, and as such are practically invisible to the rest of the system.
Based on `ls /dev/input`, the event device numbers only go up to `/dev/input/event1023`. I don't know why the maximum number of layers my system could handle was 776 instead of ~1000.
A maximum of ~1024 event devices is not an unreachable cap, but still one that will in practice probably not be hit that often. Maybe the cap is arbitrary and could be raised by the kernel devs if there is a need to, or maybe there are more fundamental reasons for the cap, like a limited number of device node numbers in some POSIX standard.
First thanks for the feedback and the performance testing :).
So, this is what I did so far with UIO (which is far less than I had hoped to do in that timeframe :/ ) :
I started writing a prototype for UIO. A good deal of time went into learning new stuff that I had not used before (also getting me forward on my journey to master Rust :) ). From what I read, the only sensible way to pass access to uinput devices to another program seems to be Unix sockets - so I started learning about those. I knew poll from C/C++, but not epoll (which seems like the right choice here). I know that asking for help would have made this easier, but there was a point that I wanted to get to on my own for learning purposes. But don't worry - I'm done with that now ;)
What I got so far is a daemon and a little test client. The client can request a specific input device from the server and gets a file descriptor from which it can read events.
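For reference, the descriptor handoff itself is small: since Python 3.9 the stdlib wraps the SCM_RIGHTS mechanism in `socket.send_fds`/`socket.recv_fds`, so a sketch of it (in Python rather than Rust; the helper names are mine) looks like this:

```python
import os
import socket

def send_device_fd(sock: socket.socket, fd: int) -> None:
    # One byte of payload plus the descriptor in an SCM_RIGHTS ancillary message.
    socket.send_fds(sock, [b"x"], [fd])

def recv_device_fd(sock: socket.socket) -> int:
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]  # a fresh descriptor referring to the same open file

if __name__ == "__main__":
    # Demo: ship a pipe's read end across a socketpair and read through it.
    a, b = socket.socketpair()
    r, w = os.pipe()
    os.write(w, b"hello")
    send_device_fd(a, r)
    received = recv_device_fd(b)
    print(os.read(received, 5))  # b'hello'
```

The receiving process gets its own descriptor number that refers to the same open file description, so the daemon can keep or close its own copy independently of the client.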
A good amount of time also got into rethinking details (multiple times). So far I learned...
The exact approach from above that requests an input (`ctx.open()`) followed by requesting a (matching) output device (`ctx.get_output()`) won't work well. Instead, groups of matching in/out devices should be requested in a single call. The reason is: if M1 and M3 are already active and M2 should be inserted between them, UIO needs to take the input from M3 and give it to M2, leaving M3 without input until M2 has requested its output, which can then be given to M3. I'd rather replace M3's input in one go, which is only possible if UIO knows what that output will be like.

Because of 2 & 3, I have my doubts that using uinput devices really is a good idea. So far the only real advantage compared to a daemon that maybe facilitates shared memory between mappers or simply forwards the events seems to be that the kernel will take care of some things (e.g. filtering out invalid events). Both of these approaches might be much simpler to implement.
What do you think?
The exact approach from above that requests an input (ctx.open()) followed by requesting a (matching) output device (ctx.get_output()) won't work well. Instead groups of matching in/out devices should be requested in a single call.
This is a good insight.
I do want to emphasize that groups of matching in/out devices does not mean pairs of in/out devices. It is for example imaginable that some mapper wants to take a joystick as input and generate both a keyboard and mouse device as output.
The idea that using uinput devices will make it easier to port existing mappers to use UIO is BS. The only things that won't change are that input events are read and written from/to some stream and that the mapper has to know which output events it will create before a virtual output device can be created. And I see no reason why those would change if the approach didn't involve intermediary uinput devices.
This is a good point. The thought behind using uinput devices was that mappers could simply use `libevdev` everywhere except for the part that opens/creates uinput devices. But then again, they are unlikely to use `libevdev` for much else than opening/creating devices. Sure, they also use it to read/write events, but those are only a few lines of code that need to be changed.
It's good to be aware that there is no need to stick to uinput devices and that we have other options available.
At the same time, I haven't found a clearly better alternative yet:
Loading the mappers as shared objects into the memory space of UIO

This would of course offer the best possible performance because the overhead of each layer is only slightly bigger than a function call. It is however basically the userspace equivalent of running programs in ring0: it does have the best performance, but a single mapper crashing can bring the entire input system down.
This is also only a viable option for mappers that are written in systems languages without a runtime (i.e. C, C++, Rust, or Zig). Even for such mappers, there would be additional development overhead because each mapper needs to be able to clean up all its own memory if the mapper gets unloaded. Even languages like Rust do, AFAIK, not provide such functionality automatically, because Rust does not drop static variables.
(That does not mean that it would be impossible to write mappers in non-system languages like Python; we could offer multiple options like "either get loaded into UIO's memory space, or communicate over pipes", with mappers written in Python choosing the latter option.)
Running mappers as separate processes with shared memory

Shared memory sounds really fast, but it does not release us from the kernel's scheduler. Most of the time, each mapper should be idle, telling the scheduler to wake the mapper process again when an input is available.
Fortunately, this seems to be achievable using POSIX semaphores, which are basically slightly generalized mutexes sharable between multiple processes. Mapper B can use `sem_wait` to wait until input events become available, and Mapper A can `sem_post` after it has made input events available for the waiting Mapper B.
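As a rough illustration of that handoff, here is a two-process sketch using Python's multiprocessing primitives as stand-ins for `sem_wait`/`sem_post`. The event layout loosely mirrors `struct input_event` on a 64-bit system; none of this is UIO's actual protocol:

```python
import multiprocessing as mp
import struct

# Illustrative event layout: tv_sec, tv_usec, type, code, value.
EVENT_FMT = "qqHHi"
EVENT_SIZE = struct.calcsize(EVENT_FMT)

def mapper_b(buf, events_ready, done, results):
    # Mapper B: sleep on the semaphore (the sem_wait side) until woken.
    while True:
        events_ready.acquire()
        if done.is_set():
            break
        results.put(struct.unpack_from(EVENT_FMT, buf, 0))

def run_demo():
    buf = mp.RawArray("B", EVENT_SIZE)  # shared memory both processes can see
    events_ready = mp.Semaphore(0)
    done = mp.Event()
    results = mp.Queue()
    b = mp.Process(target=mapper_b, args=(buf, events_ready, done, results))
    b.start()
    # Mapper A: write one key press (EV_KEY=1, KEY_A=30) into shared memory,
    # then wake Mapper B (the sem_post side).
    struct.pack_into(EVENT_FMT, buf, 0, 0, 0, 1, 30, 1)
    events_ready.release()
    event = results.get(timeout=5)
    done.set()
    events_ready.release()  # wake B once more so it can notice `done` and exit
    b.join()
    return event

if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is the wakeup path: Mapper B does nothing until the semaphore is posted, which is exactly where the scheduling latency discussed below enters the picture.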
However, that still means that after Mapper A writes events to shared memory, we still need to wait until the kernel wakes up mapper B before the processing continues. I don't have proof for this, but I believe that waiting for the scheduler to give a piece of time to an idle mapper process is the biggest source of latency.
The kernel knows that Mapper B can be scheduled immediately because of a semaphore, but that is also true for other communication methods: if a virtual input device or pipe is used for communication between mappers, then the kernel would also know that Mapper B can run as soon as something is written to the input device or pipe. I haven't benchmarked this, but it is very possible that waiting on a semaphore has the same latency as waiting on an input device or pipe.
Also, the above method allows the kernel scheduler to immediately start the process that was waiting, but does not require it to do so. Even if Mapper A immediately calls `sched_yield` to ask the kernel to give its slice of computation time to another thread, there is no guarantee that the kernel will schedule Mapper B; the kernel is free to schedule a plethora of other programs before Mapper B.
According to the discussion I found here, there may be some way to get something done using custom scheduling groups and the Linux-specific `FUTEX_WAIT_REQUEUE_PI` operation, but this is complicated stuff I haven't gotten fully through yet.
Letting mappers communicate using pipes

Basically the same as letting them communicate through uinput devices, except that (1) we don't pollute the `/dev/input` space, and (2) we need a custom protocol to decide how we handle communication about initial state and capabilities over pipes.
Furthermore, if there is only a single mapper running (the most common use case!), then there would be significant overhead because UIO would have to translate an input device to a pipe, the mapper would translate a pipe of input events to a pipe of output events, and then UIO would have to translate the pipe of output events to a uinput device. That requires three read/write cycles, whereas only a single read/write cycle would be necessary if the mapper read directly from the real input device and wrote to the real output device.
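For point (2), the event-forwarding part of such a custom protocol could be as small as fixed-size frames that mirror `struct input_event`. The framing below is illustrative only; a real protocol would still need extra message types for capabilities and initial state:

```python
import os
import struct

# Fixed-size frames mirroring struct input_event (native long sizes):
# tv_sec, tv_usec, type, code, value.
EVENT_FMT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FMT)

def write_event(fd: int, sec: int, usec: int, etype: int, code: int, value: int) -> None:
    # A single frame is far below PIPE_BUF, so the write is atomic on Linux pipes.
    os.write(fd, struct.pack(EVENT_FMT, sec, usec, etype, code, value))

def read_event(fd: int) -> tuple:
    return struct.unpack(EVENT_FMT, os.read(fd, EVENT_SIZE))

if __name__ == "__main__":
    r, w = os.pipe()
    write_event(w, 0, 0, 1, 30, 1)  # EV_KEY, KEY_A, press
    print(read_event(r))            # (0, 0, 1, 30, 1)
```

Reusing the kernel's own event layout keeps the translation between a real device and the pipe cheap, which matters given the extra read/write cycles described above.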
UIO needs to ask mappers to give back file descriptors. I don't think that a mapper keeping, but not using, a file descriptor would actually interfere with other mappers. But there might be other issues, like not being able to close it when it's not needed anymore. Also, there would be no guarantees of exclusiveness. So far the only way I can think of to enforce this is to kill the mapper. I have not done any real research into this topic yet, but it seems complicated.
Another option could be to not reuse event devices that were allocated to mappers that quit. E.g. if the chain is (A → B → C), and B quits to reduce the chain to (A → C), then we could give Mapper A a brand new output device and Mapper C a brand new input device, and close the devices that were previously used for the (A → B) and (B → C) transitions.
This does have the disadvantage that the state of the input devices would be lost, e.g. if a user was pressing and holding the A key, then that key might get released if any mapper quits. This matters for the use case of transient mappers, like some xdotool-like program typing a few keys and then quitting.
So far the only real advantage compared to a daemon that maybe facilitates shared memory between mappers or simply forwards the events seems to be that the kernel will take care of some things (e.g. filtering out invalid events).
An additional bonus is that it is possible to ask the kernel about the current state of the event device, e.g. you can tell where an absolute axis currently is or whether a certain key is pressed before you receive any events related to them. This is handy when a device gets handed over to a different process. I suppose that this could also be achieved if all events were routed through the UIO daemon instead of using mapper-to-mapper communication, but it would be harder to ensure in the mapper-to-mapper case.
(A mapper could announce the current state of each device upon exit, but that requires cooperation from the mappers, and fails to work if the mapper crashes or is programmed to just call `exit()` without cleaning up.)
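For illustration, the kind of state snapshot discussed here maps onto ioctls that python-evdev already wraps (`EVIOCGKEY` behind `active_keys()`, `EVIOCGABS` behind `absinfo()`). The `snapshot_state` helper below is hypothetical and duck-typed, so it only assumes the `InputDevice`-like methods named in its docstring:

```python
EV_ABS = 3  # event type constant from <linux/input-event-codes.h>

def snapshot_state(device) -> dict:
    """Snapshot the held keys and absolute-axis positions of a device.

    `device` is duck-typed against python-evdev's InputDevice API:
    active_keys() issues EVIOCGKEY, absinfo() issues EVIOCGABS,
    capabilities() lists the event types and codes the device supports.
    """
    state = {"keys": sorted(device.active_keys())}
    for axis in device.capabilities(absinfo=False).get(EV_ABS, []):
        state[axis] = device.absinfo(axis).value
    return state
```

A process taking over a device could call something like this once at startup, instead of relying on the previous owner to announce its state before exiting.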
But the main reason to stick to uinput devices is just a baseline reluctance to invent a new event protocol when the current protocol isn't broken, unless there are clear advantages to the new protocol. If communication through shared memory can be shown to indeed have lower latency than communication through virtual input devices, then that would be a good reason to switch to a new protocol using shared memory.
I suppose that using uinput devices is somewhat broken in that it pollutes the `/dev/input` space, and that it is possible for a certain Mapper B to crash, which releases all event devices it had grabbed, which carries the risk of the rest of the system shortly observing the output of Mapper A directly.
(If that is the only issue, it may be possible to convince the kernel devs to add an API for creating a uinput device that does not show up in `/dev/input` and is only visible, through the returned file descriptor, to whatever process created it. Since such "hidden" devices can already be created when over 1024 event devices exist, it may not be that hard to implement.)
I do want to emphasize that groups of matching in/out devices does not mean pairs of in/out devices. It is for example imaginable that some mapper wants to take a joystick as input and generate both a keyboard and mouse device as output.
Yes, but... this is also something where things might get more complicated: if, let's say, M1 has a mouse input and wants mouse & keyboard as output, and we also have M2 that wants mouse & keyboard as input... does it get the virtual keyboard of M1, or the real keyboard? Or perhaps both? It might actually be better in this case for M1 to create the keyboard in a separate context to avoid confusion (then M2 would clearly get the real keyboard).
Loading the mappers as shared objects into the memory space of UIO
I think this one has too many pitfalls for now. Maybe it could be implemented as an option later.
Running mappers as separate processes with shared memory
Shared memory sounds really fast, but it does not release us from the kernel's scheduler. [..] POSIX semaphores [..] I haven't benchmarked this, but it is very possible that waiting on a semaphore has the same latency as waiting on an input device or pipe.
I had not heard of POSIX semaphores yet - thanks for bringing them up. Multithreading will certainly introduce some latency. In the end benchmarking different methods is the only way to be sure which is best.
Letting mappers communicate using pipes Basically the same as letting them communicate through uinput devices, except (1) we don't pollute the /dev/input space, and (2) we need a custom protocol to decide how we handle communication about initial state and capabilities over pipes.
I wonder how much of the initial state we actually need to handle (assuming you mean things like which keys are pressed and which LEDs are on). This might be a good candidate for not being necessary in the first release, but I need to think about it some more.
Other than that, some custom protocol is necessary anyway. It would be nice to keep it simple, of course.
I think what actually bothers me about uinput the most right now is the need to ask processes to switch to another device. This means
In between, new devices need to be created or destroyed.
Of course, having shared memory directly between mappers might lead to similar needs.
If everything between the real device and the virtual output that is read by Wayland is done by UIO simply forwarding events to the next mapper (either via Unix pipes or via shared memory between mapper and UIO), any such change would just be an adjustment of inner state in UIO.
As for xdotool and similar tools - I think these simply want to insert something that no other mapper cares about, just before Wayland. So why not just reserve some independent output devices for them?
But let's get back to the "what's best to do now": My plan was (and still is) to provide a library that wraps any communication to UIO. This has one major advantage: whatever approach is selected - ideally it should be possible to implement the others later and maybe even support a mix of all of them without changing the client code. In the case of a single mapper, UIO could be configured so that the mapper simply gets direct access to the input & output devices.
So what I currently guess to be the best plan is:
The big question (2): what is the easiest solution to implement? My guess is UIO passing events between mappers, using Unix sockets for communication.
I also plan to put the code on github soon, but need to think a bit about licensing (probably LGPL).
UInput Orchestrator (UIO)
What is the idea?
A daemon that manages connections between mappers by creating, assigning and re-assigning multiple uinput devices, but could be extended to something else later. This should result in a stable path where multiple input mappers can process an input device in a deterministic order. Mappers connect through a few functions in a library and can request file descriptors to matching (virtual) input devices. UIO's job is to ensure that
Disclaimer: this is an early draft and I have not done enough research to be sure that it is technically feasible (or even possible).
Let's start with a few diagrams.
The first is about order
Each mapper has contexts, identified by its path (abbreviated here to M#) and role. Roles can be requested by the mapper or configured by the user. If configured by the user, the mapper can request available roles from UIO. If a mapper wants to create a context, a GUI asks the user to confirm. "New" is where new contexts pop up by default. All contexts are specific to an input device (although input devices could be grouped for easier configuration).
Startup and first events
I hope this is somewhat clear..
UIO makes sure there is a chain of (for now!) uinput devices. It can open and create input devices, then share the FDs with mappers through UIOInput/UIOOutput. Virtual devices can be kept open as long as UIO deems necessary (e.g. for short-lived scripts, or for a short while after a mapper exits in case it's just restarting).
I hope it is possible to manage access rights to the virtual devices in a safe and stable way.
UIOInput and UIOOutput offer transparent read/write functions. My plan is to use uinput for now, but this may be extended and configured to support other ways of communication between mappers like a direct shared buffer (for performance) or one that is managed by UIO and keeps state of all keys (lower performance, but safer handling of some cases).
`read_evdev()` means that it returns the event as evdev would. We could add transformations to libinput structs, etc. later.

Window change
We may have options to handle cases where a keycode changes while it's pressed. But I'm not sure how/where to do that yet.
Advantages
Disadvantages (for now)
Some open questions
Implementation details
UIOOutputRequirements - a struct holding the parameters by which a fitting output is chosen.
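As a purely hypothetical sketch of what such a struct could hold - every field name below is an assumption, since the draft only names the struct, and Python is used for illustration even though the prototype is being written in Rust:

```python
from dataclasses import dataclass, field

@dataclass
class UIOOutputRequirements:
    # Every field name here is an assumption; the draft only names the struct.
    device_type: str                                  # e.g. "keyboard", "mouse"
    event_types: list = field(default_factory=list)   # e.g. [EV_KEY, EV_REL]
    codes: dict = field(default_factory=dict)         # event type -> supported codes
    name: str = "uio-virtual-device"                  # name the virtual device reports
```

UIO would match a mapper's requested output against existing devices in the chain (or create a new virtual device) based on fields like these.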