Thanks @kpocza for supporting WSL and the feedback. Adding @russalex, @jackchammons as FYI. Having some kind of a DDK/plug-in model to extend LxCore is definitely something we are aware of. But, it's not something we are able to deliver on immediately. I encourage you to create a user-voice ticket (if there is not one already) for this request and if there is enough interest, we will definitely look into it.
@kpocza Alex has to rewrite those tools at least once a month because the interfaces in question are not stable. Also, some of this stuff, I think, the team is already programming directly. (Mr. Ionescu mentioned that there is already some kind of stub graphics interface that is hidden down there ( https://github.com/ionescu007/lxss/issues/9 )).
The devs haven't even versioned and documented the usermode API yet. It's going to take some time before they open up the kernel side.
I have to say though, very cool project. I wonder if you could do something more sophisticated than piping the framebuffer and instead make a userspace proxy for libgl that forwards OpenGL to a WPF application for rendering as well as display. That could potentially give dramatically improved performance vs third-party X as well as the software rendering you are doing now.
Yeah, I gather the ADSS Bus hack isn't working at the moment. And yes, /dev/fb0 once existed briefly (#230).
Beautiful work.
@sunilmut Okay, I'll open a user-voice ticket. @fpqc I'll check the technical possibilities of forwarding OpenGL requests to the WPF app. Great idea, thank you. The other handicap of the way I present the framebuffer is that the lxss side mmaps a regular file which is synced to disk, and the WPF side reads the file periodically and presents its content. (mmapping on the W32 side caused more flickering than reading it periodically.) So it needs extensive IO; I don't know whether Cache Manager helps here or not. I tried to open the lxss processes that are accessing the fb to be able to read their memory and get direct access to the mmapped fb content from the W32 side (OpenProcess), but it's not allowed. It's only possible to query limited information about lxss processes from the W32 side, so it's no surprise that Task Manager can't show much info.
@kpocza Possibly stupid idea: Could you replace the actual on-disk file /dev/fb0 with an AF_INET UDP socket and then listen on the loopback interface? The idea would be that you could "stream" the framebuffer rather than poll it from disk.
I also heard that there are functions in the undocumented usermode API to share memory directly between an lxss and a Win32 application.
@fpqc - you can totally stream the framebuffer, but that's precisely describing what VNC/TightVNC does. If you can map memory, you win of course. That is what I was talking about back in August. Unfortunately there's still all that low hanging fruit I mentioned left.
@therealkenc -- VNC also does some processing (compression, etc).
Could you simply network-send your entire framebuffer for every frame? I realize that would be horribly slow in real life. But the loopback interface isn't real life :-) If it really is zero-copy, then you're not actually sending anything; you're basically emulating a read-only block of shared memory.
Actually it's not horribly slow at all. That's precisely what's happening when you watch a youtube video playing on firefox on WSL in VcXsrv. How did you think the bytes got there? ;) For what it is worth, it isn't zero-copy in real life because both ends copy the crap out of buffers even if the kernel doesn't internally. But your memory bus pushes 20+GBytes/sec these days so "whatever". A $30 Chromecast dongle has all the horsepower necessary to decompress bytes off WiFi and push them raw over HDMI.
Yes, I was just suggesting streaming over disk writes as a slightly better stopgap solution. I think if you wanted to properly pipe the libgl calls into Windows though, I think you would need to share memory so the WPF application could, for example, expose the textures and models and shaders and whatever into the driver by DMA.
You can marshal every syscall over ipv4 if you feel inclined. The problem with any streaming IPC solution is you have to emulate dirty pages in software. When I change a byte in a texture there's no syscall for Krisztián to sniff. It's completely doable, mind, and is in fact what VMs have to do when there's no hardware virtualization (VT-x). But it probably isn't going to let you play Tomb Raider, and even if it did, there's easier ways to "get there".
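To make "emulate dirty pages in software" concrete, here is a minimal sketch that polls a buffer against a shadow copy and reports which pages changed. The function name, the 4096-byte page size, and the shadow-copy approach are all assumptions for illustration, not anything LoWe or WSL actually does. Note that it only tells you that a page differs at poll time; it says nothing about when or in what order the writes happened, which is exactly the limitation described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FB_PAGE_SIZE 4096u /* assumed page size, for illustration only */

/* Compare buf against shadow one page at a time.
 * Indices of pages that differ are written to dirty[], which must have room
 * for every page in the buffer. Each differing page is then copied into
 * shadow so the next call only reports newer changes.
 * Returns the number of dirty pages found. */
size_t find_dirty_pages(const unsigned char *buf, unsigned char *shadow,
                        size_t len, size_t *dirty)
{
    assert(buf != NULL && shadow != NULL && dirty != NULL);

    size_t count = 0;
    for (size_t off = 0; off < len; off += FB_PAGE_SIZE) {
        /* The last page may be shorter than a full page. */
        size_t chunk = (len - off < FB_PAGE_SIZE) ? len - off : FB_PAGE_SIZE;
        if (memcmp(buf + off, shadow + off, chunk) != 0) {
            dirty[count++] = off / FB_PAGE_SIZE;
            memcpy(shadow + off, buf + off, chunk);
        }
    }
    return count;
}
```

A sender could call this on a timer and transmit only the pages listed in dirty[], at the cost of scanning the whole buffer on every poll.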
Yeah, in this case @kpocza controls both the sender and the receiver process.
I wonder if the NT kernel is clever enough that you could cause it to do DMA by sending a buffer over loopback, and then taking that pointer and immediately handing it off to the relevant graphics subsystem. The loopback send() is supposedly optimized as just sharing the pages anyway...
Zero chance (and I mean zero) of that. Say I've written 4GB worth of textures into my fancy 8GB gaming card. I change a byte. Somewhere deep inside Mesa that byte was memcpy()ed. What page do you propose to send()?
Even with an IPC mechanism that allows you to map memory into two different processes, you'll need some kind of synchronization mechanism. If you're familiar with System V shared memory (which is currently being worked on, by the way), it allows you to create shared memory regions that can be mapped into multiple processes. System V IPC also exposes cross-process message queues and semaphores that are typically used to synchronize access to the memory.
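As a rough illustration of that pattern on ordinary Linux, the sketch below creates a System V shared memory segment and a semaphore, has a child process write a value, and has the parent wait on the semaphore before reading it. The function name shm_roundtrip is hypothetical, and per this thread these calls were not yet available on WSL at the time, so treat this as a sketch of the general technique rather than something that runs there.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* On Linux the caller must define this union for semctl(). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Child writes `value` (expected to be non-negative) into shared memory and
 * posts a semaphore; the parent blocks on the semaphore, then reads.
 * Returns the value read by the parent, or -1 on any failure. */
int shm_roundtrip(int value)
{
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    int result = -1;
    union semun arg = { .val = 0 };      /* semaphore starts at 0: not ready */
    int *shared = shmat(shmid, NULL, 0); /* mapping is inherited across fork */

    if (shared != (void *)-1 && semctl(semid, 0, SETVAL, arg) == 0) {
        pid_t pid = fork();
        if (pid == 0) {
            *shared = value;
            struct sembuf post = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
            semop(semid, &post, 1);      /* signal: data is ready */
            _exit(0);
        } else if (pid > 0) {
            struct sembuf wait_op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
            if (semop(semid, &wait_op, 1) == 0) /* block until child posts */
                result = *shared;
            waitpid(pid, NULL, 0);
        }
    }

    /* Clean up the attachment and both kernel objects. */
    if (shared != (void *)-1)
        shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL);
    semctl(semid, 0, IPC_RMID);
    return result;
}
```

Without the semaphore, the parent could read the segment before the child has written it, which is the synchronization problem in miniature.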
On the topic of LxBus: Documenting the interface is definitely something we plan on doing eventually. If you remember Project Astoria, LxBus was the protocol behind how Android graphics were displayed in Windows Phone apps.
Yep, the request for locks between WSL and Windows comes right on the heels of someone getting mmap() between WSL and Windows working. Of course, you can always use spinlocks on the mmapped region and LxBus for synchro when you fail to get the spinlock... 😏 You do know this is a slow death march to someone doing fork() on the Windows side, right?
@therealkenc , "DMA", your answer's so boring :-)
Implement send() over loopback of a large, page-aligned block as directly mapping the pages from the sending process into the receiving process. Does this handle mutable structures nicely? Of course not. But I claim that lots of applications don't care. For games, depending on the game, I would expect the bandwidth-intensive stuff to be swapping textures in and out of graphics memory. Textures are static; they're amenable to this sort of pseudo-DMA. For screen-streaming, I would expect @kpocza's application to eventually need to be double-buffered to avoid screen tearing; then you just swap back and forth between the two buffers, sending one while writing the other.
Granted, real direct DMA would be better :-)
At the moment what we have is a regular file of size 1280x720x4= 3686400 bytes (~=4MB, no double buffering) at /dev/fb0 . This amount of data seems to be feasible to push through the loopback even 50+ times a second. Until the low hanging fruit mentioned by @therealkenc gets support this seems to be feasible.
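The arithmetic behind that estimate can be checked with a few lines. The helper names below are hypothetical; the numbers are the ones from this thread (1280x720, 4 bytes per pixel, 50 frames per second).

```c
#include <stdint.h>

/* Size in bytes of one uncompressed frame. */
uint64_t fb_frame_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Bytes per second needed to send every frame in full at the given rate. */
uint64_t fb_stream_rate(uint64_t frame_bytes, uint32_t frames_per_second)
{
    return frame_bytes * frames_per_second;
}
```

For 1280x720x4 this gives 3,686,400 bytes per frame and 184,320,000 bytes per second at 50 fps, roughly 184 MB/s, which is small next to the 20+ GBytes/sec memory bandwidth figure mentioned earlier in the thread.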
For this, mmap could be replaced with shmget/shmat calls; however, according to https://msdn.microsoft.com/en-us/commandline/wsl/release_notes these are not yet supported.
I may check whether anonymous mmapping is supported and if it's possible to inject a thread or any piece of code to the controlled lxss process by loweagent. This code could directly read the anonymously mmapped memory and send it on a loopback socket.
@kpocza Stupid question again: Why couldn't you use something like socat to expose /dev/fb0 as a socket directly?
Something like this: http://stackoverflow.com/questions/2149564/redirecting-tcp-traffic-to-a-unix-domain-socket-under-linux but in reverse.
@fpqc So turning /dev/fb0 into a UNIX domain socket? The problem is that the application requires /dev/fb0 to be a contiguous ~4MB area where any pixel can be changed by the app at any time at any x,y indices. "Appending" is not allowed.
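To show why random access matters, here is a small sketch of how an application typically addresses a pixel in a linear framebuffer. It assumes the 1280x720, 4-bytes-per-pixel layout from this thread with no row padding; the names are hypothetical. Every write lands at a computed offset inside the buffer, which a stream-oriented socket has no way to represent.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed geometry: 1280x720, 4 bytes per pixel, rows stored back to back. */
enum { FB_WIDTH = 1280, FB_HEIGHT = 720, FB_BPP = 4 };

/* Byte offset of pixel (x, y) from the start of the framebuffer. */
size_t fb_pixel_offset(uint32_t x, uint32_t y)
{
    return ((size_t)y * FB_WIDTH + x) * FB_BPP;
}

/* Write one 32-bit pixel at (x, y). Returns 0 on success, -1 if the
 * coordinates fall outside the framebuffer. */
int fb_put_pixel(uint8_t *fb, uint32_t x, uint32_t y, uint32_t pixel)
{
    if (x >= FB_WIDTH || y >= FB_HEIGHT)
        return -1;
    memcpy(fb + fb_pixel_offset(x, y), &pixel, FB_BPP);
    return 0;
}
```

With an mmapped region, fb_put_pixel is just a store into memory and no syscall is issued, so nothing on the outside is notified that a pixel changed.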
@kpocza Yeah, looking into it, you'd effectively have to rewrite the remote framebuffer protocol.
@aseering. "Implement send() over loopback of a large, page-aligned block as directly mapping the pages from the sending process into the receiving process."
Again, which "large page-aligned block" do you intend to send(), say, 30 times a second? You can't say "all of it", because the gpu context of the game is not in contiguous userspace memory. And even if it were contiguous (it isn't), what would you do with the receiving buffer on the other side? "Nothing" isn't the answer, because the order of the memory accesses on the sending side matters, and you have no idea what order those memory accesses were made in at the point you do the send and get the recv. You have to synchronize, like Ben was saying.
There's a classic lkml post from Linus on the subject of zero-copy that's worth a read for anyone interested in this sort of thing. Yep you can do cool things like GPUDirect, but now you're talking about kernel drivers for the purpose, not a magical userspace send().
You can send() a contiguous framebuffer n times a second, sure. That's a given and that's what X does.
@therealkenc I was betting it was going to be the Linus post where he curses out FreeBSD's zero-copy sendfile() implementation for copying the HP-UX one.
Good bet; also, say no to drugs.
@therealkenc -- you're a bright person who knows systems stuff really well; I have a hard time believing that you couldn't correctly implement a mutex in terms of sockets if you really really had to.
It sounds like you are trying to find the ways in which this is impossible. I'm trying to find the ways in which this is impossible and come up with the computationally-cheapest workarounds. I'm mostly doing so for personal amusement at this point, so if you're not interested, I'll take a pass on this thread.
(Incidentally, don't get me started on Linus: He's extremely smart, but he's even more opinionated; you have to be twice as smart as he is to understand what he really means, which means that basically no one does and that his statements spin off piles of trolls who think they understand his points but really don't. It's amusing when they start arguing with each other... Anyway, if you peel back the topmost technical layers on the say no to drugs post, there's a commentary on code complexity, which is totally valid for Linux but which I don't care about here because I'm trying to construct a horrible hack and can afford to special-case everything :-) )
It's all a time sink for amusement. Yes you can do mutexes with a socket. It doesn't help because you don't know when the application is writing to memory. You only know when syscalls happen.
I create "Worst GPU Ever" using an off the shelf PCI FPGA development board. It has two 32-bit addresses. Address 0 stores a 4-byte integer. Address 1 stores another. My Linux driver has an ioctl that writes addr[0] + addr[1] to addr[0] once my FPGA has signaled to the kernel driver that it has finished adding the registers. I write a Windows driver too.
Here's a sequence of code in my Linux application:
int fd = open("/dev/worstgpu", O_RDWR);
// Map the two 32-bit device registers into this process.
uint32_t *addr = mmap(NULL, 2 * sizeof(uint32_t),
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
addr[0] = 1;   // plain memory stores: no syscall, nothing in strace
addr[1] = 2;
ioctl(fd, WORST_MATH_PIPELINE_FLUSH); // addr[0] = 3
printf("result: %u\n", addr[0]);
The only thing you see in the strace is an open(), a mmap(), an ioctl(), and an unrelated write() to stdout. You see the address mapped. It's contiguous. You know the length. Let's say it's even page aligned. But now... what?
You can't just blindly send those two addresses back and forth every time you happen to see that fd touched. You have no idea what that ioctl does. You don't know the result is in addr[0]. Critically, you don't know if another thread is busy diddling those addresses (or in a real case who knows what addresses) while you are doing the round trip. If you send everything back, you'll clobber anything that was touched in the meantime on return.
The problem is you are insisting on send() semantics, which cannot give you what you need. It has nothing to do with zero copy tricks. Those tricks may enhance the performance, but do not affect the semantics.
Of course you can write a specialized protocol that knows all about my FPGA, my Linux driver, and my Windows driver which streams add operations back and forth with send(). But you don't know anything about that. All you have is a black box of memory, and no idea when something has written to it.
All you have is a black box of memory, and no idea when something has written to it.
Remember, in this scenario, you also control the application that's doing the writing, and can instrument it however you want. Preferably efficiently / at a higher level of abstraction than the individual page.
Also -- I'm not looking for zero-copy semantics. Just "zero-copy performance", which I'll define to mean "minimizing the overhead of this slightly-ridiculous abstraction as much as possible".
Also -- we've now well and truly co-opted the OP's thread. Sorry about that... I've posted this thread here, if anyone cares to continue:
Back to the main topic from the applicability point of view. process_vm_readv is supported by WSL, so it's possible to read the target process memory without disturbance. According to my PoC, grabbing the return value of mmap of /dev/fb0 and running process_vm_readv repeatedly in a separate thread against that address (maybe in a timer later) the fb memory can be captured. Later it can be repeatedly sent to the Win32 side on a socket (most probably in a separate thread).
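For readers who haven't used it, this is roughly what a single process_vm_readv call looks like on Linux with glibc. The wrapper name read_remote is hypothetical, and this is only a sketch of the syscall itself, not the actual LoWe PoC. In the scenario above, pid would be the traced process and remote_addr the value that its mmap of /dev/fb0 returned.

```c
#define _GNU_SOURCE /* process_vm_readv is a Linux/glibc extension */
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Copy `len` bytes starting at `remote_addr` in process `pid` into `local`.
 * Returns the number of bytes read, or -1 with errno set on failure.
 * The caller needs ptrace-level permission over the target process. */
ssize_t read_remote(pid_t pid, const void *remote_addr, void *local, size_t len)
{
    struct iovec local_iov = { .iov_base = local, .iov_len = len };
    struct iovec remote_iov = { .iov_base = (void *)remote_addr, .iov_len = len };

    /* One local and one remote segment; the flags argument must be 0. */
    return process_vm_readv(pid, &local_iov, 1, &remote_iov, 1, 0);
}
```

The read is not synchronized with the target's writes, so a frame captured this way can contain a mix of old and new pixels, which is the tearing issue raised earlier in the thread.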
What is a bit embarrassing is that I couldn't change the mmap mode to anonymous+private by setting the respective registers before the syscall via ptrace. Maybe I'm lame... it will turn out.
+1 on the OP's request.
I am investigating the possibility of writing a FUSE driver for WSL. However WSL drivers do not seem to follow the familiar IRP model, and instead seem to use a VFS-compatible interface. Unfortunately there is very little information other than Alex Ionescu's lxss project.
I understand that WSL is probably a moving target right now and rapidly evolving. But please provide us documentation once you find that the internal interfaces have stabilized enough.
I still don't know if there is a User Voice ticket for this, as @sunilmut suggested. I am skeptical of the number of hits it will get, but I'll certainly vote for it. I would really like to see this stuff published, even if informally in a blog post. I can appreciate how having to interface with the WinRT or MSDN/TechNet people would burn a lot of cycles. And who likes writing docs. But the "interface isn't stable" narrative is starting to approach its best-before expiry date.
Fun with lxss isn't going to help you much with FUSE, which I mentioned in #1309 (message). You can use it to marshal anything you want to and from win32 (including a filesystem, if you wanted), but that doesn't get you FUSE on WSL.
@therealkenc wrote:
Fun with lxss isn't going to help you much with FUSE, which I mentioned in #1309 (message). You can use it to marshal anything you want to and from win32 (including a filesystem, if you wanted), but that doesn't get you FUSE on WSL.
I agree. I only looked at the lxss project because it was suggested to me that it has (had?) a working driver on WSL. Its method of doing marshaling is not how I would approach doing IPC for FUSE purposes. Rather I would create a version of WinFsp I/O queues: https://github.com/billziss-gh/winfsp/wiki/WinFsp-as-an-IPC-Mechanism
Full disclosure: I am the author of WinFsp which adds FUSE capabilities to Windows. Hence my interest in FUSE for WSL.
I'm working on this side project: https://github.com/kpocza/LoWe
It's using ptrace to intercept and modify syscalls to emulate some devices that are required to run X server, video playback with sound (ALSA), etc. I can run GUI apps without a third party X server.
However according to this: https://github.com/ionescu007/lxss/tree/master/lxdrv it's possible to create custom /dev-s. This guy reverse engineered a lot of stuff. My project would behave much better if I could implement the functionality of the devices like /dev/fb0, /dev/input/mice, /dev/snd/pcm*, etc. in kernel mode.
It would be almost impossible to reliably reverse engineer lxcore.sys in more detail, for various reasons.
My suggestion is to document the public interface functions of lxcore.sys and add the required headers and other stuff to WDK.