rust-windowing / softbuffer

Easily write an image to a window
Apache License 2.0

Investigate more optimal way to implement CoreGraphics backend #83

Open ids1024 opened 1 year ago

ids1024 commented 1 year ago

Apparently macOS (and iOS, https://github.com/rust-windowing/softbuffer/issues/43) has a framework called IOSurface for exchanging framebuffers and textures between processes, which sounds similar to the idea behind dmabufs on Linux. I think we should be able to use IOSurfaces for a front and back buffer, and use IOSurfaceGetBaseAddress to get a pointer to write into for no-copy presentation (https://github.com/rust-windowing/softbuffer/pull/65)? Assuming it can work with the right pixel format.

Or are there issues with this, or a better way?

ids1024 commented 1 year ago

http://russbishop.net/cross-process-rendering describes how it is possible to create an IOSurface with a size and format, access it from CPU, and set it as the contents of a CALayer.

ids1024 commented 1 year ago

Or possibly we could just use CGImage as is currently used, but with a CGDataProvider that reads from memory we can mutate? Presumably CGImage/CGDataProvider assume the memory isn't mutated, but we could mutate it once the provider is released. But that isn't so simple without any guarantees about when it will be released, and we probably can't block waiting for that either.

Edit: See https://developer.apple.com/documentation/coregraphics/cgdataproviderreleasedatacallback:

When Core Graphics no longer needs direct access to your provider data, your function is called. You may safely modify, move, or release your provider data at this time.

ids1024 commented 1 year ago

So comparing these:

For performance concerns, benchmarking is best. But we'd need a representative benchmark, an implementation of both, and multiple types of hardware.

ids1024 commented 1 year ago

Oh, I forgot about buffer stride.

Testing this (https://github.com/rust-windowing/softbuffer/pull/95), it looks like we can't just set the stride to always match the width, so to use IOSurface we'd need to provide a Buffer::stride method. And users of the library would have to consider that.

This would probably also be needed for https://github.com/rust-windowing/softbuffer/issues/42. Or if we wanted to use dmabufs instead of shm on wayland, etc.
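To illustrate what handling a stride would mean for users of the library, here is a minimal sketch. It assumes a buffer laid out as rows of `stride` pixels, where `stride` may be larger than `width` because rows are padded. The `fill_gradient` helper and its parameters are hypothetical and are not softbuffer's API.

```rust
/// Fills a strided pixel buffer with a simple gradient.
/// `stride` is the number of u32 pixels per row in memory (an assumption
/// for this sketch); it may be larger than `width` when rows are padded.
fn fill_gradient(buf: &mut [u32], width: usize, height: usize, stride: usize) {
    assert!(stride > 0 && stride >= width);
    assert!(buf.len() >= stride * height);
    for (y, row) in buf.chunks_exact_mut(stride).take(height).enumerate() {
        // Only the first `width` pixels of each row are visible;
        // the remaining pixels are padding and are left untouched.
        for (x, pixel) in row[..width].iter_mut().enumerate() {
            let r = (x % 256) as u32;
            let g = (y % 256) as u32;
            *pixel = (r << 16) | (g << 8);
        }
    }
}

fn main() {
    let (width, height, stride) = (3, 2, 4);
    let mut buf = vec![0xFFFF_FFFFu32; stride * height];
    fill_gradient(&mut buf, width, height, stride);
    // Pixel (x, y) lives at index y * stride + x, not y * width + x.
    let (x, y) = (2, 1);
    println!("{:08x}", buf[y * stride + x]);
}
```

Code that indexes with `y * width + x` would be wrong whenever `stride != width`, which is why exposing the stride is an API change rather than an internal detail.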

LoganDark commented 1 year ago
  • someone with a Mac that has a discrete GPU

Could be me, have a Mac right here with an AMD DGPU, as long as IOSurface exists on macOS 10.14.

ids1024 commented 1 year ago

https://developer.apple.com/documentation/iosurface says it was introduced in macOS 10.6 (sorry PowerMac G5 users), so that much shouldn't be an issue.

LoganDark commented 1 year ago

Great, I could proceed forward with:

And as a bonus, implementing it all the way back on macOS 10.14 would ensure that softbuffer still works back to at least that version. (No reason why it shouldn't, but it's a personal goal of mine to keep those old intels supported!)

I should be free to do any of those in around an hour :)

LoganDark commented 1 year ago

Reading up it looks like you're talking about having to expose a stride, let me introduce: imgref! If softbuffer needs a 0.4.0 for this, I'd be glad to participate in that API redesign since I've worked with these types of signatures somewhat extensively (grumble grumble looks at unreleased pixels competitor). But anyway, take a look at my proposal above and see if anything looks reasonable to you. :)

ids1024 commented 1 year ago

https://github.com/rust-windowing/softbuffer/pull/95 has an implementation using IOSurface. Which requires an API change to expose stride. And it updates the winit and animation examples to use this. https://github.com/rust-windowing/softbuffer/pull/96 instead copies into an IOSurface on present (which requires no API changes). I did some performance testing of both on M1.

I wonder if there's a good way to automate benchmarking of softbuffer performance.

LoganDark commented 1 year ago

#95 has an implementation using IOSurface. Which requires an API change to expose stride. And it updates the winit and animation examples to use this. #96 instead copies into an IOSurface on present (which requires no API changes). I did some performance testing of both on M1.

I wonder if there's a good way to automate benchmarking of softbuffer performance.

I'll check them out. I don't have an M1 to test with, but if you do, that should cover everything. My benchmark method tends to be instrumentation using Instant::now(). It's not perfect, but the margin of error is usually somewhere on the order of milliseconds, and copies of large buffers are usually much more expensive than that, so it should be good. (I'll figure it out when I have my paws on some local tests)

Once I have some thoughts I'll leave them on the relevant PR, or here if they affect both or are in general.

LoganDark commented 1 year ago

Alright, so based on my testing, for total render times:

I think the 16ms might be a fluke here, it makes you think it might be vsync but it's consistently lower than 16ms for small windows and consistently higher than 16ms for larger-than-screen windows. In fullscreen, it doesn't seem to ever take longer than 18ms or so, but this is still beat by master's 7ms.

Also, copy-to-iosurface is clearly worthless and should be scrapped, as benchmarks prove that more copies won't help anything. /hj

Here are some more detailed breakdowns per-branch:

Now Wait Just A Minute, there's something fishy here.

Let's see:

This makes me wonder if IOSurface is somehow magical! The memory backing it seems to somehow be more expensive than normal memory, perhaps it's some sort of MMIO or something. Anyway, this prompted me to do some more testing. My method of filling buffers quickly is to use rayon to fill it using multiple threads, so let's try that:

Much better?

As far as I can tell, iosurface-wip is the way to go, because it's a lot more consistent than master even if it's slightly slower to write. Meanwhile copy-to-iosurface... yeah. Throw it in the bin, lol
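For readers who want to try the multi-threaded fill mentioned above, here is a rough sketch. The comment uses rayon; this version swaps in `std::thread::scope` from the standard library so it has no dependencies. The `parallel_fill` helper is hypothetical and is not the code that was benchmarked.

```rust
use std::thread;

/// Fills `buf` with `color`, splitting the work across up to `threads`
/// scoped threads (hypothetical helper for illustration).
fn parallel_fill(buf: &mut [u32], color: u32, threads: usize) {
    let threads = threads.max(1);
    // Round up so every pixel lands in some chunk; keep the chunk
    // length at least 1 because `chunks_mut(0)` would panic.
    let chunk_len = ((buf.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        for chunk in buf.chunks_mut(chunk_len) {
            // Each thread gets exclusive access to its own chunk.
            s.spawn(move || chunk.fill(color));
        }
    });
}

fn main() {
    let mut buf = vec![0u32; 1920 * 1080];
    parallel_fill(&mut buf, 0x00FF_8000, 4);
    assert!(buf.iter().all(|&p| p == 0x00FF_8000));
}
```

Whether this helps on memory backed by an IOSurface would still need to be measured on the hardware in question.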

lunixbochs commented 1 year ago

ideally these are also tested on apple silicon, to see how it behaves with the unified gpu memory

LoganDark commented 1 year ago

ideally these are also tested on apple silicon, to see how it behaves with the unified gpu memory

Of course, I was assuming that @ids1024 (or someone else) would get back to me with comparisons on ASi to see if iosurface-wip really is the best choice for both, but it seems like that hasn't happened yet.

lunixbochs commented 1 year ago

what's the easiest way to repro your test?

LoganDark commented 1 year ago

what's the easiest way to repro your test?

instrument the code with some Instant::now()s, then eprintln!("took {}us", (b - a).as_micros()); at the end of the frame. I don't have an exact diff
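Since there is no exact diff, the instrumentation described here might look roughly like the following. This is a minimal sketch: `render_frame` is a hypothetical stand-in for whatever per-frame work is being measured, not code from softbuffer or its examples.

```rust
use std::time::Instant;

/// Stand-in for the per-frame work being measured (hypothetical).
fn render_frame(buf: &mut [u32]) {
    for (i, pixel) in buf.iter_mut().enumerate() {
        *pixel = i as u32;
    }
}

fn main() {
    let mut buf = vec![0u32; 1920 * 1080];
    for _ in 0..3 {
        let a = Instant::now();
        render_frame(&mut buf);
        let b = Instant::now();
        // Report the elapsed time for this frame in microseconds.
        eprintln!("took {}us", (b - a).as_micros());
    }
}
```

Building with `--release` is advisable for numbers like these, since debug builds can make the fill loop dominate the measurement.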

lmglmg commented 10 months ago

On the master branch, the winit example consumes a very large amount of memory when continuously resized. This issue seems to be fixed on the iosurface-wip branch. I tested this on an M1 Mac.