treeform / pixie

Full-featured 2d graphics library for Nim.
MIT License

add switch to use BGRA color byte order in image #360

Closed: levovix0 closed this issue 2 years ago

levovix0 commented 2 years ago

BGRA byte order is the standard for software rendering into a window.

guzba commented 2 years ago

Hello and thanks for the suggestion but I do not anticipate us adding this.

It would require writing a second version of every single draw, blend, and SIMD codepath to support the different byte order. Further, it doesn't actually matter. On Windows, you should be using OpenGL, DirectX, etc. for all screen compositing (this is actually true on all operating systems), not trying to poke bytes into the old win32 API.

Maybe a function that re-orders a Pixie Image into a seq[ColorBGRA] for compatibility isn't crazy though.
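
Something like this minimal sketch, perhaps (ColorBGRA and toBgra are hypothetical names here, not pixie API; note the output stays premultiplied, like pixie's internal data):

```nim
import pixie

# Hypothetical compatibility helper: re-order a premultiplied RGBX image
# into BGRA byte order for APIs that want it.
type ColorBGRA = object
  b, g, r, a: uint8

proc toBgra(image: Image): seq[ColorBGRA] =
  result = newSeq[ColorBGRA](image.data.len)
  for i, px in image.data: # Image.data is seq[ColorRGBX]
    result[i] = ColorBGRA(b: px.b, g: px.g, r: px.r, a: px.a)
```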

treeform commented 2 years ago

I would call BGRA standard for old win32 APIs, but that's about it.

We are pretty set on the RGBA byte order. GPUs take any. And what @guzba said.

Closing this for now. Feel free to reopen if I am missing something.

levovix0 commented 2 years ago

On Windows, you should be using OpenGL, DirectX, etc. for all screen compositing (this is actually true on all operating systems)

That's not the case for lightweight applications: just creating a GPU rendering context for a window takes ~50 MB of RAM and slightly, but noticeably, slows down opening the window.

Maybe a function that re-orders a Pixie Image into a seq[ColorBGRA] for compatibility isn't crazy though.

Using such a function for compatibility slowed down my image-displaying window procedure by a factor of ~1000! It took half of the whole draw time in my case.

treeform commented 2 years ago

I think Windows APIs mostly take BGR without alpha, so we would need a Pixie Image to seq[ColorBGR] converter; we can't drop alpha from our spec just to be compatible with old Windows APIs. I am sure an RGBA to BGR converter can be made fast with SIMD.

levovix0 commented 2 years ago

I think Windows APIs mostly take BGR without alpha

Fortunately not, they either ignore the fourth byte or use it as an alpha channel.

guzba commented 2 years ago

That's not the case for lightweight applications: just creating a GPU rendering context for a window takes ~50 MB of RAM and slightly, but noticeably, slows down opening the window.

In my opinion this is not a good argument for the amount of code you're requesting we write and support forever: 1) only super lightweight applications would even want this; 2) it is a tiny bit slower only for these very basic apps (measure this to quantify it); 3) the ~50 MB of RAM is probably allocated anyway, just attributed to the OS instead of your app, since the pixel bytes you send to Windows end up on the GPU anyway; you're just having Windows do it instead of doing it yourself.

guzba commented 2 years ago

I also doubt the 50 MB number tbh. I can open a new blank window on Windows 10 with an OpenGL context and it takes ~19 MB. See windy examples/basic.nim.

treeform commented 2 years ago

Fortunately not, they either ignore the fourth byte or use it as an alpha channel.

I am not well versed in low-level win32 APIs. Here is an app that is plain win32, no OpenGL. It creates a BITMAPINFO and puts the pixie image into it. The BITMAPINFO wants BGR, not BGRA, so you still need a conversion to pack the pixels without alpha:

https://github.com/treeform/pixie/blob/master/examples/realtime_win32.nim#L49-L62

Most APIs want BITMAPINFO in BGR format.
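
For illustration, a sketch of that kind of pack step (the helper name is hypothetical; it assumes a 24-bpp DIB, whose rows must be padded to DWORD boundaries):

```nim
import pixie

# Hypothetical helper: pack an Image into 24-bpp BGR rows for a BITMAPINFO
# that wants BGR without alpha. DIB rows are padded to 4-byte boundaries.
# Pixie's data is premultiplied, so this is only lossless for opaque images.
proc toBgr24(image: Image): seq[uint8] =
  let rowBytes = (image.width * 3 + 3) and not 3
  result = newSeq[uint8](rowBytes * image.height)
  for y in 0 ..< image.height:
    for x in 0 ..< image.width:
      let px = image.data[y * image.width + x]
      let o = y * rowBytes + x * 3
      result[o + 0] = px.b
      result[o + 1] = px.g
      result[o + 2] = px.r
```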

treeform commented 2 years ago

Also see: https://github.com/treeform/pixie/blob/master/examples/realtime_wnim.nim#L55-L56

treeform commented 2 years ago

I do think we should have easy RGBA to/from BGR (and maybe BGRA) converters for legacy win32 APIs.

levovix0 commented 2 years ago

I can open a new blank window on Windows 10 with an OpenGL context and it takes ~19MB.

On Linux/X11 it takes ~60 MB (for windy and for GLUT); a window without an OpenGL context takes ~5 MB.

the ~50 MB of RAM is probably allocated anyway, just attributed to the OS instead of your app, since the pixel bytes you send to Windows end up on the GPU anyway; you're just having Windows do it instead of doing it yourself

I don't think the compositor creates a GPU context for every window we create; sending an image to the GPU and creating a GPU context are different operations.

https://github.com/treeform/pixie/blob/master/examples/realtime_win32.nim#L49-L62

In this code the pixel alignment is 4 bytes; the 4th byte is ignored by winapi.

https://github.com/treeform/pixie/blob/master/examples/realtime_wnim.nim#L55-L56

This code uses ARGB. I can't test it on Windows right now, but if I do the same thing on Linux/X11, XPutImage will just swap the bytes itself; see https://github.com/FolxTeam/plainwindy/blob/3c02cee3c61c8d85bf2a69bef2d3388d333dff50/src/plainwindy/platforms/linux/x11.nim#L323 and https://github.com/FolxTeam/plainwindy/blob/3c02cee3c61c8d85bf2a69bef2d3388d333dff50/src/plainwindy/platforms/linux/x11.nim#L281

the amount of code you're requesting

Not too much: I have forked chroma and pixie and changed the byte order to BGRA for tests (see https://github.com/FolxTeam/chroma/commit/a96d428356798d90dc0b468ab554fd5a5aa90236 and https://github.com/FolxTeam/pixie/commit/aaed41f4dc69e13912e2044851d2f0ab04fa2db3). I just put when defined(someFlag) in the places that were edited and needed it (the use case is really too specific to do something more complex than this). If that is still too much, please just add when defined(someFlag) to the ColorRGB, ColorRGBA, and ColorRGBX objects.
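
At the object level, the idea is just this (the flag name is a placeholder):

```nim
# Sketch of the when-defined approach: swap the field order at compile time.
when defined(pixieBgra):
  type ColorRGBX* = object
    b*, g*, r*, a*: uint8
else:
  type ColorRGBX* = object
    r*, g*, b*, a*: uint8
```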

guzba commented 2 years ago

Not too much: I have forked chroma and pixie and changed the byte order to BGRA for tests

You have not touched any of the SIMD, nor any of the other operations that can happen with an image or mask. This is not even close to how much code will need to be written. Does every single test run by nimble test pass without any of the output images changing?

levovix0 commented 2 years ago

Oh, it seems not. I didn't know that your code was so dependent on byte order.

guzba commented 2 years ago

I don't think the compositor creates a GPU context for every window we create; sending an image to the GPU and creating a GPU context are different operations.

For GPU-accelerated desktop compositing, everything must be on the GPU. If a desktop is not doing GPU compositing, that'd be something maybe Linux can do? Certainly not modern Mac or Windows. Either way, this isn't really a big deal.

Take a step back and consider that there are only a couple of scenarios that matter:

1) You want a window without a GPU context.
2) You want to send BGRA directly to a system API.

If 2) is what you care about (it sounds like it), you'll either be updating the window frequently or infrequently.

If frequently (like every frame), this is a very bad idea. It is not wise to take on tons of CPU compute to save a tiny bit of RAM. You'll make computers sound like jets trying to keep the CPU cool. Look at Boxy and break things into discrete composed parts/rectangles/widgets that can be individually rendered with Pixie only when needed, then composited by the GPU and build up from there. (Even after doing this, you should still only update the window when something changes.)

If infrequently, a fast converter to BGRA is going to be irrelevant to performance and is obviously the better idea. It's ~20 lines of code (even with SIMD, which will drastically increase perf) and won't even be measurable if you only re-draw the window when something changes (which may not happen for seconds, minutes, or even hours at a time).
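
For scale, a rough sketch of such a converter using nimsimd (names and details here are illustrative, untested):

```nim
import nimsimd/[sse2, ssse3]

# Swizzle [R,G,B,A] -> [B,G,R,A] in place with an SSSE3 byte shuffle,
# 4 pixels (16 bytes) per iteration, plus a scalar tail.
proc rgbaToBgraInPlace(data: ptr UncheckedArray[uint8], byteLen: int) =
  var maskBytes = [2'u8, 1, 0, 3, 6, 5, 4, 7, 10, 9, 8, 11, 14, 13, 12, 15]
  let mask = mm_loadu_si128(maskBytes[0].addr)
  var i = 0
  while i + 16 <= byteLen:
    let p = data[i].addr
    mm_storeu_si128(p, mm_shuffle_epi8(mm_loadu_si128(p), mask))
    i += 16
  while i + 4 <= byteLen: # remaining pixels: swap R and B
    swap(data[i], data[i + 2])
    i += 4
```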

guzba commented 2 years ago

There's a whole additional level of pain here I just realized.

Pixie uses premultiplied alpha (RGBX), which is required for performant compositing. I assume the system APIs want straight-alpha (not premultiplied) BGRA? If that's the case, we would need premultiplied-alpha BGRA (BGRX?) and a converter to straight-alpha BGRA.

Converting back to straight alpha is always horribly slow. There's no good way to make these legacy APIs happy other than simply not using them.
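
Concretely: each premultiplied channel stores c * a / 255, so recovering the straight value costs a divide per channel on every non-opaque pixel. A minimal scalar sketch:

```nim
# Recover a straight-alpha channel value from a premultiplied one.
# The per-channel integer divide is what makes this much slower than
# a pure byte shuffle.
proc unpremultiply(c, a: uint8): uint8 =
  if a == 0 or a == 255:
    c # fully transparent or fully opaque: nothing to undo
  else:
    uint8(min(c.int * 255 div a.int, 255))
```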

If you care about an additional 40-50 MB of RAM given the compute trade-off, that seems unexpected to me. When is that little bit of RAM critical to save while you also have tons of spare compute to spend on the CPU straight-alpha conversion?

Unless you can always assume opaque (a = 255) Images and thus can skip the conversion to straight alpha, but we cannot assume that at the library level.