Closed: levovix0 closed this issue 2 years ago
Hello and thanks for the suggestion but I do not anticipate us adding this.
It would require writing a second version of every single draw, blend, and SIMD codepath to support the different byte order. Further, it doesn't actually matter: on Windows, you should be using OpenGL, DirectX, etc. for all screen compositing (actually, this is true on all operating systems), not trying to poke bytes into the old Win32 API.
Maybe a function that re-orders a Pixie Image into a seq[ColorBGRA] for compatibility isn't crazy though.
I would call BGRA standard for old Win32 APIs, but that's about it.
We are pretty set on the RGBA byte order. GPUs accept any order. And what @guzba said.
Closing this for now. Feel free to reopen if I am missing something.
On Windows, you should be using OpenGL, DirectX or etc for all screen compositing (actually, this is true on all operating systems)
Not the case for lightweight applications: just creating a GPU rendering context for a window takes ~50 MB of RAM and slightly, but noticeably, slows down opening the window.
Maybe a function that re-orders a Pixie Image into a seq[ColorBGRA] for compatibility isn't crazy though.
Using this function for compatibility slowed down displaying the image in the window procedure by ~1000x, and took half of the whole draw time in my case.
I think Windows APIs mostly take BGR without alpha, so we'd need a Pixie Image to seq[ColorBGR] converter; we can't drop alpha from our spec just to be compatible with old Windows APIs. I am sure an RGBA to BGR converter can be made faster with SIMD.
I think windows APIs mostly take BGR without alpha
Fortunately not, they either ignore the fourth byte or use it as an alpha channel.
Not the case for lightweight applications: just creating a GPU rendering context for a window takes ~50 MB of RAM and slightly, but noticeably, slows down opening the window.
In my opinion this is not a good argument for the amount of code you're requesting we write and support forever: 1) only super lightweight applications would even want this; 2) it is only a tiny bit slower for these very basic apps (measure this to quantify it); 3) the 50 MB of RAM is probably allocated anyway, just assigned to the OS instead of your app, since the pixel bytes you send to Windows end up on the GPU anyway; you're just having Windows do it instead of doing it yourself.
I also doubt the 50 MB number, tbh. I can open a new blank window on Windows 10 with an OpenGL context and it takes ~19 MB. See windy's examples/basic.nim.
Fortunately not, they either ignore the fourth byte or use it as an alpha channel.
I am not well versed in low-level Win32 APIs. Here is an app that is plain Windows, no OpenGL. It creates a BITMAPINFO and puts the Pixie image into it. The BITMAPINFO wants BGR, not BGRA, so you still need a conversion to pack the pixels without alpha:
https://github.com/treeform/pixie/blob/master/examples/realtime_win32.nim#L49-L62
Most APIs want BITMAPINFO in BGR format.
I do think we should have easy RGBA to/from BGR (and maybe BGRA) converters for legacy Win32 APIs.
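Such a converter has one non-obvious wrinkle worth noting: Windows 24bpp DIB rows must be padded to a 4-byte boundary. Below is a minimal scalar sketch in C for illustration; the function name and layout assumptions are hypothetical, not Pixie's actual API.

```c
#include <stdint.h>
#include <stdlib.h>

/* Convert a tightly packed RGBA image to a 24-bit BGR buffer with each
 * row padded to a 4-byte boundary, as a 24bpp Windows DIB expects.
 * The caller frees the returned buffer. */
uint8_t *rgba_to_bgr_dib(const uint8_t *rgba, int width, int height) {
    int stride = (width * 3 + 3) & ~3;           /* round row size up to 4 bytes */
    uint8_t *bgr = calloc((size_t)stride * (size_t)height, 1);
    if (!bgr) return NULL;
    for (int y = 0; y < height; y++) {
        const uint8_t *src = rgba + (size_t)y * (size_t)width * 4;
        uint8_t *dst = bgr + (size_t)y * (size_t)stride;
        for (int x = 0; x < width; x++) {
            dst[x * 3 + 0] = src[x * 4 + 2];     /* B */
            dst[x * 3 + 1] = src[x * 4 + 1];     /* G */
            dst[x * 3 + 2] = src[x * 4 + 0];     /* R */
            /* alpha (src[x * 4 + 3]) is dropped */
        }
    }
    return bgr;
}
```

The inner loop is trivially vectorizable, which is why a SIMD version stays small.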
I can open a new blank window on Windows 10 with an OpenGL context and it takes ~19MB.
On Linux/X11 it takes ~60 MB (for windy and for glut); a window without an OpenGL context takes ~5 MB.
the 50 MB of RAM is probably allocated anyway, just assigned to the OS instead of your app, since the pixel bytes you send to Windows end up on the GPU anyway; you're just having Windows do it instead of doing it yourself
I don't think the compositor creates a GPU context for every window we create; sending an image to the GPU and creating a GPU context are different operations.
https://github.com/treeform/pixie/blob/master/examples/realtime_win32.nim#L49-L62
In this code the alignment is 4 bytes; the 4th byte is ignored by the WinAPI.
https://github.com/treeform/pixie/blob/master/examples/realtime_wnim.nim#L55-L56
This code uses ARGB. I can't test it on Windows right now, but if I do the same thing on Linux/X11, XPutImage will just swap the bytes itself; see https://github.com/FolxTeam/plainwindy/blob/3c02cee3c61c8d85bf2a69bef2d3388d333dff50/src/plainwindy/platforms/linux/x11.nim#L323 and https://github.com/FolxTeam/plainwindy/blob/3c02cee3c61c8d85bf2a69bef2d3388d333dff50/src/plainwindy/platforms/linux/x11.nim#L281
the amount of code you're requesting
Not too much; I have forked chroma and pixie and changed the byte order to BGRA for tests (see https://github.com/FolxTeam/chroma/commit/a96d428356798d90dc0b468ab554fd5a5aa90236 and https://github.com/FolxTeam/pixie/commit/aaed41f4dc69e13912e2044851d2f0ab04fa2db3), just putting `when defined(someFlag)` in the places where edits were needed (the use case is really too specific to do something more complex than this).
If that is still too much, please just add `when defined(someFlag)` to the ColorRGB, ColorRGBA, and ColorRGBX objects.
Not too much; I have forked chroma and pixie and changed the byte order to BGRA for tests
You have not touched any of the SIMD code, nor any of the other operations that can happen with an image or mask. This is not even close to how much code would need to be written. Does every single test run by `nimble test` pass without any of the output images changing?
Oh, it seems not. I didn't know your code was so dependent on byte order.
I don't think the compositor creates a GPU context for every window we create; sending an image to the GPU and creating a GPU context are different operations.
For GPU-accelerated desktop compositing, everything must be on the GPU. If a desktop is not doing GPU compositing, that'd be something maybe Linux can do? Certainly not modern Mac or Windows. Either way, this isn't really a big deal.
Take a step back and consider there are only a couple scenarios that can matter:
1) you want a window without a GPU context, or 2) you want to send BGRA directly to a system API.
If 2) is what you care about (it sounds like it is), you'll either be updating the window frequently or infrequently.
If frequently (like every frame), this is a very bad idea. It is not wise to take on tons of CPU compute to save a tiny bit of RAM. You'll make computers sound like jets trying to keep the CPU cool. Look at Boxy and break things into discrete composed parts/rectangles/widgets that can be individually rendered with Pixie only when needed, then composited by the GPU and build up from there. (Even after doing this, you should still only update the window when something changes.)
If infrequently, a fast converter to BGRA is going to be irrelevant to performance and is obviously the better idea. It's like ~20 lines of code (even with SIMD, which will drastically increase perf) and won't even be measurable if you only redraw the window when something changes (which may not happen for seconds or minutes or even hours at a time).
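The core of such a converter is just swapping the R and B bytes of each 32-bit pixel. A minimal scalar sketch in C (the function name is hypothetical, not Pixie's API):

```c
#include <stdint.h>
#include <stddef.h>

/* Swap the R and B channels of 32-bit pixels in place, converting
 * RGBA <-> BGRA. The same reordering maps to a single SSSE3 pshufb
 * (or NEON table lookup) per 4 pixels, which is why a SIMD version
 * is only a few lines more. */
void swap_rb(uint8_t *pixels, size_t count) {
    for (size_t i = 0; i < count; i++) {
        uint8_t tmp = pixels[i * 4 + 0];
        pixels[i * 4 + 0] = pixels[i * 4 + 2];
        pixels[i * 4 + 2] = tmp;
    }
}
```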
There's a whole additional level of pain here I just realized.
Pixie uses premultiplied alpha (RGBX), which is required for performant compositing. I assume the system APIs want straight-alpha, not premultiplied, BGRA? In that case, we would need premultiplied-alpha BGRA (BGRX?) and a converter to straight BGRA.
Converting back to straight alpha is always horribly slow. There's no good way to make these legacy APIs happy other than simply not using them.
If you care about an additional 40-50 MB of RAM compared to the compute trade-off, that seems unexpected to me. When is that little bit of RAM critical to save while you also have tons of spare compute to take on the CPU straight-alpha conversion?
Unless you can always assume opaque (a = 255) images and thus can skip the conversion to straight alpha, but we cannot assume that at the library level.
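To make the cost concrete, here is a minimal scalar sketch (in C, with a hypothetical function name) of premultiplied RGBA to straight-alpha BGRA; the per-pixel integer divide is exactly what makes un-premultiplying slow:

```c
#include <stdint.h>
#include <stddef.h>

/* Convert premultiplied-alpha RGBA to straight-alpha BGRA.
 * Assumes valid premultiplied input (each channel <= alpha).
 * Fully opaque (a == 255) and fully transparent pixels skip
 * the divide and only need the R/B swap. */
void premul_rgba_to_straight_bgra(const uint8_t *src, uint8_t *dst,
                                  size_t count) {
    for (size_t i = 0; i < count; i++) {
        uint8_t r = src[i * 4 + 0], g = src[i * 4 + 1];
        uint8_t b = src[i * 4 + 2], a = src[i * 4 + 3];
        if (a != 0 && a != 255) {         /* undo the premultiply, rounding */
            r = (uint8_t)((r * 255 + a / 2) / a);
            g = (uint8_t)((g * 255 + a / 2) / a);
            b = (uint8_t)((b * 255 + a / 2) / a);
        }
        dst[i * 4 + 0] = b;               /* swap to BGRA on the way out */
        dst[i * 4 + 1] = g;
        dst[i * 4 + 2] = r;
        dst[i * 4 + 3] = a;
    }
}
```

Note the opaque-image special case from the comment above: when every alpha is 255, the whole conversion collapses to the cheap byte swap.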
BGRA byte order is standard for software rendering on Windows.