Fast gamma correction routines

LoganDark commented 2 years ago

sRGB needs to be gamma-corrected in order to perform correct blending. Right now I'm using a home-grown implementation that has to convert to floating-point in order to do the transfer function, but I would love to have fast pure-int or SIMD routines. Is that in scope for this library?

thomcc commented 2 years ago

Hmm, I think you might be either confused, or perhaps you're in a weird situation. That is, your question seems a bit odd because sRGB already contains a gamma transfer function (nominally gamma = 2.2).

So while "sRGB needs to be gamma-corrected in order to perform correct blending" is essentially accurate, this is already what this library provides.

So, can you elaborate a bit on what you're doing and looking for?

LoganDark commented 2 years ago

Sorry, let me try to clarify. This library currently contains routines for converting between sRGB8 uints and linear-RGB floats. I'm asking if converting between sRGB8 and Linear RGB8 (or RGB16) (not floats) is in-scope. My use-case is an alpha-blending function which needs to input/output sRGB8 but do the actual blending in Linear RGB. Conversion to and from float makes blending slower by about an order of magnitude according to my benchmarks.

Gamma 2.2 is not quite the correct curve, so I don't use it (I use the proper sRGB transfer function). Currently my int-only blending function takes about 5 ns per operation, but I wonder if using SIMD (instead of blending each component in isolation) would make it faster. Problem is, I have to use lookup tables right now and those don't exactly SIMD lol. (Plus they add about 64 KB to binary size.)
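For readers following along, here is a minimal sketch of the kind of integer blend described above: decode sRGB8 to linear 16-bit through a table, mix in linear space, and encode back through a second table. It is not LoganDark's code and not part of fast-srgb8. The `BlendTables` type, its methods, straight (non-premultiplied) alpha, and an opaque destination are all assumptions made for illustration.

```rust
/// Standard piecewise sRGB decoding curve (encoded 0.0..=1.0 to linear 0.0..=1.0).
fn srgb_to_linear(c: f64) -> f64 {
    if c <= 0.04045 { c / 12.92 } else { ((c + 0.055) / 1.055).powf(2.4) }
}

/// Inverse of `srgb_to_linear` (linear 0.0..=1.0 to encoded 0.0..=1.0).
fn linear_to_srgb(l: f64) -> f64 {
    if l <= 0.0031308 { l * 12.92 } else { 1.055 * l.powf(1.0 / 2.4) - 0.055 }
}

/// Hypothetical lookup tables for illustration; not an API of fast-srgb8.
struct BlendTables {
    decode: [u16; 256], // sRGB8 -> linear u16, 512 bytes
    encode: Vec<u8>,    // linear u16 -> sRGB8, 65536 bytes
}

impl BlendTables {
    /// Floating point is used only here, while building the tables.
    fn new() -> Self {
        let mut decode = [0u16; 256];
        for (c, slot) in decode.iter_mut().enumerate() {
            *slot = (srgb_to_linear(c as f64 / 255.0) * 65535.0 + 0.5) as u16;
        }
        let encode = (0..=65535u32)
            .map(|v| (linear_to_srgb(v as f64 / 65535.0) * 255.0 + 0.5) as u8)
            .collect();
        BlendTables { decode, encode }
    }

    /// Blends one channel of `src` over an opaque `dst` with coverage `alpha` (0..=255).
    fn blend(&self, src: u8, dst: u8, alpha: u8) -> u8 {
        let s = self.decode[src as usize] as u32;
        let d = self.decode[dst as usize] as u32;
        let a = alpha as u32;
        // Integer mix in linear space, rounded to nearest; the sum fits in u32.
        let lin = (s * a + d * (255 - a) + 127) / 255;
        self.encode[lin as usize]
    }
}
```

The 65536-entry encode table is in line with the roughly 64 KB of tables mentioned above. Blending white over black at `alpha = 128` with this sketch gives a result noticeably brighter than the 128 that a naive blend in sRGB space would produce.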

thomcc commented 2 years ago

Hmm, linear RGB8 is very low precision, to the point where it tends to be a mistake (this is why sRGB exists, after all). I think I would discourage using it, and would probably not want to include it, since people have a hard enough time figuring out what to do.

By RGB16, you mean each channel as a number in 0..=u16::MAX (e.g. as 16-bit UNORM or similar)? I see. There are good use cases for 10 bpc linear RGB and the like, too (if you instead mean half-precision floats, I'd have to think about it). You can pretty easily do this by quantizing the output of srgb8_to_f32 to, say, 16 bits.

Doing this correctly is something people tend to get wrong (usually by mixing up their conventions so centered quantization is done at one end and floored at the other), and the correct implementation is not much code, so I'm not opposed to it. But, I would likely be against trying to implement it without performing any floating point operations, since it's not worth it.

That said, none of this involves transfer functions, so I'm still confused as to what you expect this to look like. (Also, you are correct that a gamma = 2.2 curve is not exactly the same as the sRGB transfer function, but the sRGB transfer function still converts from a gamma-encoded space to a linear-light space.)
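To make the quantization point above concrete, here is a hedged sketch of one consistent convention: round to nearest when going from float to integer, and divide by the maximum code when going back. The function names are hypothetical, and `srgb8_to_linear_f32` is only a self-contained stand-in for the crate's `srgb8_to_f32`, written out so the example compiles on its own.

```rust
/// Stand-in for fast-srgb8's `srgb8_to_f32`, using the standard piecewise sRGB curve.
fn srgb8_to_linear_f32(c: u8) -> f32 {
    let x = c as f32 / 255.0;
    if x <= 0.04045 { x / 12.92 } else { ((x + 0.055) / 1.055).powf(2.4) }
}

/// Quantizes 0.0..=1.0 onto 0..=max, rounding to the nearest code.
/// `max` is 65535 for 16-bit UNORM or 1023 for 10 bpc.
fn quantize_unorm(x: f32, max: u32) -> u32 {
    (x.clamp(0.0, 1.0) * max as f32 + 0.5) as u32
}

/// Matching inverse: code `v` maps back to `v / max`, so 0 and `max`
/// land exactly on 0.0 and 1.0. Pairing this with a floored quantizer
/// (or the reverse) is the kind of mixed convention that causes drift.
fn dequantize_unorm(v: u32, max: u32) -> f32 {
    v as f32 / max as f32
}

/// sRGB8 to linear 16-bit UNORM, by quantizing the float result.
fn srgb8_to_linear_u16(c: u8) -> u16 {
    quantize_unorm(srgb8_to_linear_f32(c), 65535) as u16
}
```

With both directions following the same convention, dequantizing a code and quantizing it again returns the same code, which is the property that is easy to lose when the two ends disagree.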

LoganDark commented 2 years ago

> Hmm, linear RGB8 is very low precision, to the point where it tends to be a mistake

Yeah, converting to Linear RGB8 would completely clobber the whole point of having any gamma curve at all, which is why I currently use RGB16 as the intermediate.

Perhaps using something in the middle like RGB12 would significantly reduce the size of the lookup table required. Still a LUT though.
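As a rough way to explore that trade-off, the sketch below builds decode and encode tables for an arbitrary linear bit depth and checks whether every sRGB8 code survives a round trip. The helper names are hypothetical and the rounding convention (round to nearest in both directions) is an assumption; it supports bit depths up to 16.

```rust
/// Standard piecewise sRGB decoding curve.
fn srgb_to_linear(c: f64) -> f64 {
    if c <= 0.04045 { c / 12.92 } else { ((c + 0.055) / 1.055).powf(2.4) }
}

/// Standard piecewise sRGB encoding curve.
fn linear_to_srgb(l: f64) -> f64 {
    if l <= 0.0031308 { l * 12.92 } else { 1.055 * l.powf(1.0 / 2.4) - 0.055 }
}

/// Builds (decode, encode) tables for a linear intermediate of `bits` bits (<= 16).
/// The encode table has 2^bits one-byte entries, so it dominates the size.
fn build_tables(bits: u32) -> (Vec<u16>, Vec<u8>) {
    let max = ((1u32 << bits) - 1) as f64;
    let decode: Vec<u16> = (0..256u32)
        .map(|c| (srgb_to_linear(c as f64 / 255.0) * max + 0.5) as u16)
        .collect();
    let encode: Vec<u8> = (0..(1u32 << bits))
        .map(|v| (linear_to_srgb(v as f64 / max) * 255.0 + 0.5) as u8)
        .collect();
    (decode, encode)
}

/// True if every sRGB8 code decodes and re-encodes to itself at this bit depth.
fn roundtrips(bits: u32) -> bool {
    let (decode, encode) = build_tables(bits);
    (0..256usize).all(|c| encode[decode[c] as usize] as usize == c)
}
```

Under these assumptions, a 12-bit intermediate needs a 4096-byte encode table instead of 65536 bytes and still round-trips every sRGB8 code, while an 8-bit linear intermediate does not.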

> By RGB16, you mean like with each channel as a number between 0..=u16::MAX (e.g. as 16 bit UNORM or similar)?

Yep.

> That said none of this involves transfer functions

It kinda does, the process of converting sRGB to linear RGB (and vice versa) requires a transfer function. Without one, it would be as simple as channel as f32 / 255.0, but that would just convert to floating-point sRGB, not linear RGB. (Maybe it would be simpler to pretend that I never mentioned transfer functions.)
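The distinction drawn here can be shown in a few lines. This is an illustrative sketch with hypothetical function names, using the standard piecewise sRGB curve: the first function only rescales the byte, the second also applies the transfer function.

```rust
/// Rescales 0..=255 to 0.0..=1.0. The result is still sRGB-encoded.
fn srgb8_to_normalized(c: u8) -> f32 {
    c as f32 / 255.0
}

/// Rescales and then applies the sRGB decoding curve, giving linear light.
fn srgb8_to_linear(c: u8) -> f32 {
    let x = srgb8_to_normalized(c);
    if x <= 0.04045 {
        x / 12.92
    } else {
        ((x + 0.055) / 1.055).powf(2.4)
    }
}
```

For the mid-gray code 128, the normalized value is about 0.502 while the linear value is about 0.216, which is why blending the normalized values directly gives results that are too dark.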

thomcc commented 2 years ago

Okay, I think I understand. I think the answer is: yes, this is largely in scope, for the right API and a reasonable implementation that justifies its own inclusion. I don't think I have time to work on this, though.

IME you'll have a very hard time beating array lookup for something like this, though. It doesn't SIMD well, but if the table is in the cache (and for a 256-element table with a hot conversion loop it should be, often enough) it will be very cheap. That said, I guess you indicated you are seeing this as a bottleneck already.

LoganDark commented 2 years ago

> IME you'll have a very hard time beating array lookup for something like this though. It doesn't SIMD well, but if the table is in the cache (and for a 256 element table with a hot conversion loop it should be often enough) it will be very cheap. That said, I guess you indicated you are seeing this as a bottleneck already.

I can't tell if it's still a bottleneck. Eventually the overhead of shaping and rasterizing strings of text every frame will outweigh the overhead of blending it all. I've already observed a huge perf increase just by optimizing the blending function so I'm trying to gain as much as I can out of that before shifting my focus. With that said, I don't have access to a working profiler at the moment (https://github.com/flamegraph-rs/flamegraph/issues/207) so all I can do is see what effect changing the code has.

thomcc / fast-srgb8
