`GLES3` shadeless colors are incorrect compared to `Vulkan` [3D]

WhalesState commented 2 months ago

Tested versions

4.3 stable

System information

Windows 10

Issue description

The left side is a SubViewport containing a MeshInstance3D plane. The right side is a Sprite2D.

GLES3: [screenshot: GLES3]

Vulkan: [screenshot: vulkan]

Steps to reproduce

N/A

Minimal reproduction project (MRP)

minimal_reproduction_project.zip

clayjohn commented 2 months ago

I suspect this comes from the sRGB conversion approximation we use in the compatibility renderer. We use a version of the sRGB conversion function that is much cheaper to compute, but slightly less accurate. The difference is very subtle, which makes me think it isn't something else more fundamental.

WhalesState commented 2 months ago

What's the cost of using the accurate conversion?

I see the approximation is used only in tonemap_inc.glsl, copy.glsl and cubemap_filter.glsl, while canvas.glsl and copy_to_fb.glsl use the accurate conversion.

// Approximation from http://chilliant.blogspot.com/2012/08/srgb-approximations-for-hlsl.html
vec3 srgb_to_linear(vec3 color) {
    return color * (color * (color * 0.305306011 + 0.682171111) + 0.012522878);
}

// Accurate
vec3 srgb_to_linear(vec3 color) {
    return mix(pow((color.rgb + vec3(0.055)) * (1.0 / (1.0 + 0.055)), vec3(2.4)), color.rgb * (1.0 / 12.92), lessThan(color.rgb, vec3(0.04045)));
}
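To put a number on "slightly less accurate", the two variants above can be compared per channel outside the engine. This is only a rough sketch in Python (not GLSL), using the same constants as the two snippets; the function names and the 256-step sampling are assumptions made for illustration, not anything from the Godot source.

```python
def srgb_to_linear_accurate(c: float) -> float:
    """Piecewise sRGB decoding, same constants as the 'Accurate' snippet."""
    if c < 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def srgb_to_linear_approx(c: float) -> float:
    """Cubic polynomial approximation, same coefficients as the first snippet."""
    return c * (c * (c * 0.305306011 + 0.682171111) + 0.012522878)


def max_abs_error(samples: int = 256) -> tuple:
    """Return (largest absolute error, sRGB input where it occurs) over [0, 1]."""
    worst_err, worst_c = 0.0, 0.0
    for i in range(samples):
        c = i / (samples - 1)
        err = abs(srgb_to_linear_approx(c) - srgb_to_linear_accurate(c))
        if err > worst_err:
            worst_err, worst_c = err, c
    return worst_err, worst_c


err, at = max_abs_error()
print(f"max abs error in linear space: {err:.5f} at sRGB input {at:.3f}")
```

The absolute error in linear space stays small, but it is largest relative to the true value in the dark end of the range, which is where a mismatch would be easiest to see after converting back for display.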

If the accurate function is bad for performance, it could be made optional for Vulkan and GLES3, especially for low-end devices and weak GPUs.

Currently the 3D editor is very laggy on my potato PC when using Vulkan, but switching to D3D12 makes it much faster. Any performance optimization would be appreciated, especially for the editor. Thanks in advance and have a nice day.

WhalesState commented 2 months ago

I have asked Codeium about the cost, and this was the reply!

The cost of using this function to convert every pixel from sRGB to linear color space can be significant, especially for high-resolution images or real-time rendering applications.

Here's a breakdown of the operations involved in the function:

lessThan comparison: 1 operation
mix function: 1 operation
pow function: 1 operation (for the non-linear transformation)
Multiplications: 4 operations (2 for the linear transformation, 2 for the non-linear transformation)
Additions: 2 operations (1 for the linear transformation, 1 for the non-linear transformation)

Assuming a modern GPU or CPU, the cost of these operations can be estimated as follows:

lessThan comparison: ~1-2 clock cycles
mix function: ~2-4 clock cycles
pow function: ~10-20 clock cycles (depending on the implementation and precision)
Multiplications: ~2-4 clock cycles each
Additions: ~1-2 clock cycles each

Overall, the total cost of the function can be estimated to be around 20-50 clock cycles per pixel. This may not seem like a lot, but when applied to every pixel in an image, the cost can add up quickly.

For example, for a 1080p image (1920x1080 pixels), the total cost would be:

20-50 clock cycles/pixel x 2,073,600 pixels = 41,472,000 - 103,680,000 clock cycles

This can be a significant portion of the total processing time for an image, especially if the function is called repeatedly or in a performance-critical loop.

To mitigate this cost, some possible optimizations could be:

Using a lookup table (LUT) to precompute the sRGB to linear conversion for a range of values
Using a simpler, approximate conversion formula that is faster but less accurate
Using SIMD instructions (e.g. SSE, AVX) to perform the conversion in parallel for multiple pixels at once
Using a GPU-accelerated implementation of the function, which can take advantage of the GPU's massively parallel architecture to perform the conversion much faster.

WhalesState commented 2 months ago

I have tested both the cheap and the accurate functions in Godot on a 4096 texture and there was no noticeable difference; both were slow.

shader_type canvas_item;

uniform bool cheap = false;

vec3 srgb_to_linear(vec3 color) {
    return mix(pow((color + vec3(0.055)) * (1.0 / (1.0 + 0.055)), vec3(2.4)), color * (1.0 / 12.92), lessThan(color, vec3(0.04045)));
}

vec3 srgb_to_linear_cheap(vec3 color) {
    return color * (color * (color * 0.305306011 + 0.682171111) + 0.012522878);
}

vec3 linear_to_srgb(vec3 color) {
    return mix(color * 12.92, 1.055 * pow(color, vec3(1.0 / 2.4)) - 0.055, greaterThanEqual(color, vec3(0.0031308)));
}

vec3 linear_to_srgb_cheap(vec3 color) {
    return max(vec3(1.055) * pow(color, vec3(0.416666667)) - vec3(0.055), vec3(0.0));
}

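void fragment() {
    vec4 tex = texture(TEXTURE, UV);
    // Decode the sampled color with the selected variant.
    vec3 col = cheap ? srgb_to_linear_cheap(tex.rgb) : srgb_to_linear(tex.rgb);
    if (UV.x <= 0.33333) {
        // Left third: show the decoded (linear) value.
        COLOR.rgb = col;
    } else if (UV.x <= 0.66666) {
        // Middle third: round trip back to sRGB with the matching variant.
        COLOR.rgb = cheap ? linear_to_srgb_cheap(col) : linear_to_srgb(col);
    }
    // Right third: COLOR is not modified and serves as the reference.
}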
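The middle strip of the shader above is a round trip (sRGB to linear and back), so its deviation from the input can also be estimated without a GPU. The following is a rough Python sketch that mirrors the four functions from the shader one channel at a time; the helper name `max_round_trip_error` and the choice of 8-bit input levels are assumptions for illustration, and it says nothing about which conversions the renderer actually chains together.

```python
def srgb_to_linear(c: float) -> float:
    """Accurate piecewise decode, as in the shader."""
    return c / 12.92 if c < 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def srgb_to_linear_cheap(c: float) -> float:
    """Polynomial approximation, as in the shader."""
    return c * (c * (c * 0.305306011 + 0.682171111) + 0.012522878)


def linear_to_srgb(c: float) -> float:
    """Accurate piecewise encode, as in the shader."""
    return 1.055 * c ** (1.0 / 2.4) - 0.055 if c >= 0.0031308 else c * 12.92


def linear_to_srgb_cheap(c: float) -> float:
    """Power curve clamped at zero, as in the shader (no linear toe segment)."""
    return max(1.055 * c ** 0.416666667 - 0.055, 0.0)


def max_round_trip_error(to_linear, to_srgb, levels: int = 256) -> float:
    """Largest |to_srgb(to_linear(c)) - c| over evenly spaced inputs in [0, 1]."""
    worst = 0.0
    for i in range(levels):
        c = i / (levels - 1)
        worst = max(worst, abs(to_srgb(to_linear(c)) - c))
    return worst


accurate = max_round_trip_error(srgb_to_linear, linear_to_srgb)
cheap = max_round_trip_error(srgb_to_linear_cheap, linear_to_srgb_cheap)
print(f"accurate round trip: {accurate * 255:.4f} levels (of 255)")
print(f"cheap round trip:    {cheap * 255:.4f} levels (of 255)")
```

In this sketch the accurate pair returns the input up to floating-point rounding, while the cheap pair drifts by several 8-bit levels for very dark inputs, so any visible difference should show up in the shadows first.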

In some cases the accurate function can be faster than the cheaper one.

Codeium's reasoning:

Branch prediction: In the accurate function, the mix function is used to select between two different calculations based on the value of color. This can lead to better branch prediction, as the GPU can more easily predict which calculation to perform. In contrast, the cheaper function uses a series of multiplications and additions, which can lead to more unpredictable branching.
Instruction-level parallelism: The accurate function uses a single pow instruction, which can be executed in parallel with other instructions. In contrast, the cheaper function uses multiple multiplications and additions, which may not be able to be executed in parallel.
GPU architecture: Modern GPUs are designed to handle complex calculations like pow and mix more efficiently than simple arithmetic operations like multiplications and additions. This is because these complex calculations can be executed in parallel across multiple processing units.
Texture sampling: The accurate function only requires a single texture sample, whereas the cheaper function requires multiple texture samples (although this is not the case in your specific code). Reducing the number of texture samples can improve performance.

It's worth noting that the performance difference between the two functions may not be significant, and may vary depending on the specific hardware and use case. However, in general, it's not uncommon for more complex calculations to be faster than simpler ones on modern GPUs, due to the reasons mentioned above.

clayjohn commented 2 months ago

@WhalesState It's good to see you are taking an interest in performance. Unfortunately, you have been misled: most of what Codeium has told you is totally wrong, and much of it is self-contradictory. The second post in particular is 100% wrong.

srgb_to_linear and linear_to_srgb don't take a large part of the frame, so the impact from using the approximation is hard to measure. On desktop hardware I don't think you would be able to measure the difference. The optimization is more for mobile hardware, where extra instructions are costlier.
