Use float16 in fragment shader

rasmusgo commented 2 years ago

~Almost working as it should. There is a problem with specular highlight that I am trying to pin down at the moment.~

~I also made a convenience thing for updating the pbr rendering tests if UPDATE_IMAGES env var is truthy. That might belong in a separate PR.~
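For reference, a check of this kind could look roughly like the sketch below. This is not the code from the PR: the helper names `is_truthy` and `should_update_images` are hypothetical, and the definition of "truthy" (set, non-empty, and not one of `0`, `false`, `no`, `off`) is an assumption.

```rust
use std::env;

/// Interpret an optional environment-variable value as a boolean flag.
/// Assumption: "truthy" means set, non-empty after trimming, and not one of
/// "0", "false", "no" or "off" (case-insensitive).
fn is_truthy(value: Option<&str>) -> bool {
    match value.map(|v| v.trim().to_ascii_lowercase()) {
        None => false,
        Some(v) => !matches!(v.as_str(), "" | "0" | "false" | "no" | "off"),
    }
}

/// Hypothetical helper: true when the rendering tests should overwrite the
/// reference images instead of comparing against them.
fn should_update_images() -> bool {
    is_truthy(env::var("UPDATE_IMAGES").ok().as_deref())
}
```

With something like this in place, a test run started with `UPDATE_IMAGES=1` would regenerate the reference images, while an unset variable or `UPDATE_IMAGES=0` would keep the normal comparison.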

I have fixed several issues with the spotlights and I think this is starting to look good enough now. There is still a difference in how reflections of spotlights render on very shiny surfaces: the new code is more likely to show a reflection of the spotlight, but the reflection flickers. This is visible in the normal tangents test scene but not on the helmet.

I ran the many helmets benchmark but was surprised to see no difference in performance. Perhaps the bottleneck is somewhere else?

rasmusgo commented 2 years ago

The KHR_pipeline_executable_properties output for the pbr fragment shader, as reported by RenderDoc for Oculus, looks like this before (first dump) vs after (second dump):

======== Fragment Shader ========

Adreno Shader

==== Statistics ====

Subgroup Size: 64      // the subgroup size with which this executable is dispatched.
Instruction count all: 1940      // Total count of all shader instructions.  Complex shaders with high instruction counts may have long execution times.
ALU instruction count 32bit: 758      // Total count of all 32bit ALU shader instructions.
ALU instruction count 16bit: 0      // Total count of all 16bit ALU shader instructions.  Note: 16bit ALU instructions perform better and use less register space than 32bit instructions.
Complex instruction count: 66      // Total count of all complex instructions (sin, cos, etc).
Texture read instruction count: 35      // Total count of all texture read instructions.  Generally, VkImage reads, with or without a VkSampler.  Also includes input attachment reads.
Flow control instruction count: 82      // Total count of all flow control instructions.
Barrier and fence instruction count: 0      // Total count of all barrier and fence instructions.  Generally, op*Barrier instructions in the shader.
Short latency sync instruction count: 65      // Total count of all short latency sync instructions.
Long latency sync instruction count: 13      // Total count of all long latency sync instructions.
Full precision register footprint per shader instance: 13      // Number of 128bit registers used by each shader instance.  Each 128bit register may store 4 FP32 values.
Half precision register footprint per shader instance: 5      // Number of 64bit registers used by each shader instance.  Each 64bit register may store 4 FP16 values.
Overall register footprint per shader instance: 13      // Number of 128bit registers used by each shader instance.  Each 128bit register may store 4 FP32 values, or 8 FP16 values.
Scratch memory usage per shader instance: 0      // Number of 128bit slots of scratch memory used by each shader instance.  Warning: If the shader uses any scratch memory, it will perform poorly.
Output component count: 4      // Total count of all shader stage output components.
Input component count: 10      // Total count of all shader stage input components.
Shader processor utilization percentage: 25      // The maximum shader processor utilization for the shader.  Warning: If this number is low, the shader may perform poorly.
Memory read instruction count: 0      // Total count of all memory read instructions.  Generally, VkImage/VkBuffer reads through a storage descriptor.
Memory write instruction count: 0      // Total count of all memory write instructions.  Generally, VkImage/VkBuffer writes through a storage descriptor.

======== Fragment Shader ========

Adreno Shader

==== Statistics ====

Subgroup Size: 64      // the subgroup size with which this executable is dispatched.
Instruction count all: 2016      // Total count of all shader instructions.  Complex shaders with high instruction counts may have long execution times.
ALU instruction count 32bit: 123      // Total count of all 32bit ALU shader instructions.
ALU instruction count 16bit: 690      // Total count of all 16bit ALU shader instructions.  Note: 16bit ALU instructions perform better and use less register space than 32bit instructions.
Complex instruction count: 46      // Total count of all complex instructions (sin, cos, etc).
Texture read instruction count: 35      // Total count of all texture read instructions.  Generally, VkImage reads, with or without a VkSampler.  Also includes input attachment reads.
Flow control instruction count: 83      // Total count of all flow control instructions.
Barrier and fence instruction count: 0      // Total count of all barrier and fence instructions.  Generally, op*Barrier instructions in the shader.
Short latency sync instruction count: 56      // Total count of all short latency sync instructions.
Long latency sync instruction count: 13      // Total count of all long latency sync instructions.
Full precision register footprint per shader instance: 9      // Number of 128bit registers used by each shader instance.  Each 128bit register may store 4 FP32 values.
Half precision register footprint per shader instance: 18      // Number of 64bit registers used by each shader instance.  Each 64bit register may store 4 FP16 values.
Overall register footprint per shader instance: 9      // Number of 128bit registers used by each shader instance.  Each 128bit register may store 4 FP32 values, or 8 FP16 values.
Scratch memory usage per shader instance: 0      // Number of 128bit slots of scratch memory used by each shader instance.  Warning: If the shader uses any scratch memory, it will perform poorly.
Output component count: 4      // Total count of all shader stage output components.
Input component count: 10      // Total count of all shader stage input components.
Shader processor utilization percentage: 37      // The maximum shader processor utilization for the shader.  Warning: If this number is low, the shader may perform poorly.
Memory read instruction count: 0      // Total count of all memory read instructions.  Generally, VkImage/VkBuffer reads through a storage descriptor.
Memory write instruction count: 0      // Total count of all memory write instructions.  Generally, VkImage/VkBuffer writes through a storage descriptor.

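The total number of instructions goes up a bit, from 1940 to 2016.
32bit ALU instructions go down from 758 to 123, while 16bit ALU instructions go up from 0 to 690.
Maximum shader processor utilization goes up from 25% to 37%.
Complex instruction count goes down from 66 to 46.
Short latency sync instruction count goes down from 65 to 56.
Overall register footprint per shader instance goes down from 13 to 9.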
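
To put the comparison in relative terms, here is a small sketch that computes the changes from the two dumps above. The struct and function names are made up for illustration; only the numbers come from the RenderDoc output.

```rust
/// Subset of the Adreno fragment shader statistics quoted above.
/// The struct and its field names are illustrative, not part of any API.
#[derive(Debug, Clone, Copy)]
struct ShaderStats {
    instructions: u32,
    alu_32bit: u32,
    alu_16bit: u32,
    complex: u32,
    short_latency_sync: u32,
    registers: u32,
    utilization_pct: u32,
}

/// Values from the first dump (before the float16 change).
const BEFORE: ShaderStats = ShaderStats {
    instructions: 1940,
    alu_32bit: 758,
    alu_16bit: 0,
    complex: 66,
    short_latency_sync: 65,
    registers: 13,
    utilization_pct: 25,
};

/// Values from the second dump (after the float16 change).
const AFTER: ShaderStats = ShaderStats {
    instructions: 2016,
    alu_32bit: 123,
    alu_16bit: 690,
    complex: 46,
    short_latency_sync: 56,
    registers: 9,
    utilization_pct: 37,
};

/// Relative change from `before` to `after`, in percent.
/// `before` must be non-zero, otherwise the result is infinite or NaN.
fn percent_change(before: u32, after: u32) -> f64 {
    (after as f64 - before as f64) / before as f64 * 100.0
}

/// Share of ALU instructions that run at 16bit precision, in percent.
fn half_precision_share(stats: &ShaderStats) -> f64 {
    let total = stats.alu_32bit + stats.alu_16bit;
    if total == 0 {
        0.0
    } else {
        stats.alu_16bit as f64 / total as f64 * 100.0
    }
}
```

Calling `percent_change(BEFORE.instructions, AFTER.instructions)` gives roughly +3.9%, `percent_change(BEFORE.registers, AFTER.registers)` gives roughly -30.8%, and `half_precision_share(&AFTER)` gives roughly 84.9%, so most of the ALU work moved to 16bit while the total instruction count barely changed.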

These numbers look promising but I did not see any difference in performance in the many helmets benchmark.

rasmusgo commented 2 years ago

It looks like we are bound by texture lookups, not computations.

kanerogers commented 1 year ago

I think this PR has now been mostly merged into the perf branch. Thanks for doing the initial work on this!

leetvr / hotham

Use float16 in fragment shader #402