Closed by rasmusgo 1 year ago
The `KHR_pipeline_executable_properties` output for the PBR fragment shader, according to RenderDoc for Oculus, looks like this before vs. after:
Before:
======== Fragment Shader ========
Adreno Shader
==== Statistics ====
Subgroup Size: 64 // the subgroup size with which this executable is dispatched.
Instruction count all: 1940 // Total count of all shader instructions. Complex shaders with high instruction counts may have long execution times.
ALU instruction count 32bit: 758 // Total count of all 32bit ALU shader instructions.
ALU instruction count 16bit: 0 // Total count of all 16bit ALU shader instructions. Note: 16bit ALU instructions perform better and use less register space than 32bit instructions.
Complex instruction count: 66 // Total count of all complex instructions (sin, cos, etc).
Texture read instruction count: 35 // Total count of all texture read instructions. Generally, VkImage reads, with or without a VkSampler. Also includes input attachment reads.
Flow control instruction count: 82 // Total count of all flow control instructions.
Barrier and fence instruction count: 0 // Total count of all barrier and fence instructions. Generally, op*Barrier instructions in the shader.
Short latency sync instruction count: 65 // Total count of all short latency sync instructions.
Long latency sync instruction count: 13 // Total count of all long latency sync instructions.
Full precision register footprint per shader instance: 13 // Number of 128bit registers used by each shader instance. Each 128bit register may store 4 FP32 values.
Half precision register footprint per shader instance: 5 // Number of 64bit registers used by each shader instance. Each 64bit register may store 4 FP16 values.
Overall register footprint per shader instance: 13 // Number of 128bit registers used by each shader instance. Each 128bit register may store 4 FP32 values, or 8 FP16 values.
Scratch memory usage per shader instance: 0 // Number of 128bit slots of scratch memory used by each shader instance. Warning: If the shader uses any scratch memory, it will perform poorly.
Output component count: 4 // Total count of all shader stage output components.
Input component count: 10 // Total count of all shader stage input components.
Shader processor utilization percentage: 25 // The maximum shader processor utilization for the shader. Warning: If this number is low, the shader may perform poorly.
Memory read instruction count: 0 // Total count of all memory read instructions. Generally, VkImage/VkBuffer reads through a storage descriptor.
Memory write instruction count: 0 // Total count of all memory write instructions. Generally, VkImage/VkBuffer writes through a storage descriptor.
After:
======== Fragment Shader ========
Adreno Shader
==== Statistics ====
Subgroup Size: 64 // the subgroup size with which this executable is dispatched.
Instruction count all: 2016 // Total count of all shader instructions. Complex shaders with high instruction counts may have long execution times.
ALU instruction count 32bit: 123 // Total count of all 32bit ALU shader instructions.
ALU instruction count 16bit: 690 // Total count of all 16bit ALU shader instructions. Note: 16bit ALU instructions perform better and use less register space than 32bit instructions.
Complex instruction count: 46 // Total count of all complex instructions (sin, cos, etc).
Texture read instruction count: 35 // Total count of all texture read instructions. Generally, VkImage reads, with or without a VkSampler. Also includes input attachment reads.
Flow control instruction count: 83 // Total count of all flow control instructions.
Barrier and fence instruction count: 0 // Total count of all barrier and fence instructions. Generally, op*Barrier instructions in the shader.
Short latency sync instruction count: 56 // Total count of all short latency sync instructions.
Long latency sync instruction count: 13 // Total count of all long latency sync instructions.
Full precision register footprint per shader instance: 9 // Number of 128bit registers used by each shader instance. Each 128bit register may store 4 FP32 values.
Half precision register footprint per shader instance: 18 // Number of 64bit registers used by each shader instance. Each 64bit register may store 4 FP16 values.
Overall register footprint per shader instance: 9 // Number of 128bit registers used by each shader instance. Each 128bit register may store 4 FP32 values, or 8 FP16 values.
Scratch memory usage per shader instance: 0 // Number of 128bit slots of scratch memory used by each shader instance. Warning: If the shader uses any scratch memory, it will perform poorly.
Output component count: 4 // Total count of all shader stage output components.
Input component count: 10 // Total count of all shader stage input components.
Shader processor utilization percentage: 37 // The maximum shader processor utilization for the shader. Warning: If this number is low, the shader may perform poorly.
Memory read instruction count: 0 // Total count of all memory read instructions. Generally, VkImage/VkBuffer reads through a storage descriptor.
Memory write instruction count: 0 // Total count of all memory write instructions. Generally, VkImage/VkBuffer writes through a storage descriptor.
These numbers look promising but I did not see any difference in performance in the many helmets benchmark.
It looks like we are bound by texture lookups, not computations.
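To make the before/after comparison easier to scan, here is a minimal sketch that parses two such dumps and prints only the statistics that changed. The helper names (`parse_stats`, `diff_stats`) are hypothetical and not part of any tool mentioned here, and the embedded dumps are a shortened subset of the lines above with the trailing comments removed.

```python
import re

# Matches lines such as "ALU instruction count 32bit: 758 // Total count ...".
# Header lines ("==== Statistics ====", "Adreno Shader") have no "name: number"
# shape and are skipped.
STAT_LINE = re.compile(r"^(?P<name>[^:]+):\s*(?P<value>\d+)\b")


def parse_stats(dump):
    """Collect 'name: integer' pairs from a statistics dump (hypothetical helper)."""
    stats = {}
    for line in dump.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match.group("name").strip()] = int(match.group("value"))
    return stats


def diff_stats(before, after):
    """Return {name: (before, after, delta)} for statistics whose value changed."""
    return {
        name: (before[name], after[name], after[name] - before[name])
        for name in before
        if name in after and after[name] != before[name]
    }


# Subset of the two dumps from this thread, comments stripped for brevity.
BEFORE = """\
======== Fragment Shader ========
Instruction count all: 1940
ALU instruction count 32bit: 758
ALU instruction count 16bit: 0
Complex instruction count: 66
Texture read instruction count: 35
Full precision register footprint per shader instance: 13
Half precision register footprint per shader instance: 5
Shader processor utilization percentage: 25
"""

AFTER = """\
======== Fragment Shader ========
Instruction count all: 2016
ALU instruction count 32bit: 123
ALU instruction count 16bit: 690
Complex instruction count: 46
Texture read instruction count: 35
Full precision register footprint per shader instance: 9
Half precision register footprint per shader instance: 18
Shader processor utilization percentage: 37
"""

changes = diff_stats(parse_stats(BEFORE), parse_stats(AFTER))
for name, (old, new, delta) in changes.items():
    print(f"{name}: {old} -> {new} ({delta:+d})")
```

In this subset the texture read instruction count is the only statistic that stays the same (35 in both dumps), which is at least consistent with the benchmark being bound by texture lookups rather than ALU work.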
I think this PR has now been mostly merged into the perf branch. Thanks for doing the initial work on this!
~Almost working as it should. There is a problem with specular highlight that I am trying to pin down at the moment.~
~I also made a convenience thing for updating the pbr rendering tests if the `UPDATE_IMAGES` env var is truthy. That might belong in a separate PR.~

I have fixed several issues with the spotlights and I think this is starting to look good enough now. There is a difference in how reflections of spotlights are rendered on very shiny surfaces. The new code is more likely to show a reflection of the spotlight, but it flickers. This is visible in the normal tangents test scene but not on the helmet.
I ran the many helmets benchmark but was surprised to see no difference in performance. Perhaps the bottleneck is somewhere else?