Closed TheJackiMonster closed 2 years ago
For what it's worth, I tried this PR on my test image, running each version 3 times. On average this PR runs 0.13% slower, but that's within the margin of error anyway. At least the output files were identical (same hash).
It generally depends on the GPU and its drivers. For me it was a little faster on average, but in theory branches can slow down the GPU's scheduler because they can cause the workers to get out of sync.
I removed the branch from the preprocessing and postprocessing compute shaders, which was only there for convenient indexing, and replaced it with a one-liner that calculates the same index for both cases without branching.
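To illustrate the idea of replacing an indexing branch with a single arithmetic expression, here is a minimal CPU-side sketch in Python. It is not the shader code from this PR. The two cases (an interleaved and a planar buffer layout), the `interleaved` flag, and both function names are assumptions made up for the example.

```python
def index_branching(x, y, c, width, height, channels, interleaved):
    """Reference version: picks one of two index formulas with a branch."""
    if interleaved:
        # Interleaved layout (hypothetical case 1): channels of a pixel are adjacent.
        return (y * width + x) * channels + c
    # Planar layout (hypothetical case 2): each channel is a full width*height plane.
    return (c * height + y) * width + x


def index_branchless(x, y, c, width, height, channels, interleaved):
    """Same result without a branch; `interleaved` must be 0 or 1."""
    # Stride between neighbouring pixels: `channels` if interleaved, else 1.
    pixel_stride = 1 + interleaved * (channels - 1)
    # Stride between channels: 1 if interleaved, else width * height.
    channel_stride = 1 + (1 - interleaved) * (width * height - 1)
    return (y * width + x) * pixel_stride + c * channel_stride
```

Both functions return the same index for every coordinate, so the output stays bit-identical. Whether the branchless form is actually faster depends on the GPU and driver.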
This should improve performance by a few percent, because code without branches runs better in parallel.
I also changed the local_size from 32x32x3 to 32x32x1, because GPUs tend to favor powers of 2 for local_size, and a depth of 1 should fit different channel counts better.
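The effect of the local_size change on channel counts can be sketched with some dispatch arithmetic. This is a rough illustration in Python that assumes the z dimension of the workgroup indexes image channels. The helper names `group_counts` and `idle_z_invocations` are hypothetical and do not come from the project.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def group_counts(width, height, channels, local_size):
    """Number of workgroups needed per dimension to cover the whole image."""
    lx, ly, lz = local_size
    return (ceil_div(width, lx), ceil_div(height, ly), ceil_div(channels, lz))


def idle_z_invocations(channels, lz):
    """Invocations along z that fall outside the channel range."""
    return ceil_div(channels, lz) * lz - channels


# A 4-channel image with a z size of 3 needs 2 groups along z,
# which leaves 2 of the 6 z invocations without a channel to process.
wasted_with_3 = idle_z_invocations(4, 3)
# With a z size of 1, every z invocation maps to exactly one channel.
wasted_with_1 = idle_z_invocations(4, 1)
```

Under this assumption, a z size of 1 never leaves invocations idle along z, regardless of the channel count, while a z size of 3 only fits channel counts that are multiples of 3.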