Closed jkwak-work closed 7 months ago
The UE5 integration is creating a slangGlobalSession every time it compiles an HLSL file.
Creating a global session is a very expensive operation. It is intended to be done once by the application, with the resulting global session shared for the entire application lifetime. This should cut a lot of compile time from the benchmark.
We still need to investigate the increased dxc time though.
There are 10 shaders for UE TSR, with 40 permutations in total. The compile time for most of the shaders is unchanged, and they take under 1 second.
I found that the following six permutations take longer when DXC compiles the HLSL output from Slang.
TSRRejectShading.usf permutation 0: compile time increased from 6s to 17s
TSRRejectShading.usf permutation 1: compile time increased from 15s to 123s
TSRRejectShading.usf permutation 2: compile time increased from 2s to 3s
TSRRejectShading.usf permutation 3: compile time increased from 8s to 20s
TSRRejectShading.usf permutation 4: compile time increased from 19s to 119s
TSRRejectShading.usf permutation 5: compile time increased from 2s to 4s
Note that all of them are from the same shader, TSRRejectShading.usf, which is the one that uses the "enum" workaround.
Since we have a fix for the "AnyValue16" issue, I am going to replace the "enum" workaround with the existential-type approach. Because the "enum" workaround wasn't efficient, I expect improved compile times with the new approach.
Unfortunately, Slang generates a lot of boilerplate with the existential-type approach too, so it is likely that we will still suffer from the compile-time issue.
This is because, with the way the code is written, we fall into the path that generates dynamic-dispatch logic from the existential-type code. DXC will eventually figure out that all the RTTI machinery we generate is actually statically resolvable after inlining everything, but that still results in a lot of time spent in the optimization passes.
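For illustration, here is a minimal sketch (with hypothetical names, not the actual TSR code) of the kind of existential-typed parameter that falls into this path:

// Hypothetical sketch: passing a value of interface type (an existential)
// makes Slang emit dynamic-dispatch/RTTI plumbing, which DXC then has to
// prove statically resolvable and optimize away after inlining.
interface IReject
{
    float score(float3 color);
}

struct LumaReject : IReject
{
    float score(float3 color) { return dot(color, float3(0.299, 0.587, 0.114)); }
}

// 'op' has existential type here: the concrete type is resolved at runtime,
// so the generated HLSL contains dispatch logic even when only one concrete
// type is ever used.
float evaluate(IReject op, float3 color)
{
    return op.score(color);
}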
We can try something like:
interface IOperation
{
static T init<T>(inout T state);
static T add<T>(inout T state, T val);
static T finalize<T>(T state);
}
If we stay out of generic interfaces, then we can just use ordinary interfaces with ordinary generic functions, and things should be nice to the compiler.
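The suggestion above could look something like the following; this is a minimal sketch with hypothetical names and a simplified (non-static, non-generic-method) interface, not the actual TSR code:

// Ordinary interface, specialized statically through an ordinary
// generic function: no existential values, so no dynamic dispatch
// reaches the HLSL that DXC sees.
interface IOperation
{
    float init();
    float add(float state, float val);
    float finalize(float state);
}

struct SumOperation : IOperation
{
    float init() { return 0.0; }
    float add(float state, float val) { return state + val; }
    float finalize(float state) { return state; }
}

// Generic function constrained on the interface; Slang specializes this
// per concrete type at compile time.
float reduce<TOp : IOperation>(TOp op, float vals[4])
{
    float state = op.init();
    for (int i = 0; i < 4; ++i)
        state = op.add(state, vals[i]);
    return op.finalize(state);
}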
You read my mind. That is what I was going to try next. I will share an update after making the changes and trying them out.
I have made some improvement on my enum WAR in TSRRejectShader.usf. https://gitlab-master.nvidia.com/jkwak/UnrealEngine_ngx/-/commit/9e8f1025a4d2123ffd0d78e8b2bf0bed6faef146
The improvement removed some unnecessary dynamic-branching logic.
DXC still takes longer to compile the shaders, but not as much as before.
TSRRejectShading.usf permutation 0: compile time increased from 6s to 9s (down from 17s)
TSRRejectShading.usf permutation 1: compile time increased from 15s to 48s (down from 123s)
TSRRejectShading.usf permutation 2: compile time back to the baseline 2s (down from 3s)
TSRRejectShading.usf permutation 3: compile time increased from 8s to 11s (down from 20s)
TSRRejectShading.usf permutation 4: compile time increased from 19s to 55s (down from 119s)
TSRRejectShading.usf permutation 5: compile time increased from 2s to 3s (down from 4s)
With the improved WAR, the runtime is also improved. The previous observation was that the runtime of TSR went up by 30% when Slang generated the HLSL. With the improved WAR, the runtime is still a little slower, but only by about 4%.
Without Slang, TSR takes 0.79ms on my current setup. With Slang, TSR takes 0.82ms on the same setup.
Thanks to Yong, I made some changes based on his suggestions, and the compile time went down a little more. https://gitlab-master.nvidia.com/jkwak/UnrealEngine_ngx/-/commit/ec3aab179c60c41918457a016bd3339bb9529d69 It was only a few lines of changes, but the improvement was more than expected.
TSRRejectShading.usf permutation 0: compile time increased from 6s to 8s (down from 9s)
TSRRejectShading.usf permutation 1: compile time increased from 15s to 36s (down from 48s)
TSRRejectShading.usf permutation 2: compile time unchanged at 2s (same as before)
TSRRejectShading.usf permutation 3: compile time back to the baseline 8s (down from 11s)
TSRRejectShading.usf permutation 4: compile time increased from 19s to 32s (down from 55s)
TSRRejectShading.usf permutation 5: compile time increased from 2s to 3s (same as before)
It really comes down to permutations 1 and 4. The differences between the permutations are as follows:
0 : DIM_FLICKERING_DETECTION=0 DIM_WAVE_SIZE=64
1 : DIM_FLICKERING_DETECTION=0 DIM_WAVE_SIZE=32
2 : DIM_FLICKERING_DETECTION=0 DIM_WAVE_SIZE=0
3 : DIM_FLICKERING_DETECTION=1 DIM_WAVE_SIZE=64
4 : DIM_FLICKERING_DETECTION=1 DIM_WAVE_SIZE=32
5 : DIM_FLICKERING_DETECTION=1 DIM_WAVE_SIZE=0
When I compare permutations 0 and 1, I don't see anything obviously different other than that permutation 1 unrolls a little more.
The runtime perf difference might be due to us not supporting [WaveSize]. We need to support it.
There is already an issue here: https://github.com/shader-slang/slang/issues/3385
Note that when COMPILER_SUPPORTS_WAVE_SIZE is defined, different code is generated, and that code is likely simpler to compile and faster to run.
I tried to see if the runtime performance is also improved, but it is hard to tell while I am using RDP. The numbers fluctuate a lot more and I cannot get a reliable reading. I will try again on Monday.
For COMPILER_SUPPORTS_WAVE_SIZE, I am not sure whether it is a factor in the increased compile time/runtime, because it is not defined whether Slang is used or not.
My understanding is that when not using Slang, DXC sees the HLSL source with [WaveSize] and compiles that branch. When using Slang, we no longer generate code that uses [WaveSize], so DXC compiles the other branch, which can be more complicated and run slower.
The only difference on the macro define when Slang is used is that "COMPILER_SLANG" is set to 1.
if (Input.Environment.CompilerFlags.Contains(CFLAG_UseSlang))
{
AdditionalDefines.SetDefine(TEXT("COMPILER_SLANG"), 1);
}
The rest of the macro defines are exactly the same with and without Slang.
But I also see that there is a
...
And those things may have more profound consequences.
Oh! You are right. I almost forgot about this.
#if SM6_PROFILE && !COMPILER_SLANG
#define COMPILER_SUPPORTS_WAVE_SIZE 1
#define WAVESIZE(N) [WaveSize(N)]
#endif
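For context, in the non-Slang path this macro would be applied at the compute entry point, along the lines of the sketch below (entry-point name and thread-group size are hypothetical, and it assumes WAVESIZE(N) expands to nothing when COMPILER_SUPPORTS_WAVE_SIZE is not set):

// When COMPILER_SUPPORTS_WAVE_SIZE is 1, WAVESIZE(32) expands to
// [WaveSize(32)], pinning the wave size so DXC compiles the simpler
// wave-size-aware branch; otherwise the attribute disappears.
WAVESIZE(32)
[numthreads(8, 8, 1)]
void MainCS(uint3 DispatchThreadId : SV_DispatchThreadID)
{
    // ...
}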
It definitely looks related.
I guess when issue 3385 is resolved, we should revisit the compile-time issue. I think all of the other compile times look fine.
I have a local change for Slang that uses [WaveSize(N)]. With the change, the compile time went down, but only by a little.
TSRRejectShading.usf permutation 0: compile time increased from 6s to 8s (same as before)
TSRRejectShading.usf permutation 1: compile time increased from 15s to 33s (down from 36s)
TSRRejectShading.usf permutation 2: compile time unchanged at 2s (same as before)
TSRRejectShading.usf permutation 3: compile time unchanged at 8s (same as before)
TSRRejectShading.usf permutation 4: compile time increased from 19s to 30s (down from 32s)
TSRRejectShading.usf permutation 5: compile time increased from 2s to 2s (down from 3s)
This is still good news. My performance PR is worth a try here.
It appears that the cause of the slower runtime is the same. In other words, TSRRejectShading.usf with WaveSize=32 increases both compile time and runtime.
OK. We are unlikely to make any further speculative progress without bisecting into the functions and figuring out what is causing the difference.
I am going to close this issue and open a new issue for the specific compile permutation setting.
When we wire Slang into UE, Slang takes the input HLSL and generates another HLSL, which is then sent to DXC. In this case, DXC takes unexpectedly longer to compile than when Slang is not involved.
Without Slang, DXC takes 63.49s for 38 TSR permutations. With Slang and DXC, Slang takes 44.44s and DXC takes 157.11s; total 201.55s.
We will need to investigate why DXC takes longer.