shader-slang / slang

Making it easier to work with shaders
http://shader-slang.com
Other
2.91k stars 210 forks source link

DXC takes longer to compile when UE TSR shaders are handled by Slang #3805

Closed jkwak-work closed 7 months ago

jkwak-work commented 8 months ago

When we wire Slang in UE, Slang takes the input HLSL and generates another HLSL. The output HLSL from Slang is sent to DXC. In this case, the compile time of DXC takes unexpectedly longer than when Slang was not there.

Without Slang, DXC takes 63.49s for 38 TSR permutations. With Slang and DXC, Slang takes 44.44s and DXC takes 157.11s; total 201.55s.

We will need to investigate why DXC takes longer.

csyonghe commented 8 months ago

The UE5 integration is creating a slangGlobalSession every time when compiling an HLSL file. creating global session is a very expensive operation and is only intended to be called once by the application, and having the resulting global session shared for the entire application life-time. This should cut a lot of compile time from the benchmark.

We still need to investigate the increased dxc time though.

jkwak-work commented 8 months ago

There are 10 shaders for UE TSR. And there are 40 permutations. The compile time for most of shaders are unchanged and they take under 1 second.

I found that following five permutations are taking longer with DXC for compiling the HLSL output from Slang.

 TSRRejectShading.usf permutation 0: compile time increased from 6s to 17s
 TSRRejectShading.usf permutation 1: compile time increased from 15s to 123s
 TSRRejectShading.usf permutation 2: compile time increased from 2s to 3s
 TSRRejectShading.usf permutation 3: compile time increased from 8s to 20s
 TSRRejectShading.usf permutation 4: compile time increased from 19s to 119s
 TSRRejectShading.usf permutation 5: compile time increased from 2s to 4s

Note that all of them are from the same shader, TSRRejectShading.usf. And this one is the one that uses "enum" workaround.

Since we have a fix for the "AnyValue16" issue, I am going to replace "enum" with the existential type approach. Because "enum" workaround wasn't efficient, I expect to have an improved compile time with the new approach.

csyonghe commented 8 months ago

Unfortunately, Slang is generating a lot of boilerplate with the existential type approach too, so it is likely that we are still going to suffer from the compile time issue.

csyonghe commented 8 months ago

This is because with the way the code is written, we are falling to the path that generate dynamic dispatch logic from the existential type code. DXC will eventually find out all the RTTI stuff we generate are actually static resolvable after inlining everything, but still that will result a lot of time spend during the optimziation passes.

csyonghe commented 8 months ago

We can try something like:

interface IOperation
{
      static T init<T>(inout T state);
      static T add<T>(inout T state, T val);
      static T finalize<T>(T state);
}
csyonghe commented 8 months ago

If we stay out of generic interfaces, then we can just use ordinary interfaces with ordinary generic functions, and things should be nice to the compiler.

jkwak-work commented 8 months ago

We can try something like:

interface IOperation
{
      static T init<T>(inout T state);
      static T add<T>(inout T state, T val);
      static T finalize<T>(T state);
}

You read my mind. That is what I was going to try next. I will share an update after making the changes and try out.

jkwak-work commented 8 months ago

I have made some improvement on my enum WAR in TSRRejectShader.usf. https://gitlab-master.nvidia.com/jkwak/UnrealEngine_ngx/-/commit/9e8f1025a4d2123ffd0d78e8b2bf0bed6faef146

The improvement removed some of unnecessary dynamic branching logics.

DXC takes still longer to compile the shaders but not as much as before.

 TSRRejectShading.usf permutation 0: compile time increased from 6s to 9s (down from 17s)
 TSRRejectShading.usf permutation 1: compile time increased from 15s to 48s (down from 123s)
 TSRRejectShading.usf permutation 2: compile time increased from 2s to 2s (down from 3s)
 TSRRejectShading.usf permutation 3: compile time increased from 8s to 11s (down from 20s)
 TSRRejectShading.usf permutation 4: compile time increased from 19s to 55s (down from 119s)
 TSRRejectShading.usf permutation 5: compile time increased from 2s to 3s (down from 4s)
jkwak-work commented 8 months ago

With the improved WAR, the runtime is also improved. The previous observation was that the runtime of TSR went up by 30% when Slang generated the hlsl. With the improved WAR, the runtime is still little slower but not by far, 4%.

Without Slang, TSR takes 0.79ms on my current setup. With Slang, TSR takes 0.82ms on the same setup.

jkwak-work commented 8 months ago

Thanks to Yong, I made some changes based on his suggestions. And the compile time went down little more. https://gitlab-master.nvidia.com/jkwak/UnrealEngine_ngx/-/commit/ec3aab179c60c41918457a016bd3339bb9529d69 It was only a few lines of changes but the improvement was more than expected.

TSRRejectShading.usf permutation 0: compile time increased from 6s to 8s (down from 9s)
TSRRejectShading.usf permutation 1: compile time increased from 15s to 36s (down from 48s)
TSRRejectShading.usf permutation 2: compile time increased from 2s to 2s (same as 2s)
TSRRejectShading.usf permutation 3: compile time increased from 8s to 8s (down from 11s)
TSRRejectShading.usf permutation 4: compile time increased from 19s to 32s (down from 55s)
TSRRejectShading.usf permutation 5: compile time increased from 2s to 3s (same as 3s)

It really came down to the permutation 1 and 4. The differences on the permutations are following:

0 : DIM_FLICKERING_DETECTION=0 DIM_WAVE_SIZE=64
1 : DIM_FLICKERING_DETECTION=0 DIM_WAVE_SIZE=32
2 : DIM_FLICKERING_DETECTION=0 DIM_WAVE_SIZE=0
3 : DIM_FLICKERING_DETECTION=1 DIM_WAVE_SIZE=64
4 : DIM_FLICKERING_DETECTION=1 DIM_WAVE_SIZE=32
5 : DIM_FLICKERING_DETECTION=1 DIM_WAVE_SIZE=0

When I compare the permutation 0 and 1, I don't see anything obviously different other than permutation 1 will unroll little more.

csyonghe commented 8 months ago

The runtime perf might be due to us not supporting wavesize. We need to support it.

csyonghe commented 8 months ago

There is already an issue here: https://github.com/shader-slang/slang/issues/3385

csyonghe commented 8 months ago

Note that when COMPILER_SUPPORTS_WAVE_SIZE is defined, there will be different code. And that code is likely simpler to compile and run faster.

jkwak-work commented 8 months ago

I tried to see if the runtime performance is also improved. But it is hard to tell when I am using RDP. The number fluctuate a lot more and I cannot read a reliable number. I will try it on Monday again.

For COMPILER_SUPPORTS_WAVE_SIZE, I am not sure if it is a factor for the increased compile-time/runtime, because it is not defined when Slang is used and not used.

csyonghe commented 8 months ago

My understanding is that when not using slang, dxc sees the HLSL source with WaveSize and is compiling the other branch. When using slang we no longer generate code that uses WaveSize which can be mire complicated and runs slower.

jkwak-work commented 8 months ago

The only difference on the macro define when Slang is used is that "COMPILER_SLANG" is set to 1.

    if (Input.Environment.CompilerFlags.Contains(CFLAG_UseSlang))
    {
        AdditionalDefines.SetDefine(TEXT("COMPILER_SLANG"), 1);
    }

https://gitlab-master.nvidia.com/jkwak/UnrealEngine_ngx/-/commit/5bb9bcd16e984a26d678b895c611c569015bf67e

The rest of macro defines are exactly the same with and without Slang.

csyonghe commented 8 months ago

But I also see that there is a

ifdef COMPILER_SLANG

define COMPILER_SUPPORTS_WAVE_SIZE 0

...

endif

And those things may have more profound consequences.

jkwak-work commented 8 months ago

Oh! You are right. I almost forget about this.

#if SM6_PROFILE && !COMPILER_SLANG
        #define COMPILER_SUPPORTS_WAVE_SIZE 1
        #define WAVESIZE(N) [WaveSize(N)]
#endif

It looks definitely related.

I guess when the issue 3385 is resolved, we should revisit the compile-time issue. I think all of the other compile time looks fine.

jkwak-work commented 7 months ago

I have a local change for Slang that uses [WaveSize(N)]. With the change, the compile time went down but only by little.

TSRRejectShading.usf permutation 0: compile time increased from 6s to 8s (same as before)
TSRRejectShading.usf permutation 1: compile time increased from 15s to 33s (down from 36s)
TSRRejectShading.usf permutation 2: compile time increased from 2s to 2s (same as before)
TSRRejectShading.usf permutation 3: compile time increased from 8s to 8s (same as before)
TSRRejectShading.usf permutation 4: compile time increased from 19s to 30s (down from 32s)
TSRRejectShading.usf permutation 5: compile time increased from 2s to 2s (down from 3s)
csyonghe commented 7 months ago

This is still good news. My performance PR is worth a try here.


From: Jay Kwak @.> Sent: Monday, April 1, 2024 4:49:35 PM To: shader-slang/slang @.> Cc: Yong He @.>; Comment @.> Subject: Re: [shader-slang/slang] DXC takes longer to compile when UE TSR shaders are handled by Slang (Issue #3805)

I have a local change for Slang that uses [WaveSize(N)]. With the change, the compile time went down but only by little.

TSRRejectShading.usf permutation 0: compile time increased from 6s to 8s (same as before) TSRRejectShading.usf permutation 1: compile time increased from 15s to 33s (down from 36s) TSRRejectShading.usf permutation 2: compile time increased from 2s to 2s (same as before) TSRRejectShading.usf permutation 3: compile time increased from 8s to 8s (same as before) TSRRejectShading.usf permutation 4: compile time increased from 19s to 30s (down from 32s) TSRRejectShading.usf permutation 5: compile time increased from 2s to 2s (down from 3s)

— Reply to this email directly, view it on GitHubhttps://github.com/shader-slang/slang/issues/3805#issuecomment-2030806808, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAUHRBLJPUEUER3KR3AYBLDY3HXA7AVCNFSM6AAAAABFAE7J32VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMZQHAYDMOBQHA. You are receiving this because you commented.Message ID: @.***>

jkwak-work commented 7 months ago

It appears that the cause of slower runtime is same as this. In other words, TSRRejectShading.usf using WaveSize=32 increases both compile-time and runtime.

csyonghe commented 7 months ago

OK. We are probably unlikely to make any further speculative progress without bisecting into the functions and figure out what is causing the difference.

jkwak-work commented 7 months ago

I am going to close this issue and open a new issue for the specific compile permutation setting.