redorav / hlslpp

Math library using HLSL syntax with multiplatform SIMD support
MIT License
597 stars, 47 forks

How well does this cover HLSL202x? #66

Open devshgraphicsprogramming opened 1 year ago

devshgraphicsprogramming commented 1 year ago

I'm currently searching for and evaluating libraries that will let me share as much shader code with the CPU (regular functions, structs and HLSL "built-in"s) as possible.

HLSL 2021 now has templates, so I'm wondering how closely your type declarations match HLSL 2021's internals.

For example, now matrices are defined in terms of a template in HLSL, so apparently you can do this in HLSL2021 to implement a chainRule utility function: https://github.com/microsoft/hlsl-specs/issues/24#issuecomment-1370297244

redorav commented 1 year ago

Hi @devshgraphicsprogramming hlsl++ doesn't provide templated types like what you're looking for.

Once upon a time types were declared in terms of templates such as floatNxM and floatN. It was very slow to compile, and the code required to call functions and pass them around was complicated to write and debug. Template resolution rules are very finicky, and I had to test that functions compiled by actually calling them; otherwise non-instantiated templates wouldn't necessarily be validated. Lots of SFINAE and template magic.

I deleted all that and never looked back. That's probably not useful for you. However, you can probably implement your own matrix<N, M> mapping to the hlsl++ types if your project requires it, and pass them into the functions as you normally would.

struct float4x4 {};

template<int N, int M>
struct matrix {};

template<>
struct matrix<4, 4> : float4x4 {};

Any function taking a float4x4 should accept that matrix<4, 4> as if it were a float4x4. If you require any support with that, I can try to help out.
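A minimal sketch of the inheritance mapping described above, assuming a stand-in float4x4 that stores plain floats (hlsl++'s real float4x4 holds SIMD registers) and a hypothetical `trace` function written against the concrete type:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Stand-in for the concrete SIMD-backed type (hypothetical layout).
struct float4x4
{
    float m[4][4] = {};
};

// Generic HLSL-style spelling; only sizes with a concrete struct are specialized.
template<int N, int M>
struct matrix;

template<>
struct matrix<4, 4> : float4x4 {};

// matrix<4, 4> is-a float4x4, so it binds to float4x4 references and parameters.
static_assert(std::is_base_of<float4x4, matrix<4, 4>>::value, "derives from float4x4");

// Existing function written only against the concrete type (hypothetical).
inline float trace(const float4x4& a)
{
    float t = 0.0f;
    for (std::size_t i = 0; i < 4; ++i)
        t += a.m[i][i];
    return t;
}
```

Note that matrix<4, 4> and float4x4 remain distinct types (`std::is_same` is false), which matters for exact-match overloads and template argument deduction.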

devshgraphicsprogramming commented 1 year ago

OK, this is quite strange, as @llvm-beanz says that what's going on under the hood (https://github.com/microsoft/hlsl-specs/issues/24#issuecomment-1372813829) is that in DXC (which is a fork of LLVM 3.7) the vector and matrix types are templates, and floatN and floatNxM are aliases of these.

And you're telling me that for every floatNxM and floatN you have a separate struct? (basically the reverse of how the types are declared in DXC).

> Once upon a time types were declared in terms of templates such as floatNxM and floatN. It was very slow to compile, and the code required to call functions and pass them around was complicated to write and debug. Template resolution rules are very finicky, and I had to test that functions compiled by actually calling them; otherwise non-instantiated templates wouldn't necessarily be validated. Lots of SFINAE and template magic.

Any chance of going back to that, given float8 and the combinatorial explosion introduced by going from 4x4 to 8x8 for the matrices?

llvm-beanz commented 1 year ago

FWIW, the vector and matrix templates aren't just an implementation detail. They are part of the language, and have been since long before HLSL 2021. See:

https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-vector
https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-matrix

redorav commented 1 year ago

Hi @devshgraphicsprogramming

> OK, this is quite strange, as @llvm-beanz says that what's going on under the hood (https://github.com/microsoft/hlsl-specs/issues/24#issuecomment-1372813829)

It's possible that's what happens in DXC under the hood; this library provides compatibility for practical HLSL. To be perfectly honest, there's no real need for templates there either, unless they make the DXC developers' lives easier, as there are no types larger than 4 components, and on GPUs it makes no sense anyway since they aren't designed around SIMD types anymore.

> And you're telling me that for every floatNxM and floatN you have a separate struct? (basically the reverse of how the types are declared in DXC).

I have a separate struct for every floatNxM, but I had to anyway, to specialize them for the different SIMD types. Like I said in the other bug, this library is aimed at making SIMD on the CPU look like HLSL, with a few extensions here and there, such as quaternions and float8, which can be useful for batch processing.

I don't model the implementation on DXC's, but I try to follow the interface, and there are things I cannot do, like the ternary operator (C++ doesn't allow overloading ?:). There is little chance anyone will need anything much more complicated when doing graphics, which is what this library is primarily aimed at. You've also noticed that the sizes of the types aren't what their names say (float3 is a 4-float type).
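Since C++ cannot overload `?:`, a CPU library has to expose HLSL's component-wise vector ternary as a named function. A minimal sketch, assuming plain-float stand-in types and a hypothetical `select` function (not a claim about hlsl++'s actual API):

```cpp
#include <cassert>

// Stand-in vector types (hypothetical, not hlsl++'s SIMD-backed ones).
struct float4 { float x, y, z, w; };
struct bool4  { bool  x, y, z, w; };

// HLSL's `c ? a : b` on vectors picks each component independently;
// in C++ this has to be spelled as an ordinary function call.
inline float4 select(bool4 c, float4 a, float4 b)
{
    return { c.x ? a.x : b.x,
             c.y ? a.y : b.y,
             c.z ? a.z : b.z,
             c.w ? a.w : b.w };
}
```

A SIMD implementation would typically replace the per-component branches with a mask-and-blend, but the call site looks the same.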

> Any chance of going back to that, given float8 and the combinatorial explosion introduced by going from 4x4 to 8x8 for the matrices?

There is little chance I'll be going back to templates, but you can always fork the code if it makes your life easier. I also haven't thought of an 8x8 matrix, is this something you would need?

@llvm-beanz

I understand that it might be a part of the language, but I've been programming HLSL for over 10 years and never had to use it in a templated manner. The swizzles are part of the language as well and they have to be emulated in awkward ways in C++. Many of the matrix swizzles aren't possible or provided.

llvm-beanz commented 1 year ago

The need to use the vector and matrix templates directly is much more common with HLSL 2021, where the specific vector type you want to use might be resolved in a template-dependent context. For example, you might want to implement a method that works on any 3-element vector; you could write a template like:

template <typename ElementTy>
void myFn(vector<ElementTy, 3> MyVec) {...}
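In C++ terms, the HLSL example above corresponds roughly to the following sketch. It assumes a DXC-style layout (one `vector<T, N>` template, with float3 and friends declared as aliases of it); `sum3` is a hypothetical stand-in for `myFn`:

```cpp
#include <cassert>
#include <type_traits>

// DXC-style layout (a sketch): one primary template, familiar names are aliases.
template<typename T, int N>
struct vector
{
    T v[N];
};

using float3 = vector<float, 3>;
using int3   = vector<int, 3>;
using float4 = vector<float, 4>;

// float3 *is* vector<float, 3>, not a new type.
static_assert(std::is_same<float3, vector<float, 3>>::value, "alias, not a new type");

// ElementTy is deduced from the argument; 4-element vectors simply don't match.
template<typename ElementTy>
ElementTy sum3(const vector<ElementTy, 3>& vec)
{
    return vec.v[0] + vec.v[1] + vec.v[2];
}
```

Calling `sum3` with a float4 fails to compile, which is the "any 3-element vector" restriction expressed in the signature.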

There are other contexts where we've seen users historically use the vector and matrix templates under preprocessor macros, but I do prefer to pretend the C preprocessor doesn't exist.

redorav commented 1 year ago

I understand that for templated code you might want to restrict functions in that way, it's just that I've never had the need to write code like that. Consider that the only elements that realistically go in there are uint, int and float (maybe half/double these days, although I have never seen double shipped). For any sort of non-trivial code the reuse is minimal. If you can provide a real world use case for this (not just that it can be done and how) it would help the discussion. I am trying to see your point though, and perhaps for that kind of use case it might make sense.

That said, if you were to write a header that derives from the hlsl++ types, you could get this kind of template behavior without even modifying the hlsl++ internals. Templates are a recent addition to DXC and I've yet to see a shader codebase use them (which doesn't mean they aren't common; I'm just drawing from my experience).
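With separately declared per-size structs, a generic "any 3-element vector" function can still be written by adding a traits layer on top, without touching the types themselves. A minimal sketch, assuming plain stand-in structs and a hypothetical `vector_traits` trait (neither is hlsl++'s real API):

```cpp
#include <cassert>
#include <type_traits>

// Separately declared structs, standing in for per-size SIMD types.
struct float3 { float x, y, z; };
struct int3   { int x, y, z; };
struct float4 { float x, y, z, w; };

// Hypothetical trait recovering element type and size for each struct.
template<typename V> struct vector_traits;  // undefined for non-vector types
template<> struct vector_traits<float3> { using element = float; static constexpr int size = 3; };
template<> struct vector_traits<int3>   { using element = int;   static constexpr int size = 3; };
template<> struct vector_traits<float4> { using element = float; static constexpr int size = 4; };

// "Any 3-element vector": deduce V, then constrain on the trait via SFINAE.
template<typename V,
         typename std::enable_if<vector_traits<V>::size == 3, int>::type = 0>
typename vector_traits<V>::element sum3(const V& vec)
{
    return vec.x + vec.y + vec.z;
}
```

The cost is one trait specialization per struct, which is the bookkeeping an alias-based design gets for free.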

I get that you want to avoid the preprocessor, and all the problems surrounding it. But the thing I dislike even more is long compile times and I very much avoid templates as much as I can these days.

devshgraphicsprogramming commented 1 year ago

> For any sort of non-trivial code the reuse is minimal. If you can provide a real world use case for this (not just that it can be done and how) it would help the discussion. I am trying to see your point though, and perhaps for that kind of use case it might make sense.

Let me show you my 30-minute Vulkanized 2023 talk: https://www.youtube.com/watch?v=JGiKTy_Csv8&t=1050s

Consider our statically polymorphic BxDFs (see the NDF traits and Cook-Torrance struct): https://github.com/Devsh-Graphics-Programming/Nabla/pull/475/files/0efa0574fcb32cc4566e61903dd112236528f23e..e2daa633d0fbbedd54ec498c80826cd6def8eadf

Or our lower_bound, upper_bound and Workgroup and Single Dispatch Scans: https://github.com/Devsh-Graphics-Programming/Nabla/pull/438/files

> But the thing I dislike even more is long compile times and I very much avoid templates as much as I can these days.

Explicit instantiation and extern template exist, and so do precompiled headers.

I don't see how a PCH (or extern template) with explicit instantiation would be slower to compile than one handwritten struct per specialization in the header. In fact, it should be faster, because codegen is performed once rather than in every translation unit that includes the header.
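A minimal sketch of the extern template pattern being referred to, using a hypothetical generic `vec<T, N>` type. In a real project the first part lives in a header and the explicit instantiation definition in exactly one .cpp file; here both are in one file for brevity, which the standard allows as long as the definition follows the declaration:

```cpp
#include <cassert>

// --- header part (hypothetical generic vector type) ---
template<typename T, int N>
struct vec
{
    T v[N];
    T dot(const vec& o) const;  // defined out of line below
};

template<typename T, int N>
T vec<T, N>::dot(const vec& o) const
{
    T r{};
    for (int i = 0; i < N; ++i)
        r += v[i] * o.v[i];
    return r;
}

// Explicit instantiation declaration: every TU that includes the header
// promises not to implicitly instantiate vec<float, 4>'s non-inline members.
extern template struct vec<float, 4>;

// --- single .cpp part ---
// Explicit instantiation definition: the one place those members are
// instantiated and code-generated.
template struct vec<float, 4>;
```

This avoids repeated instantiation and code generation per translation unit; the header itself is still parsed in each one.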

redorav commented 1 year ago

> Let me show you my 30-minute Vulkanized 2023 talk: https://www.youtube.com/watch?v=JGiKTy_Csv8&t=1050s

I'll take a look at the talk when I get a bit of time.

> Consider our statically polymorphic BxDFs (see the NDF traits and Cook-Torrance struct):

I've taken a look at the provided code, but I haven't found much that relies on this restriction of a vector's dimensions. I understand that templates can be useful, and I see the templating of code like your BRDF, but I'm not sure what that has to do with matrices and vectors just yet. I've also always accomplished that via defines to the shader compiler, which you probably need to pass in anyway to select between two shaders that do different BRDFs, so I'm not quite sure what the benefit is.
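For readers comparing the two approaches, here is a minimal sketch of BRDF selection through static polymorphism rather than defines: NDF types share a static interface, and a Cook-Torrance evaluator is templated on the NDF. All names are hypothetical and this is not Nabla's actual code; the NDFs use the standard GGX and Beckmann formulas, and the G and F terms are passed in to keep it short:

```cpp
#include <cassert>
#include <cmath>

constexpr float kPi = 3.14159265358979f;

// Each NDF exposes the same static interface: D(NdotH, alpha).
struct GGX
{
    static float D(float NdotH, float alpha)
    {
        const float a2 = alpha * alpha;
        const float d  = NdotH * NdotH * (a2 - 1.0f) + 1.0f;
        return a2 / (kPi * d * d);
    }
};

struct Beckmann
{
    static float D(float NdotH, float alpha)
    {
        const float a2 = alpha * alpha;
        const float c2 = NdotH * NdotH;
        return std::exp((c2 - 1.0f) / (a2 * c2)) / (kPi * a2 * c2 * c2);
    }
};

// Cook-Torrance specular D * G * F / (4 NdotL NdotV), specialized per NDF
// at compile time instead of via a preprocessor define.
template<typename NDF>
struct CookTorrance
{
    static float eval(float NdotL, float NdotV, float NdotH,
                      float alpha, float G, float F)
    {
        return NDF::D(NdotH, alpha) * G * F / (4.0f * NdotL * NdotV);
    }
};
```

Both CookTorrance<GGX> and CookTorrance<Beckmann> can be instantiated in the same program, which is harder to express with a single define per shader.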

I've been through the compile-time pain already, and I'm not rewriting it to be templated again. If we can find a solution that satisfies you, I would leave it at that. I'm still not happy with the compile times of hlsl++, as swizzle instantiation is quite expensive; I've even considered providing a define that disables swizzles for those who don't need them, although to be honest they are one of the main attractions of the library and I don't even disable them myself.

I know about PCHs and all the other lousy hacks for C++, and I'm not happy about any of them. One of the things I need to do at some point is provide an hlsl++ module. When I programmed the bulk of the library, modules weren't out, and even today support is still not widespread. I've played around with them in a toy STL I'm programming. If there was interest I could consider it. It would take a bit of effort because of all the system headers that could get included, which is a bit unfortunate.

devshgraphicsprogramming commented 1 year ago

> I've also always accomplished that via defines to the shader compiler, which you probably need to pass in anyway to select between two shaders that do different BRDFs, so I'm not quite sure what the benefit is.

Make sure to watch the first talk.

Then watch this one to address the choice of "using different BRDFs" without defines: https://www.youtube.com/watch?v=Ru3YutCVXsM

We use Nabla instead of Unreal for many reasons, and one of them is not having to compile 40k shader permutations at startup.

> I know about PCHs and all the other lousy hacks for C++, and I'm not happy about any of them. One of the things I need to do at some point is provide an hlsl++ module. When I programmed the bulk of the library, modules weren't out, and even today support is still not widespread. I've played around with them in a toy STL I'm programming. If there was interest I could consider it. It would take a bit of effort because of all the system headers that could get included, which is a bit unfortunate.

Can you share any of your old performance numbers about how much templating HLSL++ hurt your compile times?

Also does everything have to be inline? Are the linkers really that dumb in your experience?

redorav commented 1 year ago

The compile times varied between compilers, but one ARM compiler for some reason took about 40 seconds to compile the unit test solution. There is a commit where I changed it all, with some numbers: the compilation time was cut by almost half across the board. I wish I'd had something like Compile Score back then to give you more detailed numbers.

> Also does everything have to be inline? Are the linkers really that dumb in your experience?

Not everything has to be inline; I just made it header-only to begin with for simplicity. Most small methods are force-inlined because I did measure performance differences for small functions. Some of the larger ones are tagged only as inline, and it's up to the compiler to decide. I could certainly do better here if need be, and if you have any suggestions I'm happy to hear them.
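A common way to express the split described above in a header-only library is a portable force-inline macro for small hot functions and plain `inline` (which in a header is mainly about avoiding multiple-definition errors, not a strong inlining hint) for larger ones. A minimal sketch; `MYLIB_FORCEINLINE` is a hypothetical name, and hlsl++'s own macro is not reproduced here:

```cpp
#include <cassert>

// Hypothetical portable force-inline macro.
#if defined(_MSC_VER)
    #define MYLIB_FORCEINLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
    #define MYLIB_FORCEINLINE inline __attribute__((always_inline))
#else
    #define MYLIB_FORCEINLINE inline
#endif

struct float2 { float x, y; };

// Small, hot function: ask the compiler to always inline it.
MYLIB_FORCEINLINE float2 add(float2 a, float2 b)
{
    return { a.x + b.x, a.y + b.y };
}

// Larger function: plain `inline` keeps the header ODR-safe and leaves
// the inlining decision to the compiler's heuristics.
inline float length_sq(float2 a)
{
    return a.x * a.x + a.y * a.y;
}
```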

devshgraphicsprogramming commented 1 year ago

At the end of the day, it doesn't matter whether you explicitly specialize your templated struct or write out separate structs and template their aliases.

You'd probably want to use traits-like templates rather than inheritance:

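struct float4x4 {};

namespace impl
{
// Primary template: rejects out-of-range sizes. In-range sizes without a
// specialization still have no `type` member, so using them fails too.
template<int N, int M>
struct matrix
{
    static_assert(N > 1 && N <= 4, "row count must be in [2, 4]");
    static_assert(M > 1 && M <= 4, "column count must be in [2, 4]");
};

// Map each supported size to its concrete struct.
template<>
struct matrix<4, 4>
{
    using type = float4x4;
};

}

// `typename` is required here before C++20 (dependent type in an alias).
template<int N, int M>
using matrix = typename impl::matrix<N, M>::type;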

since you'd want matrix<4,4> to be the same type as float4x4.
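A self-contained sketch checking that the alias approach really yields the concrete struct. It uses stand-in float3x3/float4x4 structs; `matrix_dims` and `element_count` are hypothetical names. One caveat: N and M cannot be deduced back through such an alias (it is a non-deduced context), so generic code that needs the dimensions of a given matrix type needs a reverse mapping:

```cpp
#include <type_traits>

struct float3x3 { float m[3][3]; };
struct float4x4 { float m[4][4]; };

namespace impl
{
template<int N, int M>
struct matrix
{
    static_assert(N > 1 && N <= 4, "row count must be in [2, 4]");
    static_assert(M > 1 && M <= 4, "column count must be in [2, 4]");
};
template<> struct matrix<3, 3> { using type = float3x3; };
template<> struct matrix<4, 4> { using type = float4x4; };
}

template<int N, int M>
using matrix = typename impl::matrix<N, M>::type;

// The alias names the concrete struct itself, not a derived wrapper.
static_assert(std::is_same<matrix<4, 4>, float4x4>::value, "same type");

// Reverse mapping from concrete struct to dimensions (hypothetical).
template<typename T> struct matrix_dims;
template<> struct matrix_dims<float3x3> { static constexpr int rows = 3, cols = 3; };
template<> struct matrix_dims<float4x4> { static constexpr int rows = 4, cols = 4; };

template<typename Mat>
constexpr int element_count()
{
    return matrix_dims<Mat>::rows * matrix_dims<Mat>::cols;
}
```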