shader-slang / slang


Layout for types with existentials in them #811

Closed. tangent-vector closed this issue 9 months ago

tangent-vector commented 5 years ago

Suppose we have an interface that we want to use for shader parameters:

interface IRounder { int round(float f); }

And then we want to declare a struct type for our application parameters so that we can pass them in one ParameterBlock:

struct MyParams
{
    IRounder rounder;
    float valueToRound;
};
ParameterBlock<MyParams> myParams;

In order to fully support interface types for shader parameters, we are going to need a way for the application to specify, via the Slang API, the concrete type of myParams.rounder. For the purposes of this issue, let's suppose that API exists, and the user has decided to plug in this silly implementation:

struct MyRounder : IRounder
{
    int closeEnough;
    int round(float f) { return closeEnough; }
};

Now the question is: what should we use for the in-memory layout of the myParams block?

One naive extreme is to say that existential fields should behave more or less like fields that use a generic type parameter, so that we lay things out as if the declaration had been:

struct MyParams<R>
{
    R rounder;
    float valueToRound;
};

Our current rule for laying out types that use generic type parameters is to follow the pattern of C++ templates and lay out the type that would result after substitution (so that the valueToRound field has a different offset depending on the final type of rounder).
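As a rough C++ analogy of that substitution rule (C++ struct layout stands in for the target's actual layout rules here, and the types below are made up for the example), the offset of valueToRound shifts with the concrete choice of R:

#include <cstddef>
#include <cstdio>

// Analog of MyParams<R>: the byte offset of valueToRound depends on what R is.
template<typename R>
struct MyParamsT { R rounder; float valueToRound; };

struct SmallRounder { int closeEnough; };   // 4 bytes
struct BigRounder   { float table[16]; };   // 64 bytes

int main()
{
    std::printf("%zu\n", offsetof(MyParamsT<SmallRounder>, valueToRound)); // prints 4
    std::printf("%zu\n", offsetof(MyParamsT<BigRounder>, valueToRound));   // prints 64
    return 0;
}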

From the standpoint of somebody building an engine layer to wrap parameter setup, having the offset of the valueToRound field change based on what is stored into the rounder field seems sub-optimal. In practice, an engine is going to add a level of indirection so that they store the bytes for the current value bound to rounder in a separate buffer, and then when it comes time to "flush" that state they will marshal things into a single linear buffer.

That kind of marshaling work is actually made harder by laying out interface-type fields inline, because there are more individual copies to make, and the programmer has to account for when an interface-type field (or one that contains an interface type) breaks a contiguous "run" of ordinary fields.

The other extreme is to treat interface-type fields much like resource fields, so that they will always be scalarized out of any containing aggregates, and we compute the size of a type in terms of the number of bytes of ordinary storage it always contains, plus a number of interface-type "slots" that it consumes.

Under this latter model, an application can store the state of a parameter block as a single buffer for the ordinary bytes, plus indirect pointers to the bound object in each of its interface-type slots. When it comes time to "flush" the state, they would simply need to concatenate the ordinary data buffer with the data for each of the interface-type slots in order (handling alignment, etc. along the way).
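A minimal C++ sketch of that flush step (the alignment rule and 16-byte default here are assumptions for illustration, not Slang's actual layout rules):

#include <cstddef>
#include <cstdint>
#include <vector>

using Blob = std::vector<uint8_t>;

// Round an offset up to a power-of-two alignment.
static size_t alignUp(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Concatenate the ordinary bytes of a block with the bytes currently bound to
// each of its interface-type slots, in slot order.
Blob flushBlock(const Blob& ordinaryData, const std::vector<Blob>& slotData,
                size_t slotAlignment = 16 /* assumed, not Slang's actual rule */)
{
    Blob result = ordinaryData;
    for (const Blob& slot : slotData)
    {
        result.resize(alignUp(result.size(), slotAlignment), 0);
        result.insert(result.end(), slot.begin(), slot.end());
    }
    return result;
}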

It seems like our current implementation approach to specialization of interface types will also benefit from scalarization, so that is another mark in its favor.

The biggest down-side to scalarization is that it would not be a good fit for a typical CPU-style approach where an interface/existential type would always occupy a fixed amount of storage because the indirection is ultimately "baked in."

Unless people weigh in with strong opinions, I'm probably going to assume the scalarization-based approach to layout as I keep bringing up interface-type parameters.

tangent-vector commented 5 years ago

PR #886 takes a stand on this issue and implements the scalarization approach. I think it will turn out to be the easiest option for applications to support.

tangent-vector commented 5 years ago

@csyonghe Can I get your input on this topic?

For setup, imagine a user writes something like:

// interfaces for lights and materials
interface ILightEnv { ... }
interface IMaterial { ... }

// per-frame parameters will go in a block
struct FrameParams
{
    ILightEnv lightEnv;
    float time;
    Texture2D somePerFrameTexture;
}
ParameterBlock<FrameParams> gPerFrame;

struct ObjectParams
{
    IMaterial material;
    float4x4 modelView;
}
ParameterBlock<ObjectParams> gPerObject;

The question is what layout we should compute for those global parameters, based on what gets plugged in for the ILightEnv lightEnv and IMaterial material parts.

The model I've been implementing is that the front-end can scan the declarations in a program and determine the number of "existential type parameters" implied by its declarations. It recursively scans the type of each parameter (through fields, etc.) to find leaf parameters with interface types, and each of those counts as one existential type parameter "slot." Then the user can use the API to plug in concrete types for each slot.
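A toy sketch of that recursive scan (the TypeInfo structure is a stand-in for the compiler's real type representation, not the actual Slang front end):

#include <vector>

// Stand-in for the front end's view of a type: either an interface type, or an
// aggregate whose fields can be walked recursively.
struct TypeInfo
{
    bool isInterface = false;
    std::vector<TypeInfo> fields;
};

// Every leaf parameter/field with interface type counts as one existential slot.
int countExistentialSlots(const TypeInfo& type)
{
    if (type.isInterface)
        return 1;
    int slots = 0;
    for (const TypeInfo& field : type.fields)
        slots += countExistentialSlots(field);
    return slots;
}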

So a user might plug in:

struct MyLightEnv : ILightEnv
{
    TextureCube envMap;
    float4 ambient;
}

struct MyMaterial : IMaterial
{
    Texture2D diffuseMap;
    float2 uvScale;
}

Once we know what to plug in, the problem is what layout we should give to those parameters. The layout needs to deal with a few goals/constraints:

The mental model I've been appealing to so far in the implementation work is that a field with interface type is conceptually a "pointer" to data that needs to be stored somewhere else. That means that all interface-type fields have a fixed size, no matter what concrete type gets "plugged in" to them. In practice, those "pointers" are actually zero-byte fields.

That means that for a type like FrameParams:

struct FrameParams
{
    ILightEnv lightEnv;
    float time;
    Texture2D somePerFrameTexture;
}

The time field is always at byte offset zero, and the somePerFrameTexture field is always at a relative offset of zero texture registers from the start of the value.

Of course, if we are specializing lightEnv to MyLightEnv, then we need a place to store its data. We can think of this as something a bit like the "legalization" we already do for resources. So if you had:

FrameParams gFrameParams;

We would think of it being translated into something like:

struct FrameParams_stripped { float time; Texture2D somePerFrameTexture; }

FrameParams_stripped gFrameParams;
ILightEnv gFrameParams_lightEnv;

And then we specialize gFrameParams_lightEnv based on MyLightEnv:

FrameParams_stripped gFrameParams;
MyLightEnv gFrameParams_lightEnv;

So the fields of MyLightEnv end up getting stored "out of line" from the struct that logically contains them. Of course, this movement of interface-type fields out of their containers needs to proceed recursively, so that nested struct types continue to be laid out as expected.

The approach I'm outlining seems like it also works well for an engine that might have a shader object/component abstraction, since we could have something like:

// over-simplified example of a shader component class
class ShaderComponent
{
    slang::TypeReflection* type;

    std::vector<char> ordinaryData;
    std::vector<ID3D11Buffer*> bRegs;
    std::vector<ID3D11ShaderResourceView*> tRegs;
    std::vector<ID3D11UnorderedAccessView*> uRegs;

    // new array added to handle interface-type fields
    std::vector<ShaderComponent*> subComponents;
};

Given a declaration like that, when we want to create a shader component based on a Slang type (unspecialized), we can use the current reflection information to tell us how many bytes to allocate for ordinaryData, how many entries to allocate in tRegs, etc. The number of interface-type leaf fields is also queryable through the reflection info, and can tell us how many entries to allocate in subComponents. Given the index or name of a field in the Slang type, we can easily find the offsets to use for writing its data into the appropriate array(s). This seems like a good starting point.
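For example, creating a component might look roughly like the sketch below, which reuses the ShaderComponent class above; the TypeSizes struct and its fields are placeholders for the corresponding Slang reflection queries, not the actual API:

#include <cstddef>

// Placeholder for the counts we'd pull from Slang's reflection data for an
// (unspecialized) type; the struct and its field names are made up.
struct TypeSizes
{
    size_t ordinaryBytes  = 0; // ordinary/uniform data
    size_t bRegCount      = 0; // constant-buffer registers
    size_t tRegCount      = 0; // shader-resource-view registers
    size_t uRegCount      = 0; // unordered-access-view registers
    size_t interfaceSlots = 0; // leaf interface-type fields
};

ShaderComponent* createShaderComponent(slang::TypeReflection* type, const TypeSizes& sizes)
{
    auto* component = new ShaderComponent();
    component->type = type;
    component->ordinaryData.resize(sizes.ordinaryBytes);
    component->bRegs.resize(sizes.bRegCount, nullptr);
    component->tRegs.resize(sizes.tRegCount, nullptr);
    component->uRegs.resize(sizes.uRegCount, nullptr);
    // One entry per interface-type slot, filled in once concrete types are chosen.
    component->subComponents.resize(sizes.interfaceSlots, nullptr);
    return component;
}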

It seems that normally an application won't know the concrete types for the subComponents until it is time to bind parameters and/or draw (or if it implements a "builder" pattern for shader components, then it would know the types at the end of the build process).

In my mind there is an operation to "flush" zero or more shader components out to a constant buffer, parameter block, or what-have-you. It goes something like:

That should produce an output buffer/block with a layout compatible with what I've been describing so far, and it seems like a natural way for the application to organize rendering in the presence of shader components that can refer to other shader components.
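One plausible sketch of such a flush, again reusing the ShaderComponent class above (the traversal order is exactly the design question under discussion, so this is illustrative only; alignment handling and the b/u register arrays are omitted):

// Depth-first flush: a component's own state is written first, then the state
// of each sub-component bound to its interface-type slots, in slot order.
void flushComponent(
    const ShaderComponent* component,
    std::vector<char>& outOrdinaryData,
    std::vector<ID3D11ShaderResourceView*>& outTRegs)
{
    outOrdinaryData.insert(outOrdinaryData.end(),
        component->ordinaryData.begin(), component->ordinaryData.end());
    outTRegs.insert(outTRegs.end(),
        component->tRegs.begin(), component->tRegs.end());

    for (const ShaderComponent* sub : component->subComponents)
        if (sub)
            flushComponent(sub, outOrdinaryData, outTRegs);
}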

So far so good, but where I'm getting a bit tripped up is how to handle things at the global scope. One particular class of issues is when a user uses an interface type in a place where some parts of the concrete type can "bleed through" to the surrounding context. For example:

Texture2D tex0;
struct PerFrame { float4 x; ILight light; }
ConstantBuffer<PerFrame> cb;
Texture2D tex1;

void myShader(uniform Texture2D tex2) { ... }

It is reasonable for a user who looks at this to conclude that tex0 through tex2 should get the corresponding t registers, while cb gets register b0. But then if I plug in MyLight for cb.light, what do I expect to happen to the MyLight::envMap texture?

I can see three big options:

  1. The MyLight::envMap texture gets t1, since it is conceptually part of cb. The tex1 and tex2 parameters get bumped, which is potentially surprising.

  2. The MyLight::envMap texture gets t2, so that behaves as if it came at the end of the global scope. The tex2 parameter gets bumped, but in practice entry-point parameters need to be prepared to get bumped by the addition of new global-scope parameters, so this is possibly acceptable.

  3. The MyLight::envMap texture gets t3, so that it comes after all other shader parameters. This leaves the existing layout untouched, but has the property that the location of a global shader parameter now depends on what entry points are being compiled together in the program.

Of those options I think I like (2), but this seems like a tricky design choice.

The same basic problem comes up even if we were to ban interface types nested in ConstantBuffer<T>, since you can still have the problem of the resource usage for the contents of a ParameterBlock<T> bleeding through to the surrounding scope when T requires full register space (e.g., for unbounded arrays of textures).

Providing good reflection/layout information for a field like cb.light becomes tricky in some of these cases (2 and 3), since some of its data (the ambient field) gets allocated inside the parent constant buffer cb, while other parts of its data (the envMap field) get allocated somewhere that is non-contiguous with the parent variable cb. Option (1) for layout makes things contiguous and removes most of the problem, but does so at the cost of having the worst impact on the layout of other shader parameters.

When it comes right down to it, the challenges that interface types create for layout are quite similar to (and seemingly just as bad as) the problems created by global generic type parameters.

Anyway, if you have any thoughts on what we should prefer to do in the various corner cases, I'd be interested to hear them.

csyonghe commented 5 years ago

Originally I thought about allocating registers for each interface-typed field in its own space, but then I realized that spaces themselves are a globally indexed "resource" with the same issue, as long as the indices are assigned in depth-first traversal order (so an allocation at an inner level bleeds into its surrounding context).

I have a rough idea (most likely bad) of using two-dimensional space indices. Let's say we are going to allocate a separate register space for each interface-typed field. The register space for all ordinary global and entry-point parameters is 0. Then we can recursively depth-first traverse the tree and give each interface-typed field a two-dimensional space number (level, index), where level is the number of nested interface levels of the field.

When the application binds resources, it always computes the two-dimensional index in depth-first order, maps that two-dimensional index into a one-dimensional space index using some method (e.g. the first 16 bits for the level and the last 16 bits for the index), and binds the resource to that register space.
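A minimal sketch of that packing, assuming 16 bits each for the level and the within-level index:

#include <cstdint>

// Pack the two-dimensional (level, index) pair into a single register-space index:
// level in the high 16 bits, index within that level in the low 16 bits.
uint32_t packSpaceIndex(uint32_t level, uint32_t indexInLevel)
{
    return (level << 16) | (indexInLevel & 0xFFFFu);
}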

But I admit that this may seem really tricky to users.

csyonghe commented 5 years ago

Hmm, that doesn't really work. Imagine I have

ILight light0;
ILight light1;
void main(){}

and types:

struct ALight : ILight
{
    ILight subLight1;
    ILight subLight2;
    Texture2D tex0;
};

struct BLight : ILight
{
    Texture2D tex1;
};

If I plug in:

light0: ALight                               // space(10000)
{
    sublight1 : ALight                   // space(20000)
    {
         sublight1: BLight              // space(30000)
         {
              tex1;
         }
         sublight2: BLight            // space(30001)
         {
              tex1;
         }
         tex0
    }
    sublight2 : ALight            // space(20001)
    {
         sublight1: BLight        // space(30000)    collide!
         {
              tex1;
         }
         sublight2: BLight         // space(30001)  collide!
         {
              tex1;
         }
         tex0
    }
    tex0;
}

It seems that you will need a multi-dimensional index if you go through this route...

csyonghe commented 5 years ago

I just confused myself. I think it should be fine as long as we can accept the fact that interface-typed fields are assigned a weird space index that is unlikely to collide with existing known global parameter definitions, such as starting from 10000.

In this following case

Texture2D tex0;
ILight light0;
Texture2D tex1;
ILight light1;
void main(uniform Texture2D tex2){}

tex0 will get t0, tex1 gets register(t1), and tex2 gets register(t2); this is what the user expects. However, anything plugged into light0 will start from space100000 and anything plugged into light1 will start from space200000. If members inside light0 require a space, it will be assigned from space100001 onward. When you plug something into light0 or light1, that thing must be a fully specialized type, so we can always just lay out the fully specialized type as a normal type, without worrying about nested interface fields, because there will be no such thing.

tangent-vector commented 5 years ago

I'm not sure that anything that relies on heavy use of spaces is really practical. First you have targets like D3D11 that have no notion equivalent to spaces, and second you have targets like Vulkan where the equivalent of spaces (sets) has a fairly tight limit.

Anything you can describe in terms of giving things spaces in a two-dimensional fashion can then be turned into a total ordering on the parameters/resources, so we should probably be thinking in terms of what we want the total order to be.

tangent-vector commented 5 years ago

... that thing must be a fully specialized type, and we can always just layout the fully specialized type as a normal type, without worrying about nesting interface fields, because there will be no such thing.

I think this is mostly true. What I'm imagining is that at the API level, if you have a type like ALight with two interface-type fields in it, you'd be able to use an API call like:

auto specializedALight = spSpecializeType(aLightType, { bLightType, bLightType }); // pseudo-code...

The resulting specializedALight type can then be used as a generic type argument, or to fill global/entry-point interface-type holes. We would give an error if you ever try to plug in a type that isn't fully specialized when specializing other types/entry-points.

In terms of implementation, this would create something internally that is a BindExistentialsType<BaseType, Arg1, Arg2, ...> which represents taking the base type and filling in any interface-type holes recursively using the given arguments (plus their interface conformance witnesses). During layout, a BindExistentialsType<B, A1, A2, ....> would (in my mind) lay out as a B (minus its nested interface-type fields), followed by an A1, then an A2, etc. So the layout of B is always consistent, no matter how it gets bound, but the layout of a BindExistentials<...> type is always nice and fixed and doesn't affect the layout of anything else.

csyonghe commented 5 years ago

The problem we are having with total ordering in one dimension is that it always affects the indices globally, so there have to be compromises on which assignments can be preserved and which assignments must be queried after specialization.

If we can accept the fact that the register assignments are not necessarily contiguous, we can always compute the worst-case number of registers used by all existing non-existential entry-point parameters, say 5 textures, and start allocating texture registers for existential types from register(t5). This will make sure that all non-existential parameters are allocated the way a user would expect without considering existentials, and will not have the inconvenient issues of option 3.

csyonghe commented 5 years ago

For example, if we have

Texture2D tex0;
struct PerFrame { float4 x; ILight light; }
ConstantBuffer<PerFrame> cb;
Texture2D tex1;

void myShader(uniform Texture2D tex2) { ... }
void myShader2(uniform Texture2D tex2, uniform Texture2D tex3) { ... }

If we plug in EnvLight to cb.light, then cb.light.envMap will get register(t4).
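A sketch of that worst-case computation (the helper name is made up); for the example above it gives 2 global textures plus max(1, 2) entry-point textures, i.e. the first existential texture lands at register(t4):

#include <algorithm>
#include <vector>

// Worst-case rule sketched above: existential texture registers start after the
// global textures plus the largest entry-point texture count in the program.
int firstExistentialTextureRegister(
    int globalTextureCount,
    const std::vector<int>& entryPointTextureCounts)
{
    int worstEntryPoint = 0;
    for (int count : entryPointTextureCounts)
        worstEntryPoint = std::max(worstEntryPoint, count);
    return globalTextureCount + worstEntryPoint;
}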

csyonghe commented 5 years ago

Oh, I guess what you were saying is that depending on how many files are currently in the same linkage, this "worst case number" can change?

I am not sure I understand why this is an issue for the users, since they can always query the reflection data before specialization to find out how many non-existential textures that entry point is using, without having to query any after-specialization reflection data.

csyonghe commented 5 years ago

Yeah, now I am starting to get confused at what the issue you mentioned for option 3 is, and why it is undesirable.

tangent-vector commented 5 years ago

The problem we are having with total ordering in one dimension is that they always affect the indices globally...

Right, but the problem is that no matter what we can't leave large gaps in the layout for D3D11, Vulkan, or pretty much any target that isn't D3D12. And then even on D3D12, we can leave large gaps in the spaces, but at the end of the day the application needs to collapse all those gaps out when filling in their descriptor tables, so that the in-memory layout is consistent with some linearization.

It seems to me that the most logical way to handle things is to define a linearization scheme (in terms of how it visits parameters and their dangling interface-type data), and then have both our compiler and application code implement consistent linearization approaches (e.g., "whenever filling out a descriptor table/set based on some shader components, here is how to traverse them...").

Yeah, now I am starting to get confused at what the issue you mentioned for option 3 is, and why it is undesirable.

It may actually be the right option; that's what I'm trying to feel out.

The issue is if you have something like:

ConstantBuffer<IFoo> myFoo;
Texture2D gTex;

void someComputePass(uniform Texture2D entryPointTex) { ... }

If we assume that the IFoo gets plugged in using a type that uses one Texture2D slot, then the three layout options are:

  1. myFoo:t0, gTex:t1, entryPointTex:t2. The layout for everything can change when we plug in a new concrete type, but the linearization is conceptually easy to describe: depth-first pre-order over all parameters.

  2. gTex:t0, myFoo:t1, entryPointTex:t2. The layout for entry-point parameters can get shifted, but we have the guarantee that two entry points compiled using the same global parameters will agree on the layout for those global parameters, even if the entry points aren't part of the same program.

  3. gTex:t0, entryPointTex:t1, myFoo:t2. The layout for non-interface parameters doesn't shift at all when changing what concrete types we plug in.

The concrete problem for option (3) is if the user also declared:

ConstantBuffer<IFoo> myFoo;
Texture2D gTex;

void someOtherComputePass() { ... }

Now under the rules for (3), the someOtherComputePass program puts myFoo at t1, and so we have to re-bind parameters related to the global scope when switching between entry points, even if they agree on what is bound at the global scope.

The thing about the global scope is that you already have to deal with the possibility of it shifting around, and we've seen that before when we've done generics-based specialization in Falcor. If you have a shader like:

// MyShader.slang

type_param L : ILight;
ParameterBlock<L> gLight;

void fsMain(uniform Texture2D myTex) { ... }

Then an initial compile of MyShader.slang will tell you myTex:t0, but then what happens if the user binds L to use a type declared in a different file:

// LTCAreaLight.slang

Texture2DArray gLTCLookupTable;

struct LTCAreaLight : ILight { ... }

The problem is that the unspecialized shader didn't have the gLTCLookupTable parameter at all, but the specialized shader must have it. If we keep a consistent rule that global parameters get laid out before entry-point parameters, then we must put gLTCLookupTable in t0 and bump myTex to t1.

It seems almost impossible to avoid the problem that specialization can link in new modules, and those new modules can have new global parameters (of course, to go really deep, what happens if specialization requires us to link in new modules that bring along new parameters that themselves require specialization...).

So it seems like we need to decide between:

a. We consistently enumerate the global scope of a program (unspecialized or specialized) before the entry-point scope(s). As a consequence, steps that change the global-scope layout can affect entry-point layout.

b. We find a way to enumerate the parameters of a specialized program in a way that is always consistent with the unspecialized program, so that entry-point parameters in the unspecialized program come before global parameters that were introduced as part of specialization (either by requiring new modules to be linked in, or by filling in interface-type holes).

Option (3) for dealing with interface-type parameters seems consistent with option (b) here, while option (2) and option (a) are mutually consistent.

It seems like neither of us wants to go with option (1).

Note that right now I'm focusing on layout for resource types and for spaces/sets. Hopefully we are on the same page about the Right Way to deal with uniforms, but when you talk about allocating a full space to each interface-type parameter, I start to wonder if we aren't aligned (are you advocating that each interface-type parameter amounts to a separate ConstantBuffer?).

tangent-vector commented 5 years ago

To open up an unrelated can of worms: the compiler needs to start implementing some kind of coherent plan around how entry-point parameters get dealt with, and what the role of a Program is (currently it is an abstraction in the implementation, but is not surfaced through the API).

We probably need to take a stand that if a user compiles multiple entry points together in a single request, then they are implicitly declaring an intention to use those entry points together in a single program/pipeline. That means that compiling a vertex/fragment shader pair together makes sense (and you need to compile them together to get reasonable layout results), but compiling multiple compute kernels together doesn't make sense.

Then we can say that (ignoring all the interface/generic stuff), the layout for a program is:

That approach already has the property that the parameters of a single fragment shader entry point could be laid out at different locations globally depending on what vertex shader you pair it with, but that is fine. The main thing we'd want to guarantee is that the parameters of a given entry point are always contiguous, so that a pre-allocated shader component/object that encapsulates entry-point parameters can be re-used across different passes with the same entry point.

For RT shaders, we need to tweak that a bit, since the entry-point parameters of RT entry points should probably be assumed to be in the "local root signature" so that they should be allocated as overlapping with other entry points.

All of this gets tied up in the question of how specialization affects layout, because some of the messiest questions for specialization (our options (2) and (3)) end up being around how it should interact with per-entry-point parameter layout.

tangent-vector commented 5 years ago

Another thing to keep in mind here is that the relative ordering of sets on Vulkan has performance implications (drivers are supposed to assume that lower-numbered sets represent less-frequently-varying parameters), so if a specialized parameter somehow maps to multiple sets, we might want to keep those sets in the same relative order to other sets, even if it means shifting sets later in the overall order. That should be fine for an application in practice, because it should probably be thinking about sets in terms of linearization and/or a stack model, rather than treating them as things with random access.

csyonghe commented 5 years ago

I think it is fair to assume that all entry points compiled in a single compilation request belong to a single pipeline, and to lay out the parameters such that all specialized parameters come after ordinary (entry-point and global) parameters. Then as long as the same set of entry points is in the compilation request, the layout we generate for unspecialized and specialized entry points should be consistent.

For me the reasonable workflow is to compile an unspecialized program (with all modules that may be used to specialize the program linked), get the layout for all entry points in the program, and expect that all ordinary parameters in this layout are valid and final. Then when we generate specialized code, all the engine needs to care about is generating a layout for the parameters that are used to specialize the code, and stitching that onto the end of the layout of the unspecialized program. And it seems that option 3 matches this workflow best.

tangent-vector commented 5 years ago

Would you advocate that we follow the same logic when specialization requires a new module to be "linked" into the program, and brings along new global shader parameters? That is, if I specialize some ILight parameter to LTCAreaLight then should the global lookup table that light type requires come after all the entry-point parameters as well, just to keep things consistent?

EDIT: What you are advocating for matches the precedent we tried to have with generic type parameters, but I actually found that to be kind of annoying in practice, because as a user it meant that the order I bind my parameter blocks in on the host side needed to change when I switched a ParameterBlock<ConcreteType> to a ParameterBlock<TypeParameter>.

tangent-vector commented 5 years ago

Oh, wait, I just noticed you said:

For me the reasonable work flow is to compile a unspecialized program (with all modules that may be used to specialize the program linked), ...

So you are constraining the application to proactively put all of its loaded modules of utility code into the mix in case an entry point needs to refer to them. That is of course a very robust strategy, but I'm not sure it is scalable for very large shader codebases, or codebases where a lot of modules introduce global shader parameters.

csyonghe commented 5 years ago

Yes, that is what I was advocating. What you mentioned is indeed a valid concern that I had not given much thought to. But logically I think this is a good order:

  1. Ordinary global parameters that are inside the initial linkage containing the unspecialized entry-point
  2. Ordinary global parameters that are brought in to the linkage at specialization time
  3. Existential parameters that are provided in a specialization request

This is because the additional parameters that are brought into the linkage at specialization time are a form of specialization, and therefore they should be laid out after general entry-point parameters.

However, I think we need some mechanism to clearly convey this thinking to the user. I feel that it will create a lot of confusion because some global parameters are laid out first but others are not. If we choose to go down this route then it might be wise to introduce some additional syntax to clearly indicate that the LTCAreaLight lookup texture will only be used when LTCAreaLight is in use, and will come after other ordinary parameters.

tangent-vector commented 5 years ago

I think part of the challenge here is that we need to juggle two priorities when it comes to how users see the rules:

A. The rules need to be easy to understand
B. The rules need to be easy to implement

Part of the problem is that (A) doesn't always align well with (B).

Rules that are easy to understand mostly amount to sweeping top-down declarations of invariants/constraints on the expected behavior (e.g., "all X appear before all Y"). Rules that are easy to implement are usually locally actionable (e.g., "to process an X, first do Y, then recursively process all the Zs").

I think my leaning toward option (2) in the earlier lists, over option (3) is partly motivated by an assumption of what would be easier/harder for an application to implement. In particular, I'm thinking about the constraint of applications that might want to do reflection on unspecialized types/entry-points, but not on specialized types/entry-points.

If we can describe the entire process in terms of "if you have all these shader components you want to bind, here is the algorithm to use when flattening them into a linear sequence of bindings" then it becomes easy to implement, and the absolute offset of any particular component in that linear sequence is less important than the procedure that produces it.

A policy that lays out interface-type fields after everything else is implementable with these kinds of rules, but it requires applications to maintain a kind of "backlog" of stuff that is waiting to be bound after everything else, so that "binding" a shader component binds parts of it, and puts other parts onto the backlog.

From an application implementation standpoint, there is a certain appealing simplicity if a ParameterBlock<Foo> can be bound to the pipeline state in one go, and never has to be revisited when dealing with parameter blocks later in the declaration order.

I want to reiterate a question from earlier though: how do you think ordinary/uniform data should fit into this? My assumption has been that if you put an interface-type field in a ConstantBuffer or ParameterBlock, then its ordinary/uniform data will get laid out in that buffer/block, just after the ordinary/uniform data for the non-interface fields. Similarly, I assume that if you have a ParameterBlock<Foo> then any registers/bindings used by the interface fields of Foo will get folded in that parameter block (and hence need to be bound in the same descriptor table/set) rather than implicitly spilling out into the global scope (where they'd get placed after all the other parameters using either policy (2) or (3)).

csyonghe commented 5 years ago

I agree that the behavior of existential types should be similar to generics in that ordinary fields defined in an existential type get folded into its parent ParameterBlock; this allows applications that use the bindless pattern to work without much friction.

The layout issue gets complicated enough that even I find it hard to sort out how it is going to work for each API. So I see the value in keeping it as simple as possible, as long as it still allows an efficient workflow.

I think the biggest challenge in adopting existential types is to clearly figure out the binding logic for them. This feature cannot be successful if it's not easy to figure out how to bind them. ParameterBlock is a good feature because it makes the binding problem easier to think about. However, I don't have the same feeling about existentials, since the amount of mental overhead needed to think about their binding logic is huge, especially when they can be used almost everywhere. Admittedly, global generic types have the same issue, but when limited to just the top-level parameters it is still manageable. When I am writing my engine, I always try to keep things simple by avoiding resource types in nested arrays/structs and generic types, because I really don't want to dig deeply into these things. As a compiler we have to think through these corner cases, but I do think that supporting the bindless pattern is the ultimate answer to these issues.

tangent-vector commented 5 years ago

I think we are on the same page that in many ways all this layout trickery is just a workaround for not having first-class bindless support on our targets. If we had bindless texture/sampler handles then we could at least collapse all of the resource stuff down into the ordinary/uniform case, which already makes things a lot easier. Then if we add something like GL_EXT_buffer_reference, we can use device-memory pointers to refer to the stuff in interface-type fields instead of having to represent the "out of line" storage via a workaround.

The tricky part is making something in the interim that proves out the value of the feature. I think I'm won over to your side on approach (3) simply because it is the closest to where things would be if we had bindless/pointers, and thus is as close as possible to what we want users to write in the future.

(It might be an interesting exercise for somebody to implement this functionality on top of the Metal API/language, since it has more direct support for the kind of indirection we want)

I do think that applications that have a ShaderComponent abstraction close to what I've outlined earlier in this thread can do quite a bit to insulate themselves from the complexity, by just writing code that works systematically in terms of that abstraction. The onus is on us to show that is the case, though. Once I have the basics working I may try to port the current Slang samples to use the new approach, just to see if it can be as easy as I hope.

csyonghe commented 5 years ago

I would argue that inclusion of optional-feature-specific parameters should be made more explicit. It still feels strange to me to bring in additional global parameters when specializing. If we know that some optional features might be brought into an entry point, it makes sense for the entry-point definition to reserve a binding slot for that optional feature. I would prefer to use something like:

interface ILight
{
    associatedtype TStaticParameters;
    float3 computeLighting(TStaticParameters staticParams, BRDF f, ...);
}
ParameterBlock<ILight> gLight;
ParameterBlock<ILight.TStaticParameters> gLightStaticParams;

float3 main()
{
      gLight.computeLighting(gLightStaticParams, ...);
}

tangent-vector commented 5 years ago

At this point we are talking through things we've discussed before, but I guess having some of this material visible on GitHub is a good thing.

The idiom you are describing with a sort of "shared data" associated type is certainly a reasonable workaround in the presence of generics:

type_param L : ILight;
struct MyLightEnv { L lights[10]; L.StaticParams staticParams; }
ParameterBlock<MyLightEnv> gLightEnv;

In the generics case, it can be made clear that the staticParams field holds the data needed by lights.

Unfortunately, as far as I can tell this starts to break down in the interface case. In your example above, what is guaranteeing to the front-end that the concrete type of the data in gLight matches the concrete type of the data in gLightStaticParams? If I had an array of ILight where each entry might have a different type (using tagged-union dispatch), how would I be sure to associate each entry with the correct static parameter data?

Even if we find a way around those issues, it is clear that the explicit-ness of having the extra TStaticParameters associated type comes at a somewhat high cost, since all the parts of the system have to start passing it around. If a lighting system is designed without this feature, and then suddenly one light type needs static parameters, there will be a high cost to retrofitting in this kind of "static params" feature.

If we look at how a host application programmer would deal with the same kind of issue, they'd just pick one of the following two strategies:

The first of these options is the one that an LTC area light seems to want/need, and it would be good to think about how an application can be set up to support "true globals" in shader code.

The second option requires bindless and/or pointers to do well. We could conceivably have a Shared<T> type that behaves a bit like a const T&, so that we can have multiple shared references to it but only pay for the storage once per unique referent, but implementing it with the same kinds of tricks we have to do for interface types seems to be on the border of intractability.

tangent-vector commented 5 years ago

At this point I think I have a mostly-working branch that implements option (3) from the discussion we've had (everything related to concrete types plugged in for interfaces gets laid out after everything else, including entry-point parameters). I'm going to try and get that work checked in soon, but the more I think about this the more I'm convinced that there is a better way to approach the whole problem space.

There is an abstraction in the compiler now called a Program that isn't being exposed in the public API, but might be at some point. Conceptually, the API for working with Programs looks something like:

Program* createProgram();
void addModule(Program* program, Module* module);
void addEntryPoint(Program* program, EntryPoint* entryPoint);

A Program tracks a list of modules that the program depends on. The addModule function adds a module to that list, but first adds all the modules that module depends on, recursively.

The Program also tracks a list of entry points. The addEntryPoint function adds an entry point, but first invokes addModule on the module that defines the entry point.

This definition makes it easy to enumerate the shader parameters in a total order:

If we want to specialize a program, then we'd like to start by creating a copy of the unspecialized Program, cloning its lists of modules and entry points, and then continue from there.

Program* cloneProgram(Program* program);

// go on and add more modules and/or entry points...

For global shader parameters, this would guarantee that we don't change the layout of existing parameters, since we don't change their position in the total order (I'm ignoring all the issues that arise around adding a new module with an explicit register binding).

The problem is that with the representation above, adding a module could change the position of an entry point in the total order, and thus change its layout.

The somewhat obvious choice at this point is to say that the main job of a Program is to record the total order. This wouldn't change the public API, but rather the behind-the-scenes representation. A Program would now track an ordered list of "items" where each item is either a module or an entry point. The addModule and addEntryPoint operations continue to work just as before. Now the total order is defined as:

This generalized form of things allows us to clone a program, and add more stuff with the guarantee that nothing that was laid out so far will be affected (modulo explicit register bindings).

The list of "items" is, at some level, a list of values to be allocated in the memory image of the shader program, each with an associated type. The "type" of a module is conceptually a struct that contains all its global-scope shader parameters, while the "type" of an entry point is a struct over its parameters. In this way, the addModule and addEntryPoint operations could return an index that can be used to look up the starting offset(s) for parameters in the module/entry-point, so that reflection information computed from the module/entry-point in isolation can be applied to a program that uses it:

// return value is an ID that can be used to look up reflection info later
int addModule(Program*, Module*);
int addEntryPoint(Program*, EntryPoint*);

We can then provide users with a general-purpose routine to carve out part of the space in a program for a value of arbitrary application-specified type:

int addGlobalParameter(Program* program, Type* type);

Just as with modules and entry points, the return value is an ID that can be used to look up layout information on where the new parameter got placed. The allocated parameter is just another "item" in the program, that gets laid out just like everything else.

Of course, no shader code in the program would actually use such a parameter, so it might initially seem silly to define one. The missing link here is that we could add an operation to "bind" an interface-type parameter in the program to a global parameter by its ID:

// Naming intentionally highlights similarity to `cgConnectParameter`
void connectParameter(
    Program* program,
    int interfaceTypeSlotID, // index of an interface-type parameter
    int globalParamID); // value returned by addGlobalParameter

In this way, the user can specify that an interface-type shader parameter that has already been laid out should be treated as if it points to some piece of global parameter data they allocated by hand.
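Hypothetical usage of the API sketched above (the module, entry-point, and type handles here are made-up placeholders, and the API itself is only a proposal):

Program* program = createProgram();

// Items get laid out in the order they are added, so these positions are stable.
int mainModuleID = addModule(program, mainModule);           // global parameters first
int fsMainID     = addEntryPoint(program, fsMainEntryPoint); // then entry-point parameters

// Carve out storage for the concrete light type we intend to plug in, then
// point the ILight interface-type slot at that storage.
int lightDataID = addGlobalParameter(program, ltcAreaLightType);
connectParameter(program, /*interfaceTypeSlotID:*/ 0, lightDataID);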

An API like this would put the application in complete control over the in-memory layout of a program, based only on the order in which they add different items to it. It would be an easy matter for an application to ensure that certain pieces of a layout are consistent between different kernels in their program (e.g., making sure that all RT kernels used together agree on the global scope and its layout).

There's a lot of detail I'm leaving out here (e.g., there would need to be a way to allocate a parameter that lives inside some pre-existing buffer or parameter block...), but the basic idea seems sound as a way to give programmers complete control over the layout, and rely less on them understanding some very detailed rules about how we lay out types with interfaces in them after specialization.

Anyway, I'm going to plow ahead on my current implementation course, rather than scrap my work and try again with the approach I've outlined here. At some point I'd like to step back and decide whether this kind of revamp would be worth it.

natduca commented 9 months ago

@csyonghe should we keep or close?

csyonghe commented 9 months ago

Let's close this for now, since existentials with non-trivial layouts aren't a feature requested by any of our users and this is extremely tricky to get right. With bindless this will not be needed, and we may decide to never support them.