microsoft / win32metadata

Tooling to generate metadata for Win32 APIs in the Windows SDK.
Other
1.34k stars 116 forks source link

d3d12: Problem with functions that return structures by value #636

Closed marler8997 closed 1 year ago

marler8997 commented 3 years ago

Original Issue: https://github.com/marlersoft/zigwin32/issues/6

I've coped the original issue here which was created by user @michal-z

The original d3d12.h that comes with Windows SDK has wrong C prototypes for functions that return structures. Fixed headers are released by Microsoft here: https://github.com/microsoft/DirectX-Headers/blob/main/include/directx/d3d12.h

Consider ID3D12Heap::GetDesc method, it should be defined like this:

        DECLSPEC_XFGVIRT(ID3D12Heap, GetDesc)
        D3D12_HEAP_DESC *( STDMETHODCALLTYPE *GetDesc )( 
            ID3D12Heap * This,
            D3D12_HEAP_DESC * RetVal);

The proper C prototype takes a pointer to RetVal as a second parameter and also returns the same pointer. This applies to all functions that return structure by value.

kennykerr commented 2 years ago

I believe this is just something you have to deal with.

https://devblogs.microsoft.com/oldnewthing/20220113-00/?p=106152

The windows crate for Rust already detects methods returning structs and transforms them in a way that is consistent with the Visual C++ compiler. For example:

pub struct ID3D12Heap_Vtbl {
    pub base: ID3D12Pageable_Vtbl,
    pub GetDesc: unsafe extern "system" fn(this: *mut ::core::ffi::c_void, result__: *mut D3D12_HEAP_DESC),
}
impl ID3D12Heap {
    pub unsafe fn GetDesc(&self) -> D3D12_HEAP_DESC {
        let mut result__: D3D12_HEAP_DESC = ::core::mem::zeroed();
        (::windows::core::Interface::vtable(self).GetDesc)(::core::mem::transmute_copy(self), &mut result__);
        result__
    }
}
michal-z commented 2 years ago

Note that Microsoft DirectX-Headers project (https://github.com/microsoft/DirectX-Headers/blob/main/include/directx/d3d12.h) fixes this issue and declares function correctly.

C prototype is:

        DECLSPEC_XFGVIRT(ID3D12Heap, GetDesc)
        D3D12_HEAP_DESC *( STDMETHODCALLTYPE *GetDesc )( 
            ID3D12Heap * This,
            D3D12_HEAP_DESC * RetVal);

C++ prototype is:

        virtual D3D12_HEAP_DESC STDMETHODCALLTYPE GetDesc( void) = 0;
kennykerr commented 2 years ago

Yes the C (not C++) declaration may be wrong in the Windows SDK. My point is just that the metadata is correct as it is based on the C++ declaration.

michal-z commented 2 years ago

I see, thanks for your input.

marler8997 commented 2 years ago

Yes the C (not C++) declaration may be wrong in the Windows SDK. My point is just that the metadata is correct as it is based on the C++ declaration.

So the goal of the win32metadata project is to represent what's in the C/C++ headers, not what the actual API is? So if there is a mistake in the headers, the win32metadata project is meant to keep the mistake in. If this is the case then we will need to walk back some of the other decisions that have been made such as fixing SAL annotations and mistyped parameters in certain functions.

damyanp commented 2 years ago

The actual API is correctly described by the C++ code in the header. The old C declarations (that returned the structs) were broken and would not work if you tried to use them to call those APIs.

kennykerr commented 2 years ago

As Damyan said, the metadata correctly reflects the C++ declaration which is also correct.

marler8997 commented 2 years ago

Yeah I think I'm confused. Re-reading the thread isn't clearing it up for me so ignore me for now. Not sure what the next step is here.

kennykerr commented 2 years ago

Practically it means that if you are building a projection, you need to detect virtual functions that return a user-defined type (e.g. C-style struct) and transform the underlying ABI signature, unless you're targeting the Visual C++ compiler in which case you need not do anything special. The transformation involves removing the return type and injecting a pointer to that type as the first parameter, following this but before any explicit parameters.

marler8997 commented 2 years ago

If I understand, you're saying that unless you're creating a C++ projection that is only meant to be used by the Visual C++ compiler, then you need to modify the API. A few questions come to mind. Why is this? Does the Visual C++ compiler also make this correction internally when it encounters these APIs? Also, if every projection needs to do this transformation, how do we know when it needs to be done? Is it a particular set of namespaces in the metadata and only for COM functions, or maybe this needs to be applied to ALL COM functions across the board? Also, what exactly is a "user-defined" type? You mentioned a C-style struct as an example but what else would qualify?

damyanp commented 2 years ago

The APIs in question here are defined using COM IDL (https://github.com/microsoft/DirectX-Headers/blob/main/include/directx/d3d12.idl). This has a well-defined ABI that also maps well to virtual functions using stdcall. I'm pretty sure all other C++ compilers on Windows correctly implement stdcall.

The d3d12 header files that are read by win32metadata are generated by midl. It generates two versions of each interface - the C++ one and the C one. The C++ version uses classes and virtual functions to define the interface, while the C version uses a struct with pointers to functions to define the vtable. The problem is that there was a bug in midl where it assumed that stdcall function pointers and virtual functions have the same ABI when, in fact, they don't.

For a function defined in the IDL as, say:

D3D12_HEAP_DESC GetDesc();

Then in the C++ version midl would generate (correctly):

virtual D3D12_HEAP_DESC STDMETHODCALLTYPE GetDesc( void) = 0;

And for the C version it would generate:

        D3D12_HEAP_DESC ( STDMETHODCALLTYPE *GetDesc )( 
            ID3D12Heap * This);

However, this is a bug because the __stdcall ABI is different for function pointers vs virtual functions, and the end result is that the C function pointer is incorrect. This bug has now been fixed and so newer version of D3D12.h contain:

        D3D12_HEAP_DESC *( STDMETHODCALLTYPE *GetDesc )( 
            ID3D12Heap * This,
            D3D12_HEAP_DESC * RetVal);

The relevant part of the docs:

On x86 platforms, all arguments are widened to 32 bits when they are passed. Return values are also widened to 32 bits and returned in the EAX register, except for 8-byte structures, which are returned in the EDX:EAX register pair. Larger structures are returned in the EAX register as pointers to hidden return structures. Parameters are pushed onto the stack from right to left. Structures that are not PODs will not be returned in registers.

This issue comes up more often in DirectX related APIs since these return large structs from methods that by definition can't fail. A more common pattern COM APIs is to always return an HRESULT and take an out-pointer. This, combined with very few people using the C-style interfaces, helps explain why this midl issue went unnoticed until relatively recently.

marler8997 commented 2 years ago

That was a great explanation, thanks @damyanp. To make sure I understand, the problem comes from incorrectly interpreting the ABI of "virtual functions". Problems arise because virtual functions follow the stdcall ABI in most cases, but when you return a large enough struct it diverges. Also, you can represent a virtual function ABI that has a large return value with a stdcall function signature by making the return value an inout pointer argument.

Assuming I've got it right, is this the only known case where the virtual function ABI differs from stdcall?

A quick google reveals stdcall information is here: https://docs.microsoft.com/en-us/cpp/cpp/stdcall?view=msvc-170 It also looks like the COM methods probably use what Microsoft calls "__thiscall"? https://docs.microsoft.com/en-us/cpp/cpp/thiscall?view=msvc-170

They both say the "Callee" cleans up the stack and parameters are pushed onto the stack from "right to left". stdcall says return values are handled the same as "cdecl" but I can't seem to find where it says how "thiscall" handles return values.

P.S. For Zig one option would be to add another calling convention for "thiscall"

riverar commented 2 years ago

The metadata describes this as:

D3D12_HEAP_DESC GetDesc();

Looks like it should be:

DECLSPEC_XFGVIRT(ID3D12Heap, GetDesc)
D3D12_HEAP_DESC *( STDMETHODCALLTYPE *GetDesc )( 
  ID3D12Heap * This,
  D3D12_HEAP_DESC * RetVal)

We can fix that. Not sure why we got so off track on this one.

kennykerr commented 2 years ago

Not sure there's value in "fixing" this in metadata unless you can catch all such cases. In Rust I just detect this automatically and transform the ABI accordingly.

riverar commented 2 years ago

I think there is value is in baking in the necessary secret C++ compiler ABI knowledge needed to use these APIs correctly (on x86) so that other folks don't stumble on this. If no one else agrees though, then I'm happy to close this issue and move on.

marler8997 commented 2 years ago

@kennykerr what criteria do you use in the rust bindings to know when to include the return type as a reference parameter?

It looks like there are some BeefLang bindings. The author of those says they will only transform the return type if the type is:

  1. not a native typedef
  2. not an enum
  3. not a COM type

Is this the same criteria that the rust bindings use? It seems to me that the correct final solution for this is that each language support the thiscall ABI, but, I cant' seem to find where it's defined.

P.S. I wonder if LLVM has the ABI defined?

kennykerr commented 2 years ago

I limit it to those methods that return a type that is both a struct and doesn't have the NativeTypedef attribute. In my parser, structs automatically exclude enums and interfaces.

AArnott commented 2 years ago

Wow. A lot to take in there. It seems that the metadata could be changed for these problematic methods so that their signatures match what projections must declare so things work by default. Maybe that would even work for C++, I don't know.

marler8997 commented 2 years ago

As I understand, the metadata already has all the information needed. Namely, all "virtual functions" (which I think is equivalent to "all COM methods") use the thiscall ABI rather than the stdcall ABI.

Take the following example signature in the metadata:

class SomeComInterface {
    SomeStruct foo(int a, int b);
}

If your language has native support for the thiscall ABI, then you're done, i.e. for C++ we have:

// C++, correct because virtual functions use "thiscall" ABI
class SomeComInterface {
    virtual SomeStruct foo(int a, int b);
};

If your langauge doesn't understand the thiscall ABI, then you can use the stdcall ABI and fixup the function signature to behave like thiscall, i.e. here's the same method in C:

// Signature in C which doesn't have "thiscall" ABI defined
SomeStruct * (STDMETHODCALLTYPE *foo)(SomeComInterface *This, SomeStruct *RetVal, int a, int b);

The metadata could try to make this "fixup" easier for languages that don't support "thiscall" by marking which methods would need this "fixup", but, that's a limitation of specific projections so we may not want to put it in the metadata that's shared by all projections.

For Zig, I think we'll start by emulating the thiscall ABI the way C does, but, maybe in the future we'll add support for the Thiscall ABI which would look something like this:

// an example of what it would look like in Zig if we added support for "thiscall"
var foo: *const fn(self: *SomeComInterface, a: c_int, b: c_int) callconv(.Thiscall) void;
AArnott commented 2 years ago

Thanks. I have a question though. If COM interfaces always use the thiscall convention, why doesn't .NET use it by default for COM interfaces? How did they get that wrong?

kennykerr commented 2 years ago

COM has worked this way for time immemorial but it's a pretty obscure corner that few discover. It has become more prominent with the proliferation of compilers and languages on Windows in recent years.

AArnott commented 2 years ago

@kennykerr Are you saying this is a bug in the .NET (Framework/Core) runtime in that it improperly makes COM calls in these cases?

riverar commented 2 years ago

Looks like there's no work here for metadata.

ID3D12Heap::GetDesc is currently defined in metadata as

instance valuetype Windows.Win32.Graphics.Direct3D12.D3D12_HEAP_DESC GetDesc () cil managed

which appears to conveys enough information to do the right thing.

@marler8997 OK to close?

michal-z commented 2 years ago

How one figures out from above metadata that the real C function prototype is:

        DECLSPEC_XFGVIRT(ID3D12Heap, GetDesc)
        D3D12_HEAP_DESC *( STDMETHODCALLTYPE *GetDesc )( 
            ID3D12Heap * This,
            D3D12_HEAP_DESC * RetVal);

?

marler8997 commented 2 years ago

@kennykerr Are you saying this is a bug in the .NET (Framework/Core) runtime in that it improperly makes COM calls in these cases?

It looks like DllImport accepts a calling convention and it supports "thiscall". https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.callingconvention?view=net-6.0

Are the .NET bindings specifying "thiscall" for COM methods? If they are, then it does sound like this is a big in how the marshaller implements "thiscall", which, wouldn't be the first time I've seen a bug in this area.

AArnott commented 2 years ago

I'd prefer that we not close the bug just yet. [DllImport] applies to p/invokes in .NET. As far as I know, the attribute cannot be applied to methods on an interface, which is where we're having this problem. I couldn't find a way last week when I looked to modify the calling convention to thiscall on a COM interface method in C#.

which appears to conveys enough information to do the right thing.

I guess that method signature only conveys enough information if there is side information somewhere stating that thiscall is the calling convention. If that's implied because it's COM and all COM uses thiscall, then fair enough. But I feel unsettled till we get confirmation that this is indeed a bug in the CLR (and maybe Core CLR) runtimes, and if so, get a statement about whether they'll fix it, or what their recommendation is, just as confirmation for our belief that it's a bug there rather than something else.

It still feels odd that if every projection except C++ has to change the signature to work properly, for the metadata to side with the C++ signature, but I guess it sort of makes sense because the signature looks the nicest as the metadata represents it, so any projection that can represent it that way should do so. I'd just like to be clear on how the rest of the projections know that they can't represent it that way, and get that documented somewhere.

riverar commented 2 years ago

Metadata has:

.method public hidebysig newslot abstract virtual 
        instance valuetype Windows.Win32.Graphics.Direct3D12.D3D12_HEAP_DESC
        GetDesc() cil managed

Its method definition signature (MethodDefSig) bytes are 20 00 11 C0 00 57 FC. The first byte holds attributes for the calling convention--in this case, attribute HASTHIS (0x20) and calling kind DEFAULT (0x0) ORed together. HASTHIS indicates that a this pointer shall be passed to the method.

So with that information, we can create a compatible C signature (x86) as such (if needed):


Fast forwarding to your COM interface method in C# comment.

You are probably looking at something like this:

[ComImport]
[Guid("...")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface ID3D12Heap
{
    D3D12_HEAP_DESC GetDesc();
}

In the debugger, hovering over that method (hanging off an object, not pictured) will show a CallingConvention of Standard | HasThis. This System.Reflection.CallingConventions.Standard value maps to CLI calling kind DEFAULT, not to be confused with stdcall or other unmanaged calling kinds. Somewhere in the depths of the CLR, the HasThis attribute will ultimately drive some decision making, like shoving the this pointer into an appropriate register (e.g. ECX on x86, RCX on x64) at call time.

Does that help any? Or did I misunderstand the question/bug?

AArnott commented 2 years ago

That helps, thank you. I didn't realize the calling convention was embedded in the method signature in a way that an IL decompiler wouldn't reveal it. The blog post link was also very useful. So, what exactly defines a struct? Evidently HRESULT doesn't qualify as a struct. I'll need to know from a .NET perspective how to read the metadata return type and decide whether it needs this special treatment. In .NET, every value-type is a struct (even int and HRESULT), so clearly we're operating with some other definition of struct for purposes of this conversation.

Also, most of this discussion and the entire blog post seems to center around x86 rules. What does that mean for x64 and arm64 architectures? Should I not transform the method signature for those cases? Is it allowed but not required? If the transformation is required and allowed only for x86, I guess that effectively makes any COM interface that includes struct-returning methods to be arch-specific, making them unavailable to AnyCPU compilations.

riverar commented 2 years ago

Heads up: The instance keyword in instance valuetype Windows.Win32.Graphics.Direct3D12.D3D12_HEAP_DESC GetDesc() cil managed maps to HASTHIS.

For the implementation side of that special return type rule, a decent heuristic might be to count the number of fields. If it's 2+, then do the dance. This will fail in cases where the struct contains tiny fields that all fit in a single register, or if one of the fields is a struct itself... but skimming the handful of affected COM methods that return structs, this doesn't seem to ever be the case.

Alternatively, you could check if the return type is a struct and isn't annotated with [NativeTypeDef]. Kenny does this in the Rust world and it has proven to be good enough thus far.

Or maybe even a Marshal.SizeOf<T>() > IntPtr.Size check could hold water. Just thinking out loud.

AArnott commented 2 years ago

you could check if the return type is a struct and isn't annotated with [NativeTypeDef].

I think I'll go with that. But I'll have to modify it to also disregard primitive types since int is a struct and I'm sure it isn't meant to fall into this category. I just wish I knew what might make a good set of 'primitives' to disregard.

mikebattista commented 1 year ago

Did we conclude that this issue can be closed?

AArnott commented 1 year ago

I'd love to get an answer to my questions in the second paragraph of https://github.com/microsoft/win32metadata/issues/636#issuecomment-1280872630 first.

kennykerr commented 1 year ago

For non-x86 the transformation is allowed because it effectively results in the same stack/register layout. For Rust for example, I just perform the transformation for all targets (x86, x64, ARM64) and it works everywhere.

AArnott commented 1 year ago

Thanks. I support closing this issue.