[Android/Vulkan] CopyTexture crashes the app

sa-exe commented 6 years ago

This code crashes when running on Android/Vulkan:

var src = factory.CreateTexture(TextureDescription.Texture2D(
    texSize, texSize, 1, 1, PixelFormat.R8_G8_B8_A8_UNorm, TextureUsage.Sampled));
var dst = factory.CreateTexture(TextureDescription.Texture2D(
    texSize, texSize, 1, 1, PixelFormat.R8_G8_B8_A8_UNorm, TextureUsage.Sampled));

var cl = factory.CreateCommandList();
cl.Begin();
cl.CopyTexture(src, dst);
cl.End();

device.SubmitCommands(cl);
device.WaitForIdle();

Only tested it on a Xiaomi Redmi 4X (Android 7.1). You can find a small repro project here.

mellinoe commented 6 years ago

Hm, very odd. I have the same phone, so I'll give it a shot on mine sometime soon.

mellinoe commented 6 years ago

You're passing 0 for the depth parameter. Instead, that should be 1. I didn't actually test this on my Android yet, but running your code on my desktop I do get a validation error. Using 1 for the depth avoids the validation error. Can you give that a try and see if it fixes the issue?

mellinoe commented 6 years ago

In the meantime, I will add some validation to CommandList.CopyTexture that ensures you specify a region that is valid.

sa-exe commented 6 years ago

Right, it should be 1. Still getting the exact same crash though.

mellinoe commented 6 years ago

I did a bit of investigating. It crashes in Release mode for me, but not otherwise. It doesn't seem to matter what kind of Texture or what the format is. I tried RGBA8_UNorm, and non-cubemap, and it still failed. Since it works in Debug mode, I wonder if there's some optimization that's corrupting something? I'll give try to give this a closer look tomorrow.

sa-exe commented 6 years ago

Hmm, on my device it crashes in Debug mode as well.

sa-exe commented 6 years ago

The format was supposed to be RGBA8_UNorm in both cases, but well, not that it matters. Also you're right, it doesn't even have to be a cubemap. Getting the same crash with 2 Sampled textures, both in Debug and Release modes.

sa-exe commented 6 years ago

Updated the issue and the repro project since the problem isn't specific to cubemaps.

sa-exe commented 6 years ago

This is where it crashes.

sa-exe commented 6 years ago

Another update: it only crashes when using the 32-bit runtime. Here's how to make sure it's used instead of the ARM64 one: 1) Uncheck Android Options -> Use Shared Runtime in the project settings 2) In the Advanced settings, uncheck every architecture except armeabi-v7a.

mellinoe commented 6 years ago

With those settings, I see the same crash even in Debug mode, like you said. At this point, though, I'm really not sure how to get a root cause for this. When I use those settings, I'm not able to attach a debugger, get debug output, or get any meaningful information out from the crash. arm64-v8a is the only architecture that I can attach a debugger to and get information from, and Veldrid seems to work fine there.

mellinoe commented 6 years ago

Since calling conventions are on my mind lately, one guess is that there's something wrong there. The Vulkan docs aren't very up-front with this info, but there are clues in some header files regarding platform-specific calling conventions. I'm not sure if this could affect us. The Vulkan bindings currently use StdCall everywhere (and it's non-trivial to do platform-specific calling conventions). I'm not sure what that calling convention would map to in the .NET world.

sa-exe commented 6 years ago

Weird, I can use the debugger just fine. Here's the debug output I'm getting. I don't think there's much useful information in there except that it's a segfault. Meanwhile, I tried running this on Mono x86 on Windows -- no crash.

mellinoe commented 6 years ago

Hm, are you able to step through the code and get output from Debug.WriteLine, for example? Those aren't working for me, which is very frustrating. They only work if I use arm64-v8a.

sa-exe commented 6 years ago

Yes, the experience is exactly the same as when debugging an ARM64 app. As in, nothing is broken.

sa-exe commented 6 years ago

Now that I've enabled debug layers, this is the message I'm getting: Source AccessMask 0 [None] must have required access bit 16384 [VK_ACCESS_HOST_WRITE_BIT] when layout is VK_IMAGE_LAYOUT_PREINITIALIZED, unless the app has previously added a barrier for this transition.

unless the app has previously added a barrier for this transition

Well, that's the thing: https://github.com/mellinoe/veldrid/blob/640458561b140dd4ca1610a20dd8610709651c9b/src/Veldrid/Vk/VulkanUtil.cs#L287

I believe it's a false positive.

Here's another weird error: vkCmdCopyImage: value of pRegions[i].srcSubresource.aspectMask must not be 0 It's... not 0.

sa-exe commented 6 years ago

Alright, I've built the latest validation layers from source. Only getting this message now. " [ VUID-VkImageSubresourceLayers-aspectMask-requiredbitmask ] Object: VK_NULL_HANDLE (Type = 0) | vkCmdCopyImage: value of pRegions[0].srcSubresource.aspectMask must not be 0. The spec valid usage text states 'aspectMask must not be 0' (https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VUID-VkImageSubresourceLayers-aspectMask-requiredbitmask)"

mellinoe commented 6 years ago

Yeah, I'm not sure how that could be 0. On the other hand, it's really valuable information that there is a validation error on the exact function call that is crashing. Perhaps there is something going wrong in the struct layout or struct marshalling somehow?

sa-exe commented 6 years ago

I'm getting a slightly different error when using PInvoke instead of calli: " [ VUID-vkCmdCopyImage-pRegions-parameter ] Object: VK_NULL_HANDLE (Type = 0) | vkCmdCopyImage: required parameter pRegions specified as NULL. The spec valid usage text states 'pRegions must be a valid pointer to an array of regionCount valid VkImageCopy structures' (https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VUID-vkCmdCopyImage-pRegions-parameter)"

[DllImport("libvulkan.so")]
public static extern void vkCmdCopyImage(VkCommandBuffer commandBuffer, VkImage srcImage, VkImageLayout srcImageLayout, VkImage dstImage, VkImageLayout dstImageLayout, uint regionCount, VkImageCopy* pRegions);

Tried several calling conventions - the error is the same (yet it's different when using calli). I wonder what's so special about this function and its inputs.

Update: on second thought, it's probably not supposed to work, unless I use PInvoke for everything.

mellinoe commented 6 years ago

I don't think switching everything to PInvoke should make a difference. It seems like, for whatever reason, the pointer parameter is getting mangled or passed incorrectly somehow. Both error messages seem to point to that (the first message seems to indicate a valid pointer, but invalid data).

sa-exe commented 6 years ago

This is the code generated by Mono's AOT compiler in release mode:

LDR     R1, [R0] ; vkCmdCopyImage
LDR     R0, [R11,#commandBuffer]
LDR     R2, [R11,#srcImage_lo]
LDR     R3, [R11,#srcImage_hi]
LDR     R12, [R11,#srcImageLayout]
STR     R12, [SP]
LDR     R12, [R11,#dstImage_lo]
STR     R12, [SP,#4]
LDR     R12, [R11,#dstImage_hi]
STR     R12, [SP,#8]
LDR     R12, [R11,#dstImageLayout]
STR     R12, [SP,#0xC]
LDR     R12, [R11,#regionCount]
STR     R12, [SP,#0x10]
LDR     R12, [R11,#pRegions]
STR     R12, [SP,#0x14]
BLX     R1

Essentially, this is what the registers and the stack look like when vkCmdCopyImage is called:

R0: commandBuffer
R1: points to the function
R2: srcImage (low word)
R3: srcImage (high word)

[SP + 0] srcImageLayout
[SP + 4] dstImage (low word)
[SP + 8] dstImage (high word)
[SP + C] dstImageLayout
[SP + 10] regionCount
[SP + 14] pRegions

And here's the equivalent (except working) code generated by clang:

LDR.W       LR, [LR] ; vkCmdCopyImage
LDR         R2, [SP,#0x700+commandBuffer]
LDR         R3, [SP,#0x700+srcImage+4]
LDR.W       R12, [SP,#0x700+srcImage]
LDR.W       R4, [SP,#0x700+dstImage]
LDR.W       R5, [SP,#0x700+dstImage+4]
MOV         R6, SP
LDR.W       R8, [SP,#0x700+pRegions]
STR.W       R8, [R6,#0x18] ; pRegions
STR         R0, [R6,#0x14] ; regionCount
MOVS        R0, #7
STR         R0, [R6,#0x10] ; dstImageLayout
STR         R5, [R6,#0xC] ; dstImage (low word)
STR         R4, [R6,#8] ; dstImage (high word)
MOVS        R0, #6
STR         R0, [R6] ; srcImageLayout
MOV         R0, R2
MOV         R2, R12
BLX         LR

So the arguments are passed like this:

R0: commandBuffer
R1: unused?
R2: srcImage (low word)
R3: srcImage (high word)

[SP + 0] srcImageLayout
[SP + 8] dstImage (low word)
[SP + C] dstImage (high word)
[SP + 10] dstImageLayout
[SP + 14] regionCount
[SP + 18] pRegions

Notice where pRegions is stored. The location is different, and that's because here we have a 4 byte gap between srcImageLayout and dstImage. I don't have an explanation for this. Could be some sort of convention that Mono doesn't respect, I don't know. But the fact of the matter is, the implementation expects to find pRegions at [SP + 18], not [SP + 14].

I've already tried changing the function signature so that the arguments would end up at the corrent locations, and it worked. You can do that by adding an extra 4 byte parameter after srcImageLayout. Removing the extra parameter makes it crash again. I think this will only work in release mode, though.

mellinoe commented 6 years ago

@SomeAnon42 That is excellent information. Thanks for helping out with the debugging. Unless there's some other way for us to influence how the parameters are handled that we haven't already tried, then it seems to be a mono codegen bug.

sa-exe commented 6 years ago

I'd really like to see what RyuJIT generates on ARM32. But, unfortunately, I don't have any ARM devices that can run desktop Windows or Linux. Guess I'll just try .NET Native.

sa-exe commented 6 years ago

Hmm... .NET Native gives me this error:

Error: NUTC300D:Internal Compiler Error: Type 'Vulkan.VkComputePipelineCreateInfo[]' is not valid for use outside of the managed environment, but is passed to native code while compiling method 'static Vulkan.VkResult Vulkan.VulkanNative.vkCreateComputePipelines(Vulkan.VkDevice, Vulkan.VkPipelineCache, System.UInt32, Vulkan.VkComputePipelineCreateInfo[], Vulkan.VkAllocationCallbacks&, Vulkan.VkPipeline&)'

sa-exe commented 6 years ago

Nope, I don't think .NET Native will allow me to do what I want. It's a UWP tech, so it has some restrictions.

mhutch commented 6 years ago

It looks to me like the issue is that the uint64 dstImage argument is not being aligned to an 8 byte boundary? I took a look at the ARM calling conventions and this seems to be required.

mellinoe commented 6 years ago

@mhutch Thanks confirming that. It definitely seems to be a mono codegen issue, specific only to 32-bit ARM.

mhutch commented 6 years ago

I'd appreciate it if you could file things like this upstream in future - I only found it by chance 🙂

vargaz commented 6 years ago

Re: this comment: I'm getting a slightly different error when using PInvoke instead of calli:

Where is the code which calls using calli ? You cannot call native code using calli in general, the runtime needs to add marshalling, etc. Also, the runtime is allowed to use a different calling convention in managed code vs. native code, so the calli needs to use the correct (pinvoke) calling convention.

vargaz commented 6 years ago

Does the crash happen only when using the AOT compiler ? If so, can you attach the verbose build logs ?

mellinoe commented 6 years ago

Where is the code which calls using calli ?

It's generated as part of this rewriting step: https://github.com/mellinoe/vk/blob/master/src/vk.rewrite/Program.cs. Some marshalling is also done in there, but not for this overload, since it shouldn't be necessary.

the calli needs to use the correct (pinvoke) calling convention.

We talked about all of this above -- using different calling conventions doesn't make a difference. Also, there are dozens of other Vulkan functions that are being called successfully which are all declared with the same calling convention.

Does the crash happen only when using the AOT compiler ?

I can't recall that -- @SomeAnon42 Do you remember? I can't test this again until later tonight.

sa-exe commented 6 years ago

Does the crash happen only when using the AOT compiler ?

No, it happens when using the JIT compiler as well. I simply used the AOT compiler to get the disassembly, because that seemed like the easiest way to me.

vargaz commented 6 years ago

The relevant code in the runtime is mini-arm.c:

https://github.com/vargaz/mono/blob/master/mono/mini/mini-arm.c#L1096

        if (eabi_supported) {
            /* darwin aligns longs to 4 byte only */
            if (i8_align == 8) {
                *stack_size += 7;
                *stack_size &= ~7;
            }
        }

eabi_supported is set using:

#if defined(__ARM_EABI__)
    eabi_supported = TRUE;
#endif

i8_align is set using:

    i8_align = MONO_ABI_ALIGNOF (gint64);

which should be 8 on arm.

eabi_supported is only set during AOT when the aot compiler is passed the mtriple=arm-linux-gnueabi argument, which is possible. But eabi_supported is supposed to be set when running with the JIT.

lewurm commented 6 years ago

@vargaz I think we hit a different case. You are assuming the parameter is MONO_TYPE_I8, right? But since the type is VkImageCopy* pRegions, we go this path: https://github.com/vargaz/mono/blob/b0f20fcd028811c24e2e6b82e623972cb6d94431/mono/mini/mini-arm.c#L1410-L1415 or https://github.com/vargaz/mono/blob/master/mono/mini/mini-arm.c#L1435-L1440

and therefore simple = TRUE https://github.com/vargaz/mono/blob/b0f20fcd028811c24e2e6b82e623972cb6d94431/mono/mini/mini-arm.c#L1069-L1075 where we don't respect eabi_supported/i8_aligned.

does that sound plausible?

veldrid / veldrid

[Android/Vulkan] CopyTexture crashes the app #83