Open sa-exe opened 6 years ago
Hm, very odd. I have the same phone, so I'll give it a shot on mine sometime soon.
You're passing 0 for the depth parameter. Instead, that should be 1. I didn't actually test this on my Android yet, but running your code on my desktop I do get a validation error. Using 1 for the depth avoids the validation error. Can you give that a try and see if it fixes the issue?
In the meantime, I will add some validation to CommandList.CopyTexture
that ensures you specify a region that is valid.
Right, it should be 1. Still getting the exact same crash though.
I did a bit of investigating. It crashes in Release mode for me, but not otherwise. It doesn't seem to matter what kind of Texture or what the format is. I tried RGBA8_UNorm, and non-cubemap, and it still failed. Since it works in Debug mode, I wonder if there's some optimization that's corrupting something? I'll give try to give this a closer look tomorrow.
Hmm, on my device it crashes in Debug mode as well.
The format was supposed to be RGBA8_UNorm in both cases, but well, not that it matters. Also you're right, it doesn't even have to be a cubemap. Getting the same crash with 2 Sampled textures, both in Debug and Release modes.
Updated the issue and the repro project since the problem isn't specific to cubemaps.
Another update: it only crashes when using the 32-bit runtime. Here's how to make sure it's used instead of the ARM64 one: 1) Uncheck Android Options -> Use Shared Runtime in the project settings 2) In the Advanced settings, uncheck every architecture except armeabi-v7a.
With those settings, I see the same crash even in Debug mode, like you said. At this point, though, I'm really not sure how to get a root cause for this. When I use those settings, I'm not able to attach a debugger, get debug output, or get any meaningful information out from the crash. arm64-v8a is the only architecture that I can attach a debugger to and get information from, and Veldrid seems to work fine there.
Since calling conventions are on my mind lately, one guess is that there's something wrong there. The Vulkan docs aren't very up-front with this info, but there are clues in some header files regarding platform-specific calling conventions. I'm not sure if this could affect us. The Vulkan bindings currently use StdCall everywhere (and it's non-trivial to do platform-specific calling conventions). I'm not sure what that calling convention would map to in the .NET world.
Weird, I can use the debugger just fine. Here's the debug output I'm getting. I don't think there's much useful information in there except that it's a segfault. Meanwhile, I tried running this on Mono x86 on Windows -- no crash.
Hm, are you able to step through the code and get output from Debug.WriteLine, for example? Those aren't working for me, which is very frustrating. They only work if I use arm64-v8a.
Yes, the experience is exactly the same as when debugging an ARM64 app. As in, nothing is broken.
Now that I've enabled debug layers, this is the message I'm getting:
Source AccessMask 0 [None] must have required access bit 16384 [VK_ACCESS_HOST_WRITE_BIT] when layout is VK_IMAGE_LAYOUT_PREINITIALIZED, unless the app has previously added a barrier for this transition.
unless the app has previously added a barrier for this transition
Well, that's the thing: https://github.com/mellinoe/veldrid/blob/640458561b140dd4ca1610a20dd8610709651c9b/src/Veldrid/Vk/VulkanUtil.cs#L287
I believe it's a false positive.
Here's another weird error:
vkCmdCopyImage: value of pRegions[i].srcSubresource.aspectMask must not be 0
It's... not 0.
Alright, I've built the latest validation layers from source. Only getting this message now.
" [ VUID-VkImageSubresourceLayers-aspectMask-requiredbitmask ] Object: VK_NULL_HANDLE (Type = 0) | vkCmdCopyImage: value of pRegions[0].srcSubresource.aspectMask must not be 0. The spec valid usage text states 'aspectMask must not be 0' (https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VUID-VkImageSubresourceLayers-aspectMask-requiredbitmask)"
Yeah, I'm not sure how that could be 0. On the other hand, it's really valuable information that there is a validation error on the exact function call that is crashing. Perhaps there is something going wrong in the struct layout or struct marshalling somehow?
I'm getting a slightly different error when using PInvoke instead of calli:
" [ VUID-vkCmdCopyImage-pRegions-parameter ] Object: VK_NULL_HANDLE (Type = 0) | vkCmdCopyImage: required parameter pRegions specified as NULL. The spec valid usage text states 'pRegions must be a valid pointer to an array of regionCount valid VkImageCopy structures' (https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VUID-vkCmdCopyImage-pRegions-parameter)"
[DllImport("libvulkan.so")]
public static extern void vkCmdCopyImage(VkCommandBuffer commandBuffer, VkImage srcImage, VkImageLayout srcImageLayout, VkImage dstImage, VkImageLayout dstImageLayout, uint regionCount, VkImageCopy* pRegions);
Tried several calling conventions - the error is the same (yet it's different when using calli). I wonder what's so special about this function and its inputs.
Update: on second thought, it's probably not supposed to work, unless I use PInvoke for everything.
I don't think switching everything to PInvoke should make a difference. It seems like, for whatever reason, the pointer parameter is getting mangled or passed incorrectly somehow. Both error messages seem to point to that (the first message seems to indicate a valid pointer, but invalid data).
This is the code generated by Mono's AOT compiler in release mode:
LDR R1, [R0] ; vkCmdCopyImage
LDR R0, [R11,#commandBuffer]
LDR R2, [R11,#srcImage_lo]
LDR R3, [R11,#srcImage_hi]
LDR R12, [R11,#srcImageLayout]
STR R12, [SP]
LDR R12, [R11,#dstImage_lo]
STR R12, [SP,#4]
LDR R12, [R11,#dstImage_hi]
STR R12, [SP,#8]
LDR R12, [R11,#dstImageLayout]
STR R12, [SP,#0xC]
LDR R12, [R11,#regionCount]
STR R12, [SP,#0x10]
LDR R12, [R11,#pRegions]
STR R12, [SP,#0x14]
BLX R1
Essentially, this is what the registers and the stack look like when vkCmdCopyImage is called:
R0: commandBuffer
R1: points to the function
R2: srcImage (low word)
R3: srcImage (high word)
[SP + 0] srcImageLayout
[SP + 4] dstImage (low word)
[SP + 8] dstImage (high word)
[SP + C] dstImageLayout
[SP + 10] regionCount
[SP + 14] pRegions
And here's the equivalent (except working) code generated by clang:
LDR.W LR, [LR] ; vkCmdCopyImage
LDR R2, [SP,#0x700+commandBuffer]
LDR R3, [SP,#0x700+srcImage+4]
LDR.W R12, [SP,#0x700+srcImage]
LDR.W R4, [SP,#0x700+dstImage]
LDR.W R5, [SP,#0x700+dstImage+4]
MOV R6, SP
LDR.W R8, [SP,#0x700+pRegions]
STR.W R8, [R6,#0x18] ; pRegions
STR R0, [R6,#0x14] ; regionCount
MOVS R0, #7
STR R0, [R6,#0x10] ; dstImageLayout
STR R5, [R6,#0xC] ; dstImage (low word)
STR R4, [R6,#8] ; dstImage (high word)
MOVS R0, #6
STR R0, [R6] ; srcImageLayout
MOV R0, R2
MOV R2, R12
BLX LR
So the arguments are passed like this:
R0: commandBuffer
R1: unused?
R2: srcImage (low word)
R3: srcImage (high word)
[SP + 0] srcImageLayout
[SP + 8] dstImage (low word)
[SP + C] dstImage (high word)
[SP + 10] dstImageLayout
[SP + 14] regionCount
[SP + 18] pRegions
Notice where pRegions is stored. The location is different, and that's because here we have a 4 byte gap between srcImageLayout and dstImage. I don't have an explanation for this. Could be some sort of convention that Mono doesn't respect, I don't know. But the fact of the matter is, the implementation expects to find pRegions at [SP + 18], not [SP + 14].
I've already tried changing the function signature so that the arguments would end up at the corrent locations, and it worked. You can do that by adding an extra 4 byte parameter after srcImageLayout. Removing the extra parameter makes it crash again. I think this will only work in release mode, though.
@SomeAnon42 That is excellent information. Thanks for helping out with the debugging. Unless there's some other way for us to influence how the parameters are handled that we haven't already tried, then it seems to be a mono codegen bug.
I'd really like to see what RyuJIT generates on ARM32. But, unfortunately, I don't have any ARM devices that can run desktop Windows or Linux. Guess I'll just try .NET Native.
Hmm... .NET Native gives me this error:
Error: NUTC300D:Internal Compiler Error: Type 'Vulkan.VkComputePipelineCreateInfo[]' is not valid for use outside of the managed environment, but is passed to native code while compiling method 'static Vulkan.VkResult Vulkan.VulkanNative.vkCreateComputePipelines(Vulkan.VkDevice, Vulkan.VkPipelineCache, System.UInt32, Vulkan.VkComputePipelineCreateInfo[], Vulkan.VkAllocationCallbacks&, Vulkan.VkPipeline&)'
Nope, I don't think .NET Native will allow me to do what I want. It's a UWP tech, so it has some restrictions.
It looks to me like the issue is that the uint64 dstImage argument is not being aligned to an 8 byte boundary? I took a look at the ARM calling conventions and this seems to be required.
@mhutch Thanks confirming that. It definitely seems to be a mono codegen issue, specific only to 32-bit ARM.
I'd appreciate it if you could file things like this upstream in future - I only found it by chance 🙂
Re: this comment: I'm getting a slightly different error when using PInvoke instead of calli:
Where is the code which calls using calli ? You cannot call native code using calli in general, the runtime needs to add marshalling, etc. Also, the runtime is allowed to use a different calling convention in managed code vs. native code, so the calli needs to use the correct (pinvoke) calling convention.
Does the crash happen only when using the AOT compiler ? If so, can you attach the verbose build logs ?
Where is the code which calls using calli ?
It's generated as part of this rewriting step: https://github.com/mellinoe/vk/blob/master/src/vk.rewrite/Program.cs. Some marshalling is also done in there, but not for this overload, since it shouldn't be necessary.
the calli needs to use the correct (pinvoke) calling convention.
We talked about all of this above -- using different calling conventions doesn't make a difference. Also, there are dozens of other Vulkan functions that are being called successfully which are all declared with the same calling convention.
Does the crash happen only when using the AOT compiler ?
I can't recall that -- @SomeAnon42 Do you remember? I can't test this again until later tonight.
Does the crash happen only when using the AOT compiler ?
No, it happens when using the JIT compiler as well. I simply used the AOT compiler to get the disassembly, because that seemed like the easiest way to me.
The relevant code in the runtime is mini-arm.c:
https://github.com/vargaz/mono/blob/master/mono/mini/mini-arm.c#L1096
if (eabi_supported) {
/* darwin aligns longs to 4 byte only */
if (i8_align == 8) {
*stack_size += 7;
*stack_size &= ~7;
}
}
eabi_supported is set using:
#if defined(__ARM_EABI__)
eabi_supported = TRUE;
#endif
i8_align is set using:
i8_align = MONO_ABI_ALIGNOF (gint64);
which should be 8 on arm.
eabi_supported is only set during AOT when the aot compiler is passed the mtriple=arm-linux-gnueabi argument, which is possible. But eabi_supported is supposed to be set when running with the JIT.
@vargaz I think we hit a different case. You are assuming the parameter is MONO_TYPE_I8
, right?
But since the type is VkImageCopy* pRegions
, we go this path:
https://github.com/vargaz/mono/blob/b0f20fcd028811c24e2e6b82e623972cb6d94431/mono/mini/mini-arm.c#L1410-L1415 or
https://github.com/vargaz/mono/blob/master/mono/mini/mini-arm.c#L1435-L1440
and therefore simple = TRUE
https://github.com/vargaz/mono/blob/b0f20fcd028811c24e2e6b82e623972cb6d94431/mono/mini/mini-arm.c#L1069-L1075
where we don't respect eabi_supported
/i8_aligned
.
does that sound plausible?
This code crashes when running on Android/Vulkan:
Only tested it on a Xiaomi Redmi 4X (Android 7.1). You can find a small repro project here.