vulkano-rs / vulkano

Safe and rich Rust wrapper around the Vulkan API
Apache License 2.0
4.55k stars 437 forks source link

Segfault when creating a buffer with exportable fd on Linux after updating to 0.22 #1545

Closed hakolao closed 2 years ago

hakolao commented 3 years ago

After updating to 0.22 I get segfault when creating a DeviceLocalBuffer::raw_with_exportable_fd.

Issue

Segfault after #1510. Segfaulting at: ptr_chain_iter image

It seems to be related to the allocation size somehow. I don't have much clue what the changes are for in above linked PR that broke the functionality for me, but just reporting this is happening. 34mb size seems to be enough to trigger this.

I'll use the previous versions meanwhile.

main.rs:

use vulkano::buffer::{BufferUsage, DeviceLocalBuffer};
use vulkano::device::{Device, DeviceExtensions};
use vulkano::instance::{Instance, InstanceExtensions, PhysicalDevice};

fn main() {
    // Add instance extensions based on needs
    let instance_extensions = InstanceExtensions {
        khr_surface: true,
        khr_xlib_surface: true,
        khr_get_physical_device_properties2: true,
        khr_get_surface_capabilities2: true,
        ..InstanceExtensions::none()
    };
    // Create instance
    let instance =
        Instance::new(None, &instance_extensions, None).expect("Failed to create instance");
    // Get most performant device (physical)
    let physical = PhysicalDevice::enumerate(&instance)
        .fold(None, |acc, val| {
            if acc.is_none() {
                Some(val)
            } else if acc.unwrap().limits().max_compute_shared_memory_size()
                >= val.limits().max_compute_shared_memory_size()
            {
                acc
            } else {
                Some(val)
            }
        })
        .expect("No physical device found");
    println!(
        "Using device: {} (type: {:?})",
        physical.name(),
        physical.ty()
    );
    let queue_family = physical
        .queue_families()
        .find(|&q| q.supports_graphics())
        .expect("couldn't find a graphical queue family");
    let device_extensions = DeviceExtensions::supported_by_device(physical);
    let features = physical.supported_features();
    // Needed extension for shared buffers
    assert!(device_extensions.khr_external_memory_fd);
    let (device, mut _queues) = {
        Device::new(
            physical,
            &features,
            &device_extensions,
            [(queue_family, 0.5)].iter().cloned(),
        )
        .expect("failed to create device")
    };

    let vk_buffer = unsafe {
        DeviceLocalBuffer::<[u8]>::raw_with_exportable_fd(
            device,
            34000000, // !ToDo: Try changing this value to see when 11: SIGSEGV happens
            BufferUsage::all(),
            Vec::new(),
        )
        .expect("Failed to allocate device local buffer")
    };
    println!("{:#?}", vk_buffer);
}
hakolao commented 3 years ago

cc @gurchetansingh, you might know about this having worked on the PR

gurchetansingh commented 3 years ago

Can you simply your test app further?

DeviceMemoryBuilder::new(device.clone(), memory_type.id(), size).export_info(handle_type).build()?;

could also exhibit the problem

hakolao commented 3 years ago

I'll try to get a moment to debug further soon!

gurchetansingh commented 3 years ago

Thank you @hakolao! This is mostly like pointer thrashing.

gdb --args ./your_app break ptr_chain_iter

could be chance to improve ur gdb skills.

hakolao commented 3 years ago

Findings with clion's debugger (played with gdb too, but clion's UI won)

I think that as we reinterpret the BaseStructure inside push_next wrong. Its data was sometimes wrong (E.g. size of memory in place of s_type and also it would be safer to #[repr(C)] the BaseOutStructure.

Also, it would be nice to copy paste the comments from ash when shamelessly copying internal functions from there :D.

After a while of attempts the PR below seemed to work.

hakolao commented 3 years ago

Our exported & cuda imported memory also worked with this fix :)

dadleyy commented 3 years ago

hey all - I'm running into a segfault that I believe to be related to this issue. I've tried my best to whittle a reproduction down to the minimal amount of code here.

I am running this against the latest master by using cargo's [patch.crates-io] override feature, using a local copy of this repo that is up to date with master, including the change from 7db41e5.

the last bit of output before the segfault:

VUID-VkMemoryAllocateInfo-pNext-pNext(ERROR / SPEC): msgNum: -633955574 - Validation Error: [ VUID-VkMemoryAllocateInfo-pNext-pNext ] Object 0: handle = 0x7fcdac848418, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xda369b0a | vkAllocateMemory: pAllocateInfo->pNext chain includes a structure with unknown VkStructureType (-1409041408); Allowed structures are [VkDedicatedAllocationMemoryAllocateInfoNV, VkExportMemoryAllocateInfo, VkExportMemoryAllocateInfoNV, VkExportMemoryWin32HandleInfoKHR, VkExportMemoryWin32HandleInfoNV, VkImportAndroidHardwareBufferInfoANDROID, VkImportMemoryFdInfoKHR, VkImportMemoryHostPointerInfoEXT, VkImportMemoryWin32HandleInfoKHR, VkImportMemoryWin32HandleInfoNV, VkMemoryAllocateFlagsInfo, VkMemoryDedicatedAllocateInfo, VkMemoryOpaqueCaptureAddressAllocateInfo, VkMemoryPriorityAllocateInfoEXT]. This error is based on the Valid Usage documentation for version 170 of the Vulkan header.  It is possible that you are using a struct from a private extension or an extension that was added to a later version of the Vulkan header, in which case the use of pAllocateInfo->pNext is undefined and may not work correctly with validation enabled The Vulkan spec states: Each pNext member of any structure (including this one) in the pNext chain must be either NULL or a pointer to a valid instance of VkDedicatedAllocationMemoryAllocateInfoNV, VkExportMemoryAllocateInfo, VkExportMemoryAllocateInfoNV, VkExportMemoryWin32HandleInfoKHR, VkExportMemoryWin32HandleInfoNV, VkImportAndroidHardwareBufferInfoANDROID, VkImportMemoryFdInfoKHR, VkImportMemoryHostPointerInfoEXT, VkImportMemoryWin32HandleInfoKHR, VkImportMemoryWin32HandleInfoNV, VkMemoryAllocateFlagsInfo, VkMemoryDedicatedAllocateInfo, VkMemoryOpaqueCaptureAddressAllocateInfo, or VkMemoryPriorityAllocateInfoEXT (https://vulkan.lunarg.com/doc/view/1.2.170.0/mac/1.2-extensions/vkspec.html#VUID-VkMemoryAllocateInfo-pNext-pNext)
    Objects: 1
        [0] 0x7fcdac848418, type: 3, name: NULL
zsh: segmentation fault  cargo run --release

the only thing jumps out to me is the part about the sType of some pNext pointer:

vkAllocateMemory: pAllocateInfo->pNext chain includes a structure with unknown VkStructureType (-1409041408);

I do not have much experience debugging memory issues (including using gdb) yet, so I'm not sure what to look for. I also was pretty stumped when the teapot example on my local clone of this repo runs fine, since it is also using AttachmentImage::transient, which appears to be the call that causes the segfault. All this is to say that I don't have much confidence there even is an issue here, or if this is somehow isolated to my machine. I believe that cloning the repo linked above from another machine would help answer that quickly. I can see if I can find one to use in the meantime.

os macos v10.15.7
cargo --version cargo 1.51.0 (43b129a20 2021-03-16)

If there is anything I can do to help or more information I can provide, I would be happy to help. I really appreciate all the work going into this project.

gurchetansingh commented 3 years ago

Did you try top of tree with the fix?

https://github.com/vulkano-rs/vulkano/pull/1554

dadleyy commented 3 years ago

I could've sworn I was. I've changed the override to use a git resolution:

[patch.crates-io]
vulkano = { git = 'https://github.com/vulkano-rs/vulkano', rev = 'fb187e888469ac2255855ffa8b7022258a688222' }

instead of using { path = '...' }. now I'm only seeing the segfault when running with --release (cargo run --release).

gurchetansingh commented 3 years ago

@dadleyy I tried vulkano-seg-1545, but did not have any segmentation fault.

smithay_client_toolkit::window::concept_frame] No font could be found

Maybe it's an architecture specific issue? Also I changed this for it to compile:

 [dependencies]
-vulkano = { version = "0.22" }
-vulkano-win = { version = "0.22" }
+vulkano = { path = "../vulkano/vulkano" }
+vulkano-win = { path = "../vulkano/vulkano-win" }
 winit = { version = "0.24.0" }
 env_logger = { version = "0.8.3" }
 log = { version = "^0.4.1" }
-
-[patch.crates-io]
-vulkano = { git = 'https://github.com/vulkano-rs/vulkano', rev = 'fb187e888469ac2255855ffa8b7022258a688222' }
dadleyy commented 3 years ago

@gurchetansingh what architecture/environment are you running on? I was only experiencing this when running --release builds, did you try that?

smithay_client_toolkit::window::concept_frame] No font could be found

I'm confused - did you mean to include this in your comment? was that output from running vulkano-seg-1545?


I'm still experiencing this issue, but I have been slowly poking away at what I might think is the issue. I just finished implementing some changes that appear to fix this issue on this branch of my fork, though I'm not 100% certain why it would be a solution.

I'm thinking this has to do with the memory pointed to when using the push_next + ptr_chain_iter being invalidated somehow - which explains why the sType ends up as garbage like -1409041408: if we push_next(&mut T) and then the location of T changes at some point, the pNext pointer on our MemoryAllocateInfo would be invalid right?

I'm definitely still learning rust so this is all just me spitballing.

hakolao commented 3 years ago

I also attempted this with the --release just now, and it segfaulted :S. Looks like the fix wasn't enough indeed.

Works without --release.

hakolao commented 3 years ago

A better fix attempt... #1580. Hopefully this would fix it once and for all. Rust can sometimes make ptr play so hard...

It kind of baffles me why this would work in non-release mode and fail in release mode.

pwaller commented 2 years ago

I'm seeing this on head today, except with AttachmentImage::new_with_exportable_fd -- and again, it crashes with a --release build but not a debug build.

pwaller commented 2 years ago

The above issue I think is caused by this, and separate to the original issue as reported, sorry: https://github.com/vulkano-rs/vulkano/pull/1734/files#r771974391

Rua commented 2 years ago

The dropped temporary is fixed now, so if that was the problem here, it should be fixed too. I'll reopen if it's still present.