ikwzm / udmabuf

User space mappable dma buffer device driver for Linux.
BSD 2-Clause "Simplified" License

Could this be used with V4L2/libcamera buffers on the Raspberry Pi 4 (Arm A72) #107

Open octopus-russell opened 1 year ago

octopus-russell commented 1 year ago

Hi,

We've come across this driver as a potential way of passing a userspace DMA buffer to V4L2 instead of using V4L2's default mmap mode, which is rather slow. In this issue, someone reports a 15x speedup from doing so: https://github.com/ikwzm/udmabuf/issues/38

Do you know if this module supports the Raspberry Pi 4? (ARM A72, Debian bullseye, kernel 6.1.21)

Thanks,
Russell

ikwzm commented 1 year ago

Thanks for the issue.

I have only run it on ARM Cortex®-A53 (Xilinx Zynq Ultrascale+ MPSoC) and ARM Cortex®-A9 (Xilinx ZYNQ / Altera CycloneV SoC), so I don't know whether it works on ARM Cortex®-A72 (Raspberry Pi 4).

It may work on the ARM Cortex®-A72 (Raspberry Pi 4) since it has the same arm64 architecture as the A53.

If anyone has information about this, please share it.

kbingham commented 1 year ago

udmabuf likely isn't a good way to pass the buffers, but if you're experiencing issues with mmap'ing buffers indeed it's because they are likely in uncached memory.

ikwzm commented 1 year ago

> udmabuf likely isn't a good way to pass the buffers, but if you're experiencing issues with mmap'ing buffers indeed it's because they are likely in uncached memory.

Here is a little explanation about the cache being turned off.

Performance issue with V4L2 streaming I/O (V4L2_MEMORY_MMAP)

Introduction

V4L2 streaming I/O (V4L2_MEMORY_MMAP) is a V4L2 streaming I/O scheme that maps V4L2 buffers allocated in the V4L2 driver (in the kernel) to user space using the mmap mechanism, allowing user programs to access the V4L2 buffers. This method is used relatively often because it allows direct access to the V4L2 buffers from user space.

However, certain V4L2 drivers turn caching off when mapping buffers to user space with mmap, which results in very slow memory access and poor performance.

One V4L2 driver that causes this problem is Xilinx's Video DMA.

This topic describes the mechanism.

Mechanism of cache turn-off

The mmap implementation of dma-contig, one of the V4L2 buffer memory allocators, turns off the cache in some cases. As a result, a V4L2 driver that employs dma-contig can end up with an uncached mapping when its buffers are mmapped.

Memory allocator for V4L2 buffer

There are three types of memory allocators for V4L2 buffers: vmalloc, dma-sg, and dma-contig.

Of these, the last one, dma-contig, is the most problematic.

vmalloc

vmalloc is the memory allocator for V4L2 drivers that do not perform DMA themselves. The V4L2 driver for USB cameras is an example: the USB device driver transfers data to and from the USB device, and the V4L2 driver itself does not transfer data directly. It therefore allocates memory with vmalloc, which the kernel normally uses.

dma-sg

dma-sg is a memory allocator for devices whose DMA supports Scatter Gather, which allows DMA transfers even when buffers are not contiguous in physical memory space. It allocates memory using the Linux kernel's scatter-gather DMA API.

dma-contig

dma-contig is a memory allocator for devices whose DMA does not support Scatter Gather. It uses the Linux kernel's DMA API to allocate memory. The mmap of dma-contig is where the problem lies: the mmap of a V4L2 driver that uses dma-contig may turn off the cache.

mmap for dma-contig

vb2_dc_mmap()

The mmap for dma-contig is as follows

https://elixir.bootlin.com/linux/v6.1.38/source/drivers/media/common/videobuf2/videobuf2-dma-contig.c#L274

static int vb2_dc_mmap(void *buf_priv, struct vm_area_struct *vma)
{
    struct vb2_dc_buf *buf = buf_priv;
    int ret;

    if (!buf) {
        printk(KERN_ERR "No buffer to map\n");
        return -EINVAL;
    }

    if (buf->non_coherent_mem)
        ret = dma_mmap_noncontiguous(buf->dev, vma, buf->size,
                         buf->dma_sgt);
    else
        ret = dma_mmap_attrs(buf->dev, vma, buf->cookie, buf->dma_addr,
                     buf->size, buf->attrs);
    if (ret) {
        pr_err("Remapping memory failed, error: %d\n", ret);
        return ret;
    }

    vma->vm_flags       |= VM_DONTEXPAND | VM_DONTDUMP;
    vma->vm_private_data    = &buf->handler;
    vma->vm_ops     = &vb2_common_vm_ops;

    vma->vm_ops->open(vma);

    pr_debug("%s: mapped dma addr 0x%08lx at 0x%08lx, size %lu\n",
         __func__, (unsigned long)buf->dma_addr, vma->vm_start,
         buf->size);

    return 0;
}

Leave buf->non_coherent_mem aside here. If buf->non_coherent_mem is true, the buffer is allocated in non-contiguous space. When the buffer is allocated in contiguous space, which is the case of interest, dma_mmap_attrs() is called.

dma_mmap_attrs()

dma_mmap_attrs() is as follows.

https://elixir.bootlin.com/linux/v6.1.38/source/kernel/dma/mapping.c#L457

int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
        void *cpu_addr, dma_addr_t dma_addr, size_t size,
        unsigned long attrs)
{
    const struct dma_map_ops *ops = get_dma_ops(dev);

    if (dma_alloc_direct(dev, ops))
        return dma_direct_mmap(dev, vma, cpu_addr, dma_addr, size,
                attrs);
    if (!ops->mmap)
        return -ENXIO;
    return ops->mmap(dev, vma, cpu_addr, dma_addr, size, attrs);
}

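On the arm64 architecture, dma_alloc_direct() normally returns true (this is the case when the device has no dma_map_ops installed, for example when it is not behind an IOMMU), so dma_direct_mmap() is called.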

dma_direct_mmap()

dma_direct_mmap() is as follows.

https://elixir.bootlin.com/linux/v6.1.38/source/kernel/dma/direct.c#L555

int dma_direct_mmap(struct device *dev, struct vm_area_struct *vma,
        void *cpu_addr, dma_addr_t dma_addr, size_t size,
        unsigned long attrs)
{
    unsigned long user_count = vma_pages(vma);
    unsigned long count = PAGE_ALIGN(size) >> PAGE_SHIFT;
    unsigned long pfn = PHYS_PFN(dma_to_phys(dev, dma_addr));
    int ret = -ENXIO;

    vma->vm_page_prot = dma_pgprot(dev, vma->vm_page_prot, attrs);
    if (force_dma_unencrypted(dev))
        vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);

    if (dma_mmap_from_dev_coherent(dev, vma, cpu_addr, size, &ret))
        return ret;
    if (dma_mmap_from_global_coherent(vma, cpu_addr, size, &ret))
        return ret;

    if (vma->vm_pgoff >= count || user_count > count - vma->vm_pgoff)
        return -ENXIO;
    return remap_pfn_range(vma, vma->vm_start, pfn + vma->vm_pgoff,
            user_count << PAGE_SHIFT, vma->vm_page_prot);
}

Here, the cache attributes of the mapping are set by dma_pgprot().

dma_pgprot()

dma_pgprot() is as follows.

https://elixir.bootlin.com/linux/v6.1.38/source/kernel/dma/mapping.c#L415

#ifdef CONFIG_MMU
/*
 * Return the page attributes used for mapping dma_alloc_* memory, either in
 * kernel space if remapping is needed, or to userspace through dma_mmap_*.
 */
pgprot_t dma_pgprot(struct device *dev, pgprot_t prot, unsigned long attrs)
{
    if (dev_is_dma_coherent(dev))
        return prot;
#ifdef CONFIG_ARCH_HAS_DMA_WRITE_COMBINE
    if (attrs & DMA_ATTR_WRITE_COMBINE)
        return pgprot_writecombine(prot);
#endif
    return pgprot_dmacoherent(prot);
}
#endif /* CONFIG_MMU */

Note that dev_is_dma_coherent() is used here. If dev_is_dma_coherent() is true, dma_pgprot() does nothing and returns prot unchanged, so the mapping keeps its default (cached) attributes. If dev_is_dma_coherent() is false, dma_pgprot() returns the result of pgprot_dmacoherent(), or of pgprot_writecombine() when DMA_ATTR_WRITE_COMBINE is set.

pgprot_dmacoherent() returns an architecture-dependent value. On arm64, pgprot_dmacoherent() returns the same value as pgprot_writecombine(), which selects non-cacheable memory.
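The decision made by dma_pgprot() can be condensed into a small user-space model. This only illustrates the branches quoted above under the assumption of arm64 behaviour. The names model_dev, model_dma_pgprot(), MODEL_ATTR_WRITE_COMBINE, MODEL_CACHED and MODEL_NONCACHED are made up for this sketch and are not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins only; none of these names exist in the kernel. */
enum model_mapping {
    MODEL_CACHED,     /* prot returned unchanged */
    MODEL_NONCACHED   /* arm64 result of pgprot_writecombine()/pgprot_dmacoherent() */
};

#define MODEL_ATTR_WRITE_COMBINE (1u << 0)  /* stands in for DMA_ATTR_WRITE_COMBINE */

struct model_dev {
    bool dma_coherent;  /* stands in for dev_is_dma_coherent(dev) */
};

/* Mirrors the branch structure of dma_pgprot() as seen on arm64. */
enum model_mapping model_dma_pgprot(const struct model_dev *dev, unsigned int attrs)
{
    assert(dev != NULL);

    if (dev->dma_coherent)
        return MODEL_CACHED;      /* coherent device: nothing is changed */
    if (attrs & MODEL_ATTR_WRITE_COMBINE)
        return MODEL_NONCACHED;   /* pgprot_writecombine() */
    return MODEL_NONCACHED;       /* pgprot_dmacoherent(), same value on arm64 */
}
```

In this model the attrs argument makes no difference for a non-coherent device: both remaining branches produce a non-cacheable mapping, which is why the only way to keep the cache is for the device to be DMA-coherent.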

Conclusion

On the arm64 architecture, V4L2 drivers employing dma-contig turn off the cache on mmap whenever the device is not DMA-coherent, that is, whenever dev_is_dma_coherent() is false.
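One way to check whether a given mapping is affected is to time reads from it. The sketch below is an assumption about how this could be measured, not something taken from the driver or the kernel: read_bandwidth_mbps() is a hypothetical helper that repeatedly copies a buffer into ordinary heap memory and reports the throughput.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy `len` bytes from `src` into a heap buffer
 * `reps` times and return the read throughput in MB/s (10^6 bytes per
 * second). Returns a negative value if allocation or timing fails.
 */
double read_bandwidth_mbps(const void *src, size_t len, int reps)
{
    struct timespec t0, t1;
    volatile unsigned char sink = 0;
    unsigned char *dst;
    double secs;
    int i;

    assert(src != NULL && len > 0 && reps > 0);

    dst = malloc(len);
    if (dst == NULL)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < reps; i++) {
        memcpy(dst, src, len);
        /* Touch the result so the copy is not optimised away. */
        sink ^= dst[(size_t)i % len];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    free(dst);

    secs = (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return -1.0;

    return ((double)len * (double)reps) / secs / 1e6;
}
```

Running it once on the mmap'ed V4L2 buffer and once on a malloc'ed buffer of the same size gives two numbers to compare. If the V4L2 mapping is uncached, its figure should be markedly lower.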

ArmandBENETEAU commented 11 months ago

Hi all!

Firstly, I want to sincerely say thank you to all the contributors to this project and more specifically to @ikwzm. It seems to be a really dynamic project, where all the issues get answers. This is great!

I am in a situation a bit like @octopus-russell's. I want to capture images from a camera using v4l2 and I need to write them at a given address in physical memory. Several solutions exist of course, but from my point of view using DMA seems to be the best way. In v4l2 it corresponds to the V4L2_MEMORY_DMABUF method.

So I've entered the dark world of DMA in Linux. After several hours/days of research around the web, I came across this device driver and I thought for one glorious second that I had found the right way. But after several attempts, I finally found out that I cannot export dma-buffer file descriptors using u-dma-buf...

Then, I spent another day trying to find a tool that allows me to export a dma-buffer file descriptor from a physical memory address... in vain. And a bit randomly, I came across this issue, which is really close to what I want to do! So maybe I can find the solution here.

@kbingham, you said that "udmabuf likely isn't a good way to pass the buffers", and I'm wondering if you could give me a hint on how to do what I want, i.e. using the V4L2_MEMORY_DMABUF method to write directly at a given address in physical memory.

Sorry for this looooong text, and have a good day!

PS1: unfortunately since December 2022, the strategy consisting in using V4L2_MEMORY_USERPTR method with u-dma-buf cannot work due to this commit

PS2: I am aware that this may not be the right place to ask. If that is the case, could you redirect me to the right place?

ikwzm commented 11 months ago

Thank you for your valuable information.

> PS1: unfortunately since December 2022, the strategy consisting in using V4L2_MEMORY_USERPTR method with u-dma-buf cannot work due to this commit

I did not know that the V4L2_MEMORY_USERPTR method was no longer available. It would be a shame if that is the case.

Nothing is decided yet, but I am currently trying to add the ability to export u-dma-bufs as dma-bufs. It is under development in the branch below, but unfortunately it is not working yet. I can export, but when I try to remove u-dma-buf from the kernel, the kernel panics.

https://github.com/ikwzm/udmabuf/tree/dma-buf-export-develop

I still have a long way to go, but I will make it public when it is ready.

ArmandBENETEAU commented 11 months ago

You are welcome. It is still available but, as far as I understand, v4l2 refuses to use it if it ultimately involves writing directly to a physical memory address.

Wow, having u-dma-bufs as dma-bufs would be great indeed! I hope you will succeed in making it work.

Thank you for your answer.