raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

remap_pfn_range is not able to remap memory to userspace for DMA-Memory allocated with dma_alloc_coherent with right cache attributes #4680

Open berndbenner opened 3 years ago

berndbenner commented 3 years ago

Detected on a Raspberry Pi CM4, aarch64 and arm kernel 5.10.

remap_pfn_range (re)maps memory allocated with dma_alloc_coherent with the wrong TLB cache settings.

A user-mode process reading the mapped memory (the DMA buffer) from the CPU reads invalid values, not the content transferred by a PCIe bus-master device (a Xilinx FPGA board).

remap_pfn_range works on the x86/AMD64 architecture without any special settings (only vma->vm_flags |= VM_RESERVED;).

All attempts to preset vma->vm_page_prot also produce invalid read results from this area.

vma->vm_flags |= VM_RESERVED;
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);

// no success with this
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

#if  defined(__aarch64__) || defined(__arm__)
// no success with this
// macro pgprot_dmacoherent not defined on  x86/AMD64 Arch
vma->vm_page_prot = pgprot_dmacoherent(vma->vm_page_prot);
#endif

The only quick hack I found was to use the DMA API function dma_mmap_coherent instead of remap_pfn_range on the arm/aarch64 architecture in the device driver.

#if  defined(__aarch64__) || defined(__arm__) 
    err = dma_mmap_coherent(pci_dev_to_dev(md->pdev),
                      vma,
                      (void *)(vma->vm_start),
                      md->ci.dmabuf_info [dma_index] .physical_address,
                      (size_t)(vma->vm_end - vma->vm_start));
#else    
     err= remap_pfn_range (vma,
         ...
#endif

I think this is a BUG ( not a feature ) of the ARM/AARCH64 Architecture or the BCM2711 implementation.

pelwell commented 3 years ago

> I think this is a BUG ( not a feature ) of the ARM/AARCH64 Architecture or the BCM2711 implementation.

Do you have some supporting evidence for this assertion?

berndbenner commented 3 years ago

No! (I have not found any detailed specification of which types of memory remap_pfn_range is able to remap.) OK! Maybe it's a feature, and it is intended that every driver module source has to use conditional compilation for each supported architecture. In that case I'm too stupid to understand the advantages of the ARCH subsystem and the encapsulation of architectures.

pelwell commented 3 years ago

As far as I know - and I'm not an expert - ARMs don't have a cache-coherency mechanism for DMA and other devices. If you want the cores to see the result of all memory accesses by a device then it has to be mapped as uncached. This can kill performance unless you are doing wide accesses, so it is usually more efficient to leave the mapping as cached and perform the necessary cache maintenance operations before and after the device transfer.

Does the dma_mmap_coherent call not work on X86? It's fairly normal for the kernel to have functions for operations that may be necessary on some architectures, and for either trivial or empty implementations on others. In this case, that might mean that dma_mmap_coherent collapses down into a call to remap_pfn_range.
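
As an illustration of the idea above, here is a minimal sketch of an mmap handler that calls dma_mmap_coherent on every architecture, with no conditional compilation. It is a kernel fragment, not a complete driver. The struct `my_dma_buf`, its fields, and the handler name `my_mmap` are hypothetical and do not come from the driver discussed in this thread. It assumes the buffer was allocated with dma_alloc_coherent and that both values returned by that call (the kernel virtual address and the DMA handle) were saved. dma_mmap_coherent itself is declared in include/linux/dma-mapping.h. Whether it behaves correctly on a particular x86 setup, which this thread disputes, would need to be tested.

```c
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical per-buffer state; these names are illustrative only. */
struct my_dma_buf {
	struct device *dev;     /* device that was passed to dma_alloc_coherent() */
	void *cpu_addr;         /* kernel virtual address returned by dma_alloc_coherent() */
	dma_addr_t dma_handle;  /* DMA handle filled in by dma_alloc_coherent() */
	size_t size;            /* size that was passed to dma_alloc_coherent() */
};

/* Assumes open() stored a struct my_dma_buf pointer in filp->private_data. */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct my_dma_buf *buf = filp->private_data;
	size_t len = vma->vm_end - vma->vm_start;

	/* Only allow mapping from the start of the buffer, within its bounds. */
	if (vma->vm_pgoff != 0 || len > PAGE_ALIGN(buf->size))
		return -EINVAL;

	/*
	 * The DMA layer picks page protections that match how the buffer
	 * was allocated, so vma->vm_page_prot is not set by hand here.
	 */
	return dma_mmap_coherent(buf->dev, vma, buf->cpu_addr,
				 buf->dma_handle, buf->size);
}
```

Note that the third argument is the kernel virtual address of the buffer, not a user-space address such as vma->vm_start.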

berndbenner commented 3 years ago

dma_mmap_coherent does not work on x86.
dma_mmap_coherent is not listed in https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt or https://www.kernel.org/doc/html/latest/core-api/dma-api.html

Maybe it is a little exotic!
Maybe it works with other allocation functions such as pci_alloc_consistent, but that will not work on ARM/AARCH64/cm4.

Our FPGA board and its driver originally ran on x86/AMD64 with other API functions (pci-..).
My main task in porting and testing it on ARM/ARM64/rpi-cm4 was to find a combination of kernel API functions that works on both architectures. Currently "dma_mmap_coherent" is the last one that still needs architecture-conditional compilation.

To your tip:

I have tried to remap uncached, by setting:

vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

before the call of remap_pfn_range.

I am also not an ARM expert, but my understanding of the ARM8/BCM2711 documentation I found is that ARM8/9 has address-snooping logic with cache-line invalidation, which allows caching to stay enabled with bus-master DMA devices.
Hardware-managed cache coherency is a big enhancement. (Software-managed cache coherency in multi-core, multi-master systems is hell, and uncached DDR RAM access is very slow.)

I guess that dma_alloc_coherent sets up the TLBs of the MMU correctly. (dma_alloc_coherent is used in many gigabit network drivers, Intel, Realtek .., to allocate the TX and RX bus-mastering DMA ring buffers.) I have tested these network boards and drivers on the Raspberry Pi CM4. They work well.

So I guess the first mapping, to kernel space, works. The remap to other logical addresses does not. (Trouble copying the cache attributes???)

I can live with this issue!!! My driver module works with conditional compilation on x86/AMD64 and cm4 ARM/AARCH64.

I have opened this issue to allow and trigger enhancements of the Raspberry Pi kernel for the use of the rpi-cm4 with industrial PCIe data acquisition boards.

P33M commented 3 years ago

> I am also not an ARM expert, but my understanding of the ARM8/BCM2711 documentation I found is that ARM8/9 has address-snooping logic with cache-line invalidation, which allows caching to stay enabled with bus-master DMA devices.

There is no cache coherency between any of the bus mastering peripherals on BCM2711 and the CPU. All bus masters exist outside of the outer shareable domain.

If the function you're trying to call isn't listed in the kernel DMA API documentation, then you shouldn't use it. Such functions are typically private.

dma_alloc_coherent() is one way to resolve synchronisation of memory between CPU and device, but this is generally slow, as it forces invalidations / flushes on every access to the buffer. If your data flow is well-defined, i.e. unidirectional with clear fences and synchronisation mechanisms, then the dma functions that sync buffers for_device and for_cpu will do the correct cache maintenance operation for you.
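
The streaming approach described above could look roughly like the following sketch, for a buffer that the device writes and the CPU then reads inside the kernel. It is a fragment under stated assumptions, not a complete driver. The struct `my_stream_buf` and the three helper names are hypothetical. The calls dma_map_single, dma_mapping_error, dma_sync_single_for_cpu and dma_sync_single_for_device are part of the documented kernel DMA API. The sketch does not address mapping such a buffer into user space, which is the original question in this thread.

```c
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Hypothetical state for one device-to-CPU streaming buffer. */
struct my_stream_buf {
	struct device *dev;
	void *cpu_addr;       /* ordinary cached kernel memory from kmalloc() */
	dma_addr_t dma_addr;  /* address to program into the device */
	size_t size;
};

/* Allocate a cached buffer and hand ownership to the device. */
static int my_stream_buf_init(struct my_stream_buf *b, struct device *dev,
			      size_t size)
{
	b->cpu_addr = kmalloc(size, GFP_KERNEL);
	if (!b->cpu_addr)
		return -ENOMEM;

	b->dma_addr = dma_map_single(dev, b->cpu_addr, size, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, b->dma_addr)) {
		kfree(b->cpu_addr);
		return -ENOMEM;
	}

	b->dev = dev;
	b->size = size;
	return 0;
}

/* Call after the device has signalled that its transfer is complete. */
static void my_stream_buf_begin_cpu_access(struct my_stream_buf *b)
{
	/* Performs the cache maintenance needed before the CPU reads. */
	dma_sync_single_for_cpu(b->dev, b->dma_addr, b->size, DMA_FROM_DEVICE);
}

/* Call when the CPU is done and the device may write the buffer again. */
static void my_stream_buf_end_cpu_access(struct my_stream_buf *b)
{
	dma_sync_single_for_device(b->dev, b->dma_addr, b->size,
				   DMA_FROM_DEVICE);
}
```

The CPU should only touch the buffer between the two sync calls. Outside that window the buffer belongs to the device.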

berndbenner commented 2 years ago

I'm a little confused by the comments of the kernel contributors! In my opinion, the memory throughput of bus-master DMA transfers is a secondary problem. (The x86/AMD64 MMU snooping logic does not do much more than mark cache lines as invalid, triggering a slow re-read from DDR RAM on the next access. Maybe it is more fine-grained, and so performs better than on ARM/AARCH64, but there is no magic cache-coherency hardware implemented! The write-combining capabilities are another topic. I have also read some performance issue reports about dma_alloc_coherent, but it is used in many driver sources in the kernel tree to allocate DMA memory!) The main task is to ensure that the CPU reads valid content from a logical memory address remapped from kernel space to user space, a common technique in real-time data acquisition applications (with a well-defined unidirectional hardware/software synchronisation mechanism). remap_pfn_range, the API function intended for handling mmap, works well on x86/AMD64 using the same subsystem calls, but is unable to ensure this on the rpi-cm4 with any of the parameters that seem sensible for DMA memory (including uncached). But maybe it is not a BUG?
The workaround DMA subsystem function I found should not be called, because it's a private subsystem function.
I see no proposed solution! So: sorry for reporting an issue to the contributors of the rpi-kernel!