ikwzm / udmabuf

User space mappable dma buffer device driver for Linux.
BSD 2-Clause "Simplified" License
539 stars 165 forks source link

(Again) memory cache management and coherency #79

Open pmdaye opened 3 years ago

pmdaye commented 3 years ago

Hello,

First of all, this is an AMAZING driver and I am really grateful that you provide it through github. I am working with an Zynqmp and petalinux 2020.2 There is something unclear to me for the manual cache management. The driver allocates a coherent memory region. I have a second module which interacts with the PL to provide the physical address (given by u_dma_buf) where our IP must directly writes. My module receives an interrupt when the data are written within the memory area by the PL (I am sure that there is no cache here as we are directly writing in the memory area). When I read the data, I see that I do have cache issues, even if I use sync_for_cpu. What am I missing? From my understanding, I should not have any cache issue as I only read some memory region.

Thanks,

PMD

pmdaye commented 3 years ago

By the way, we are not using the ACP. Could it be the problem?

ikwzm commented 3 years ago

Thank you for the issue.

However, I cannot answer because I do not know the situation you are encountering. Can you provide me with source code or device tree?

pmdaye commented 3 years ago

Sorry for the lack of description. This is my device tree ` reserved-memory {

address-cells = <2>;

    #size-cells = <2>;
    ranges;

    image_buf0: image_buf@0 {
        compatible = "shared-dma-pool";
        reusable;
        reg = <0x0 0x40000000 0x0 0x2000000>;
        label = "image_buf0";
   };
};

udmabuf@0 {
    compatible = "ikwzm,u-dma-buf";
    device-name = "p3-udmabuf0";
    size = <0x2000000>; // 64MiB
    memory-region = <&image_buf0>;
    dma-coherent;
};

To be more clear, we wrote a v4l2 driver that will use CMA buffers within the p3-udmabuf0 to access images written by the PL. Therefore we divide the p3-udmabuf0 into a series of sub memory range using thiis. I found in one of the issues that I have to initialize the memory region to ensure that It can be accessed by the PL afterwards if (int fd; (fd = open(_buf_file.c_str(), O_RDWR | O_SYNC)) != -1) { _mptr = mmap(NULL, _bufferSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (_mptr == MAP_FAILED) { std::error_code ec{errno, std::system_category()}; throw std::system_error(ec, "Can't map CMA memory"); } _fd = fd; if (init) { uint64_t word_ptr = reinterpret_cast<uint64_t>(_mptr); size_t words = _bufferSize / sizeof(uint64_t); for(int i = 0 ; i < words; i++) { word_ptr[i] = idx << 0 | idx << 8 | idx << 16 | idx << 24 | idx << 32 | idx << 40 | idx << 48 | idx << 56; } } get_prop("phys_addr") >> std::hex >> _phys_addr; get_prop("size") >> std::dec >> _bufferSize; _fd = -1; } else { return; }

_singleBufferPointers.clear();
_singleBufferPointers.resize(_n_buffers);

uint64_t* ptr = reinterpret_cast<uint64_t*>(_mptr);

setSyncDirection();

for (size_t ii = 0; ii < _n_buffers; ii++) {
    _singleBufferPointers[ii] = ptr + ii * _singleBufferSize / sizeof(uint64_t);
}`

Then we queued these buffers with v4l2. Once one image had been written in one of the buffer, we set the corresponding buffer as being ready and we dequeued it from the userspace. The results we have is that the first image is OK. However, if we have 32 buffers, the buffers from 1 to 31 are empty (they remain at their initialization value) but when we loop over the same buffers again, now we have our images... From all the analyses we made, it seems that it is an issue with cache coherency but we are unable to correct the issue.

This part of the code is responsible for dequeueing/enqueueing the buffer. ` CLEAR(buf); buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE; buf.memory = V4L2_MEMORY_USERPTR; xioctl(fd, VIDIOC_DQBUF, &buf); buffers.sync_for_cpu(buf.index); buffers.sync_for_cpu(buf.index);

pfile.addSaccadeFrame(kk, buffers.getMappedPtr(buf.index)); xioctl(fd, VIDIOC_QBUF, &buf);`

This is the function that should clean the cache problems (if I understood correctly)

`void p3_udmabuf::sync_for_cpu(uint32_t buf_idx) const {

char attr[1024];
const unsigned long  sync_offset = buf_idx * _singleBufferSize;
const unsigned long  sync_size = _singleBufferSize;
unsigned int   sync_direction = 2;
unsigned long  sync_for_cpu   = 1;
auto path = (sys_prefix + "/").append(_requested_buf).append("/sync_for_cpu");
if (int fd; (fd  = open(path.c_str(), O_WRONLY)) != -1) {
    snprintf(attr, sizeof(attr), "0x%08X%08X", (sync_offset & 0xFFFFFFFF), (sync_size & 0xFFFFFFF0) | (sync_direction << 2) | sync_for_cpu);
    write(fd, attr, strlen(attr));
    close(fd);
}

}`

(We also tried with sync_direction=1 but it doesn't work.

I hope that this is clearer and thank you in advance for the time you would take to read this rather long post...

ikwzm commented 3 years ago

It seems that the cache is not enabled because you specified the O_SYNC flag when opening the file. Maybe it's not a cache issue.

By the way, do you need buf.m.userptr = _singleBufferPointers [buf.index]; before xioctl (fd, VIDIOC_QBUF, &buf)?

pmdaye commented 3 years ago

Thanks for your reply. A little more context would maybe help you understanding what we are doing. The PL IP reads the data from an ONSemi Python300 sensor. The goal here is to minimize the amount of copy when trigger is emitted by the PL to signal that an image is ready. Using the approach of the userptr in v4l2 seems like the perfect solution combined with your driver. Therefore the temporal order we have is:

  1. We signal to v4l2 that it will have n buffers to work with (32 is the maximum)
  2. We create the corresponding number of buffers using u_dma_buf
  3. We queue all the created buffers with the corresponding index, the mmaped address and the size of the buffer (in userspace)
    • Within the kernel, the function that adds the buffer to the queue extract the corresponding physical address and sends it to a FIFO within the PL (kernel space)
    • Now, the PL has a FIFO with a series of physical addresses to which it has to write each image sequentially
  4. We start the video stream
    • We wait for an interrupt (in kernel space)
    • We signal to v4l2 that an buffer is ready to be exported to userspace
    • We dequeue the buffer (in userspace)
    • We memcpy the data (in userspace)
    • We requeue the buffer within the PL FIFO as in point 3.

To validate that the data are correctly written by the PL to the physical addresses, we built a baremetal example that runs smoothly and in which we can check that the data are correctly written (using a test pattern instead of the actual image sensor). HOWEVER!! To make working this baremetal software, we need to invalidate the DCacheLine using Xil_DCacheInvalidateRange(); We would like to keep the manual cache management as we have shown in the baremetal that it is way faster than without it (as expected I believe). I believed that I could make it work using sync_for_cpu with the sync_direction set to 2 to force and invalidate cache as explained in your documentation.

However the pattern we observe is very strange. To demonstrate the behavior, I logged different pieces of information.

The results are below:

sequence: 0, buf_idx: 0 ptr: 0xffff947dc000 phys: 40000000 val: 0x181e161e151d151c,171c161d161b161e,191c1a1d181b191b,1b1c1a1d1a1d1b1c sequence: 1, buf_idx: 1 ptr: 0xffff94872000 phys: 40096000 val: 0x161e161f171d151f,161e171d161e181e,101010101010101,101010101010101 sequence: 2, buf_idx: 2 ptr: 0xffff94908000 phys: 4012c000 val: 0x161c171b151c161d,171b161a151c151d,202020202020202,202020202020202 sequence: 3, buf_idx: 3 ptr: 0xffff9499e000 phys: 401c2000 val: 0x161f171f171d151f,1620161e1620181e,303030303030303,303030303030303 sequence: 4, buf_idx: 4 ptr: 0xffff94a34000 phys: 40258000 val: 0x151c151d151d141c,171d161d151d161e,404040404040404,404040404040404 sequence: 5, buf_idx: 5 ptr: 0xffff94aca000 phys: 402ee000 val: 0x171e171d161e171e,1820181f181e1620,505050505050505,505050505050505 sequence: 6, buf_idx: 6 ptr: 0xffff94b60000 phys: 40384000 val: 0x151c171b141e151b,161c161d181e151d,606060606060606,606060606060606 sequence: 7, buf_idx: 7 ptr: 0xffff94bf6000 phys: 4041a000 val: 0x161e151d181f151d,1820171f181e181e,707070707070707,707070707070707 sequence: 8, buf_idx: 8 ptr: 0xffff94c8c000 phys: 404b0000 val: 0x161d161c151d151c,161c151d171c151c,808080808080808,808080808080808 sequence: 9, buf_idx: 9 ptr: 0xffff94d22000 phys: 40546000 val: 0x171e171d1520171f,1720171f1820181e,909090909090909,909090909090909 sequence: 10, buf_idx: 10 ptr: 0xffff94db8000 phys: 405dc000 val: 0x151d151c151b141d,161c171c151c161c,a0a0a0a0a0a0a0a,a0a0a0a0a0a0a0a sequence: 11, buf_idx: 11 ptr: 0xffff94e4e000 phys: 40672000 val: 0x1620161f161d1620,1720181e1720161f,b0b0b0b0b0b0b0b,b0b0b0b0b0b0b0b sequence: 12, buf_idx: 12 ptr: 0xffff94ee4000 phys: 40708000 val: 0x141c151d141e141b,151d151b141e151f,c0c0c0c0c0c0c0c,c0c0c0c0c0c0c0c sequence: 13, buf_idx: 13 ptr: 0xffff94f7a000 phys: 4079e000 val: 0x182015201520151f,1620171f1620161f,d0d0d0d0d0d0d0d,d0d0d0d0d0d0d0d sequence: 14, buf_idx: 14 ptr: 0xffff95010000 phys: 40834000 val: 0x151e151f151d131c,141d141d131c151e,e0e0e0e0e0e0e0e,e0e0e0e0e0e0e0e sequence: 15, buf_idx: 15 ptr: 0xffff950a6000 phys: 408ca000 val: 0x1521141f15221620,1720162115211620,f0f0f0f0f0f0f0f,f0f0f0f0f0f0f0f sequence: 16, buf_idx: 16 ptr: 0xffff9513c000 phys: 40960000 val: 0x131c121c151e141d,131e141c151d141d,1010101010101010,1010101010101010 sequence: 17, buf_idx: 17 ptr: 0xffff951d2000 phys: 409f6000 val: 0x1520162115211621,161f152015201420,1111111111111111,1111111111111111 sequence: 18, buf_idx: 18 ptr: 0xffff95268000 phys: 40a8c000 val: 0x141f131c131c131e,131c141e131d141d,1212121212121212,1212121212121212 sequence: 19, buf_idx: 19 ptr: 0xffff952fe000 phys: 40b22000 val: 0x1521141f14201420,1521152115201620,1313131313131313,1313131313131313 sequence: 20, buf_idx: 20 ptr: 0xffff95394000 phys: 40bb8000 val: 0x131e141c121e141d,131d141d141d131d,1414141414141414,1414141414141414 sequence: 21, buf_idx: 21 ptr: 0xffff9542a000 phys: 40c4e000 val: 0x1521141f1520151f,1620141f16201620,1515151515151515,1515151515151515 sequence: 22, buf_idx: 22 ptr: 0xffff954c0000 phys: 40ce4000 val: 0x131e141e121d141f,141c141d141f131c,1616161616161616,1616161616161616 sequence: 23, buf_idx: 23 ptr: 0xffff95556000 phys: 40d7a000 val: 0x15211421151e1421,1620161f141e151e,1717171717171717,1717171717171717 sequence: 24, buf_idx: 24 ptr: 0xffff955ec000 phys: 40e10000 val: 0x141c141d141b141d,151e141c141d131d,1818181818181818,1818181818181818 sequence: 25, buf_idx: 25 ptr: 0xffff95682000 phys: 40ea6000 val: 0x1520151f141f1521,151f181f151f1520,1919191919191919,1919191919191919 sequence: 26, buf_idx: 26 ptr: 0xffff95718000 phys: 40f3c000 val: 0x151d141d131b121c,161e131d141d141d,1a1a1a1a1a1a1a1a,1a1a1a1a1a1a1a1a sequence: 27, buf_idx: 27 ptr: 0xffff957ae000 phys: 40fd2000 val: 0x162015211620161f,1720171e15201720,1b1b1b1b1b1b1b1b,1b1b1b1b1b1b1b1b sequence: 28, buf_idx: 28 ptr: 0xffff95844000 phys: 41068000 val: 0x141b141d141d131d,151e151f141d151d,1c1c1c1c1c1c1c1c,1c1c1c1c1c1c1c1c sequence: 29, buf_idx: 29 ptr: 0xffff958da000 phys: 410fe000 val: 0x1620152115201622,1820182017201620,1d1d1d1d1d1d1d1d,1d1d1d1d1d1d1d1d sequence: 30, buf_idx: 30 ptr: 0xffff95970000 phys: 41194000 val: 0x151e131d131d131f,161d161c131b151d,1e1e1e1e1e1e1e1e,1e1e1e1e1e1e1e1e sequence: 31, buf_idx: 31 ptr: 0xffff95a06000 phys: 4122a000 val: 0x1622162014201521,1622171f15211621,1f1f1f1f1f1f1f1f,1f1f1f1f1f1f1f1f sequence: 32, buf_idx: 0 ptr: 0xffff947dc000 phys: 40000000 val: 0x141e131d131e141f,161e151f161e151d,1d1c1a1e1c1c1a1c,1b1d1d1e1c1e1c1c sequence: 33, buf_idx: 1 ptr: 0xffff94872000 phys: 40096000 val: 0x1423172215201621,1621192017221620,1a1e181c191b181b,181c191d181f191c sequence: 34, buf_idx: 2 ptr: 0xffff94908000 phys: 4012c000 val: 0x151e141e151e121e,151f161e141f1520,1e1d1b1e1d1d1d1e,1e1f1d1e1b1f1c1c sequence: 35, buf_idx: 3 ptr: 0xffff9499e000 phys: 401c2000 val: 0x1423162316211522,1821182217231822,1a1c191b181b181c,191a181d191c191b sequence: 36, buf_idx: 4 ptr: 0xffff94a34000 phys: 40258000 val: 0x161e151e141f141f,151e151e131e1520,1f1f1b1e1c1d1a1c,1d1f1d1d1d1d1c1d sequence: 37, buf_idx: 5 ptr: 0xffff94aca000 phys: 402ee000 val: 0x1823162116221622,1623172217221621,1a1d1a1d181c191d,191c191c181b1a1b

It seems that only the first 16 bytes are updated with the sync_to_cpu. I am sincerely apologizing for the lack of clarity. I really would appreciate if you had any idea on how to solve this issue...

Thank you again!!

ikwzm commented 3 years ago

It seems that the cache is not enabled because you specified the O_SYNC flag when opening the file. Maybe it's not a cache issue. By the way, do you need buf.m.userptr = _singleBufferPointers [buf.index]; before xioctl (fd, VIDIOC_QBUF, &buf)?

What happened to the answer to this point?

hudan8080808 commented 1 year ago

What happened to the answer to this point? excuse me,have you solved this problem? Can you share the solution