vsergeev / python-periphery

A pure Python 2/3 library for peripheral I/O (GPIO, LED, PWM, SPI, I2C, MMIO, Serial) in Linux.
MIT License
519 stars 139 forks source link

Duplicate writes using MMIO [Zynq / Yocto / Petalinux / Xilinx / ARM64] #50

Closed paul-demo closed 3 years ago

paul-demo commented 3 years ago

I found on a Zynq device that python-periphery MMIO duplicated writes compared to using devmem. Any thoughts on why this could be? Does it have under the hood some kind of "retry" where it would perform the write twice? The device has a 64-bit ARM64 architecture (quad core A53 I believe).

https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/axi-fifo-mm-s-AXI-FIFO-duplicates-words-requires-wrong-TX-length/m-p/1200351#M56397

paul-demo commented 3 years ago

I see this library is a thin wrapper around mmap; in that case perhaps the issue is that line 65 of mmio.py should specify O_DIRECT in addition to O_SYNC. What do you think? It is apparently not available on all platforms. I could check whether this actually makes a difference on the Zynq platform.

line 65: fd = os.open(path, os.O_RDWR | os.O_SYNC | getattr(os, "O_DIRECT", 0))

References: source: https://man7.org/linux/man-pages/man2/open.2.html https://docs.python.org/3/library/os.html#os.O_DIRECT

vsergeev commented 3 years ago

Interesting. We should try to isolate whether it's happening in Python (I'm a little suspicious of the slice assignments in mmio.py) or some problem with the mapping flags as you brought up. Would you mind trying a write with lua-periphery or c-periphery? Those implementations are essentially the same as the Python one, except the actual reads/writes are simple pointer dereferences.

paul-demo commented 3 years ago

I've confirmed with a logic analyzer that the write32 and read32 are in fact duplicated when using periphery and they are not duplicated when using devmem. I also tried O_DIRECT but that did not work and I got errors at the file-opening step due to O_DIRECT being an invalid argument. I tried it with O_RDWR always included, but with O_SYNC either included or not included (error in both cases). Here's the logic analyzer screenshots as proof if you want to see what an AXI transaction looks like:

Using devmem to write:

Screen Shot 2021-02-04 at 12 10 47 PM

Using periphery to write32:

Screen Shot 2021-02-04 at 12 11 53 PM
paul-demo commented 3 years ago

c-periphery appears to work as intended (no duplicate writes), so that suggests it's some problem with the python version or the mmap module you're using, and not some weirdness with the device itself. I just stripped down your test_mmio and compiled it natively:

#include "../src/mmio.h"

#define CONTROL_MODULE_BASE     0x80000000
#define PAGE_SIZE               4096

int main(void) {
  mmio_t *mmio;
  mmio = mmio_new();
  mmio_open(mmio, CONTROL_MODULE_BASE, PAGE_SIZE);
  mmio_write32(mmio, 0x0, 0x0);
  mmio_close(mmio);
  mmio_free(mmio);
  return 0;
}
paul-demo commented 3 years ago

The version of Yocto [Zeus] that my build is based on includes v1.1.1 of python-periphery; but after reviewing whether that has any important differences compared to v2.2.0 that I get using pip3 on my desktop, I think it does not and they are basically the same implementation.

paul-demo commented 3 years ago

The underlying mmap usage also duplicates transactions, so maybe it's a lower level thing related to the fd open step.

This produces 2 writes:

mapping.seek(0); mapping.write(bytearray([0x0,]* 4))

This produces 4 writes:

mapping.seek(0); mapping.write(bytearray([0x0,]* 8))
paul-demo commented 3 years ago

Ok here's what works (as suggested here: https://stackoverflow.com/questions/53492716/python-writing-to-memory-in-a-single-operation). This seems like a bug in mmap, which I've reported here https://bugs.python.org/issue43131).

import os, mmap
fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
mapping = mmap.mmap(fd, 4096, flags=mmap.MAP_SHARED, prot=(mmap.PROT_READ | mmap.PROT_WRITE), offset=0x80000000)
mv = memoryview(mapping)
mv_as_int32 = mv.cast('I')  # Note "I" works as intended, "L" still results in duplicate writes; "LL" is not an option
mv_as_int32[0x0 // 4] = 0xDEADBEEF  # this works; results in 1 write issued on the AXI bus

I guess I will just not use python-periphery and instead use this lower-level workaround directly.

vsergeev commented 3 years ago

Thanks for testing and getting to the bottom of it @paul-demo. I'll have to migrate the MMIO module to use these more accurate writes. Unfortunately, it seems that the .cast() method on memoryview() isn't available in Python 2, so there will have to be another workaround there.

While you still have your ILA and environment still setup, would you mind checking if writes through the smaller casts, e.g. mv.cast('B') and mv.cast('H'), result in smaller single transfers on the AXI bus? e.g. with AWSIZE = 0 and 1, respectively.

Also, another approach I'm curious about testing is dereferencing the memoryview() through a ctypes pointer, e.g.:

import ctypes

# mapping code as before, or mapping = bytearray(b"\x88\x77\x66\x55\x44\x33\x22\x11") for testing

mv = memoryview(mapping)

p = ctypes.c_uint32.from_buffer(mv, 0)
print(hex(p.value)) # -> 0x55667788
p.value = 0xaabbccdd

p = ctypes.c_uint16.from_buffer(mv, 0)
print(hex(p.value)) # -> 0xccdd

This approach can be made to work with Python 2 as well.

edit: fixing AWSIZE values for 1 and 2 byte transfers

paul-demo commented 3 years ago

Sure, I don't mind testing some different workarounds in the next day or so. If you could package up all the versions you want tested in a separate function for each, I can re-trigger the ILA and run each one and report the results that I see. You can just read/write any value to 0x80000000; that's a 32-bit GPIO that I used for my testing.

vsergeev commented 3 years ago

Thanks, I really appreciate it. Whenever you have a chance is fine. You've just got the perfect setup for this at the moment. Here is a test script with every combination (12 total -- 6 for memoryview cast, 6 for ctypes):

import os
import mmap
import ctypes
import sys

def mv_write32(mapping):
    mv = memoryview(mapping)
    p = mv.cast('I')
    p[0] = 0xdeadbeef

def mv_write16(mapping):
    mv = memoryview(mapping)
    p = mv.cast('H')
    p[0] = 0xbeef

def mv_write8(mapping):
    mv = memoryview(mapping)
    p = mv.cast('B')
    p[0] = 0xaa

def mv_read32(mapping):
    mv = memoryview(mapping)
    p = mv.cast('I')
    print(hex(p[0]))

def mv_read16(mapping):
    mv = memoryview(mapping)
    p = mv.cast('H')
    print(hex(p[0]))

def mv_read8(mapping):
    mv = memoryview(mapping)
    p = mv.cast('B')
    print(hex(p[0]))

def ctypes_write32(mapping):
    p = ctypes.c_uint32.from_buffer(mapping, 0)
    p.value = 0xdeadbeef

def ctypes_write16(mapping):
    p = ctypes.c_uint16.from_buffer(mapping, 0)
    p.value = 0xbeef

def ctypes_write8(mapping):
    p = ctypes.c_uint8.from_buffer(mapping, 0)
    p.value = 0xaa

def ctypes_read32(mapping):
    p = ctypes.c_uint32.from_buffer(mapping, 0)
    print(hex(p.value))

def ctypes_read16(mapping):
    p = ctypes.c_uint16.from_buffer(mapping, 0)
    print(hex(p.value))

def ctypes_read8(mapping):
    p = ctypes.c_uint8.from_buffer(mapping, 0)
    print(hex(p.value))

TESTS = [mv_write32, mv_write16, mv_write8, mv_read32, mv_read16, mv_read8,
         ctypes_write32, ctypes_write16, ctypes_write8, ctypes_read32, ctypes_read16, ctypes_read8]

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: {} <test>\n".format(sys.argv[0]))
        print("Available tests:\n    " + "\n    ".join([func.__name__ for func in TESTS]))
        sys.exit(0)

    fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
    mapping = mmap.mmap(fd, 4096, flags=mmap.MAP_SHARED, prot=(mmap.PROT_READ | mmap.PROT_WRITE), offset=0x80000000)

    globals()[sys.argv[1]](mapping)

If you could run the 6 mv and 6 ctypes ones under Python 3, and then 6 ctypes under Python 2 (the mv ones won't work under Python 2), that would be awesome. You can skip the mv write32 and read32 cases, as it appears you've verified those above, but I included them for completeness.

paul-demo commented 3 years ago

It looks like all 12 of those methods work in python3; I don't have python2 installed in this build so can't test those (python2 is no longer supported anyway; time to move on). I increased the ILA capture length and put them all in a loop, which also shows that you can do 12 transactions in about 30,000 cycles of the ~100MHz bus, or roughly 25 microseconds per read or write operation. I zoomed in and confirmed that each ready/valid handshake is in fact a single transaction on the bus.

for test in TESTS:
    test(mapping)
0xdeadbeaa
0xbeaa
0xaa
0xdeadbeaa
0xbeaa
0xaa
image
vsergeev commented 3 years ago

Thanks a lot for testing @paul-demo! I've pushed a fix to the devel, which will be merged to master in the next release (and automatically close this issue).