Accelerated memcpy - Githubissues

digetx commented 5 years ago

Add an accelerated memcpy that uses VFP instructions to copy 128 bytes at a time; it gives a significant boost when copying from an uncached buffer. Inspired by the results of https://github.com/ssvb/tinymembench.

    ==========================================================================
    == Framebuffer read tests.                                              ==
    ==                                                                      ==
    == Many ARM devices use a part of the system memory as the framebuffer, ==
    == typically mapped as uncached but with write-combining enabled.       ==
    == Writes to such framebuffers are quite fast, but reads are much       ==
    == slower and very sensitive to the alignment and the selection of      ==
    == CPU instructions which are used for accessing memory.                ==
    ==                                                                      ==
    == Many x86 systems allocate the framebuffer in the GPU memory,         ==
    == accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
    == PCI-E is asymmetric and handles reads a lot worse than writes.       ==
    ==                                                                      ==
    == If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
    == or preferably >300 MB/s), then using the shadow framebuffer layer    ==
    == is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
    == performance improvement. For example, the xf86-video-fbturbo DDX     ==
    == uses this trick.                                                     ==

    VFP copy (from framebuffer)          :  436.1 MB/s (1.7%)
    VFP 2-pass copy (from framebuffer)   :  392.8 MB/s (1.3%)
    ARM copy (from framebuffer)          :  285.8 MB/s (1.8%)
    ARM 2-pass copy (from framebuffer)   :  274.5 MB/s (1.2%)
    standard memcpy (from framebuffer)   :   86.3 MB/s (0.5%)

digetx commented 5 years ago

@kusma The commits are updated, please take a look. If they look okay now, I can merge all of this sometime later, after some more testing. I made some extra changes:

1) Copying is no longer broken into 4K chunks; it is now a single large "VFP" copy when possible.
2) Decided to keep memcpy() for copying from cacheable memory, because I got somewhat different results during testing this time: memcpy() can be ~20% faster in specific cases of cacheline aliasing.
3) Changed the rule for when to use the bounce buffer: it is now used only when both src and dest are uncached. Perf tests showed that this is more optimal.
4) The src address is now always aligned; only dst can be bounced. Perf tests showed that this is more optimal.
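One possible reading of rules 1-3 above, written as a small selection function. This is only a sketch: the enum, the function name and the `src_cached` / `dst_cached` flags are hypothetical and not taken from the patches, and rule 4 (alignment) is not modelled.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; the driver's identifiers may differ. */
enum copy_path {
    COPY_PATH_MEMCPY,      /* plain memcpy() */
    COPY_PATH_VFP_DIRECT,  /* one large VFP copy, no bounce buffer */
    COPY_PATH_VFP_BOUNCE   /* 2-pass VFP copy through the bounce buffer */
};

static enum copy_path choose_copy_path(bool src_cached, bool dst_cached)
{
    /* Rule 2: copying from cacheable memory stays on memcpy(). */
    if (src_cached)
        return COPY_PATH_MEMCPY;

    /* Rule 3: bounce only when both src and dst are uncached. */
    if (!dst_cached)
        return COPY_PATH_VFP_BOUNCE;

    /* Rule 1: otherwise do a single large VFP copy. */
    return COPY_PATH_VFP_DIRECT;
}
```

The point of the sketch is only that the bounce buffer is the exception rather than the default: it is reached on exactly one of the four cached/uncached combinations.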

digetx commented 5 years ago

@kusma Patches updated.

Dropped "offset aligning" copying; we can return to it later once there is a real use case (accelerated DownloadFromScreen, for example).

Tuned copying performance: there is now tegra_copy_block_vfp_arm(), which is a mix of VFP + STMIA; it should be faster than the generic memcpy() for all copy sizes.

BLOCK_SIZE reduced from 4K to 1K; this gives some boost for 2-pass copying.

vfpcpy() no longer clobbers the D8-D15 VFP registers and copies 64B at a time, since I can't see a perf difference compared to 128B.
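For readers unfamiliar with 2-pass copying, here is a portable sketch of the loop structure: each block is first read into a small cached bounce buffer and only then written to the destination. It assumes the 1K BLOCK_SIZE and 64B step mentioned above; `copy_chunks64()` and `copy_two_pass()` are hypothetical names, and the plain memcpy() calls stand in for the VFP/STMIA assembly in the patches, so this shows the shape of the algorithm and not its performance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 1024  /* bounce-buffer size, as in the comment above */
#define CHUNK_SIZE 64    /* bytes moved per inner-loop iteration */

/* Portable stand-in for the VFP inner loop: moves CHUNK_SIZE bytes per
 * iteration. len must be a multiple of CHUNK_SIZE. */
static void copy_chunks64(uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(len % CHUNK_SIZE == 0);
    for (size_t off = 0; off < len; off += CHUNK_SIZE)
        memcpy(dst + off, src + off, CHUNK_SIZE);
}

/* 2-pass copy: pass 1 reads one block from src into the bounce buffer,
 * pass 2 writes that block out to dst. */
static void copy_two_pass(uint8_t *dst, const uint8_t *src, size_t len)
{
    static uint8_t bounce_buf[BLOCK_SIZE];

    while (len > 0) {
        size_t n = len < BLOCK_SIZE ? len : BLOCK_SIZE;
        size_t bulk = n - (n % CHUNK_SIZE);

        copy_chunks64(bounce_buf, src, bulk);             /* pass 1, bulk */
        memcpy(bounce_buf + bulk, src + bulk, n - bulk);  /* pass 1, tail */
        memcpy(dst, bounce_buf, n);                       /* pass 2 */

        src += n;
        dst += n;
        len -= n;
    }
}
```

A smaller block keeps the bounce buffer resident in the data cache between the two passes, which is one plausible reason why 1K behaves better than 4K here.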

digetx commented 5 years ago

Added __thread annotation to bounce_buf. It is not really necessary for a single-threaded Xorg, but now those vfp-copy functions should be thread-safe.
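A minimal sketch of what the annotation does, assuming a GCC/Clang toolchain (`__thread` is a compiler extension). `stage_block()` is a hypothetical helper and the buffer size is an assumption; only the `bounce_buf` name and the `__thread` keyword come from the comment above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 1024  /* assumed size, for illustration only */

/* With __thread, every thread gets its own private instance of the buffer,
 * so concurrent copies cannot overwrite each other's staged data. */
static __thread uint8_t bounce_buf[BLOCK_SIZE];

/* Hypothetical helper: stages up to BLOCK_SIZE bytes in the calling
 * thread's bounce buffer and returns a pointer to it. */
static const uint8_t *stage_block(const uint8_t *src, size_t len)
{
    if (len > BLOCK_SIZE)
        len = BLOCK_SIZE;
    memcpy(bounce_buf, src, len);
    return bounce_buf;
}
```

Without the annotation, a plain `static` buffer would be shared by all threads, and two simultaneous 2-pass copies could corrupt each other.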

digetx commented 5 years ago

@kusma Re-added tegra_memcpy_vfp_unaligned() and added an EXA DownloadFromScreen implementation.
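To illustrate why an unaligned variant is useful for DownloadFromScreen: the hook copies a rectangle at (x, y) out of a pitched surface, so the first byte of each row is generally not aligned. Below is a rough sketch using plain C types instead of the Xorg ones; `download_rect()` and its parameters are hypothetical, and memcpy() stands in for the driver's accelerated copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the driver's function: copies a w x h pixel
 * rectangle starting at (x, y) from a pitched source surface into dst. */
static void download_rect(const uint8_t *src_base, size_t src_pitch,
                          size_t bytes_per_pixel,
                          size_t x, size_t y, size_t w, size_t h,
                          uint8_t *dst, size_t dst_pitch)
{
    /* The x offset is what makes the row start arbitrarily aligned. */
    const uint8_t *src = src_base + y * src_pitch + x * bytes_per_pixel;
    size_t row_bytes = w * bytes_per_pixel;

    for (size_t row = 0; row < h; row++) {
        /* Stand-in for the accelerated (unaligned-capable) copy. */
        memcpy(dst, src, row_bytes);
        src += src_pitch;
        dst += dst_pitch;
    }
}
```

For example, downloading a 3x2 rectangle at (2, 1) from an 8-pixel-wide, 1-byte-per-pixel surface reads rows starting at byte offsets 10 and 18.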

digetx commented 5 years ago

@kusma Any more comments? I've been testing (using) opentegra with these patches included for a week without any problems.

grate-driver / xf86-video-opentegra

Accelerated memcpy #42