FillRect could be accelerated

hglm commented 11 years ago

It appears the kernel G2D driver offers Fill Rectangle acceleration. This is currently not used by sunxifb. Implementing this, if it can work, should improve performance in X, especially with respect to lowering CPU utilization when large areas are filled.

I realize that software fill is probably not much slower, but it's the CPU utilization where gains can be made.

ssvb commented 11 years ago

One challenge here is that the performance of G2D is fillrate limited. It is currently clocked at 1/2 of the memory clock speed, which means 240MHz on cubieboard (which runs memory at 480MHz). The performance limit is one pixel per cycle (or 240 million pixels per second total). This means that G2D can only utilize ~960MB/s of memory bandwidth for 32bpp and just 480MB/s for 16bpp. For comparison, the CPU can easily use ~1400MB/s of memory bandwidth for a fill operation. A primitive fill operation, which does not involve a lot of computation and does not do many memory accesses per pixel, is not the best workload for G2D. The CPU is much faster than G2D for fills if we only consider wall clock time. You can also try the https://github.com/ssvb/xf86-video-sunxifb/blob/master/test/sunxi_g2d_bench.c test program to get some numbers.
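The bandwidth ceiling quoted above is plain arithmetic: one pixel per G2D clock, multiplied by the bytes per pixel. A minimal C sketch of that calculation follows. The function name `g2d_peak_fill_mb_per_s` is made up for illustration and is not part of the driver; the half-memory-clock ratio is taken from the comment above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not driver code): peak fill bandwidth in MB/s for
 * an engine that writes at most one pixel per clock cycle. */
uint32_t g2d_peak_fill_mb_per_s(uint32_t dram_clk_mhz, uint32_t bpp)
{
    assert(bpp == 16 || bpp == 32);

    /* G2D is described above as running at half the memory clock. */
    uint32_t g2d_clk_mhz = dram_clk_mhz / 2;

    /* Megapixels per second times bytes per pixel. */
    return g2d_clk_mhz * (bpp / 8);
}
```

With a 480MHz memory clock this gives 960 for 32bpp and 480 for 16bpp, which matches the figures above. The 16bpp case is the reason G2D looks so much worse there: the pixel rate is the same, but each pixel moves half as many bytes.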

Another challenge is that G2D works only with physically contiguous memory. The framebuffer is physically contiguous, but the offscreen pixmaps are allocated in normal cached memory (this makes a lot of sense when they are primarily accessed by just the CPU). So right now G2D can potentially only do fills that go directly to windows on screen (or to the root window). The practical use for it is probably not so significant. Maybe moving windows on top of a solid background and rendering the exposed parts of this solid background?

To sum it up: there are some drawbacks, and it's not a clear win. But if you have some patches and benchmark results demonstrating practical usefulness of G2D fill, then they are very much welcome. That said, this issue still can/should be revisited when we have better G2D support in the kernel.

hglm commented 11 years ago

I can see the drawbacks now. Fills may indeed more commonly go to off-screen pixmaps. Still, at 1920x1080x32bpp, the fillrate difference between G2D and CPU using sunxi_g2d_bench is minimal:

A10, dram_clk = 408
Fill type:              G2D fill        pixman fill
1920x1080x32bpp         526 MB/s        548 MB/s
1280x720x32bpp          776 MB/s        1132 MB/s

So at 1920x1080x32bpp, there might be a benefit to using G2D fills. Especially in the sense of relieving the CPU and allowing background processes to continue while the fill takes place. I may try to experiment with this.

ssvb commented 11 years ago

> Still, at 1920x1080x32bpp, the fillrate difference between G2D and CPU using sunxi_g2d_bench is minimal

There is actually one more interesting thing. The memory performance does not decrease gradually, but changes abruptly at certain points. If your monitor can handle it, you can set a 50Hz refresh rate to save some bandwidth (add "disp.screen0_output_mode=1920x1080p50" to the kernel command line). The next thing is to increase the memory clock frequency a bit in u-boot. Going just from 408 to 432 for the memory clock frequency improves CPU fill speed from ~552 MB/s to ~819 MB/s. The improvement for memory copy is not so dramatic, but also noticeable.

I suspect that it might be something like getting an extra cycle of penalty somewhere in the memory subsystem. So a minor change in screen refresh rate or memory clock frequency may change overall desktop responsiveness a great deal.

hglm commented 11 years ago

I noticed that too. I did some benchmarks with tinymembench that I posted on the wiki (Optimizing system performance). They show a similar drop-off at 1920x1080x32bpp, with a memory clock increase from 360 to 408 MHz not helping much (while it helped a lot in lower resolution modes). I guess I should try running at 432 MHz.

ssvb commented 11 years ago

I sent a post to the mailing list with mostly the same information some time ago: https://groups.google.com/d/msg/linux-sunxi/0pGua9gzZTQ/VZN3jHo5Ss4J :) But it was really a good idea to put all the tips and tricks on http://linux-sunxi.org/Optimizing_system_performance Thanks!

hglm commented 11 years ago

I experimented with accelerated FillRect. 16-bit G2D is indeed much slower than software fill, because the pixel fill-rate of G2D is keeping it down. However, at 32-bit color, on a loaded system, there could really be some benefit. When there are background processes fully loading the CPU, the very small CPU utilization of G2D fill both makes the fill operation faster and gives the background process more CPU time (roughly double in the case of a load of 1). At higher loads, the benefit should increase.

On an unloaded system, running x11perf shows:

                  rect10   rect100   rect500
G2D               987000   18300     775
G2D with FillRect 863000   12200     695

With kernel compile in background:

                  rect10   rect100   rect500
G2D               448000    8990     381
G2D with FillRect 459000   10700     640

Timing a single-threaded CPU benchmark with x11perf -rect500 running concurrently:

                         user  real
G2D                      20.2  40.6
G2D FillRect             21.1  22.4

ssvb commented 11 years ago

16-bit fill operations can be partially emulated using 32-bit fills, however we might need to additionally separately process 1-pixel wide leftmost and 1-pixel wide rightmost columns in unaligned cases. Which makes up to 3 ioctls instead of 1, and introduces the hassle of adding extra heuristics to decide when this optimization is beneficial or not.
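To make the splitting concrete, here is a rough C sketch of how a 16bpp fill could be divided into an optional 1-pixel left column, a 32bpp middle part, and an optional 1-pixel right column. It is only an illustration: `fill16_split` and `split_fill16` are hypothetical names, and it assumes that every scanline starts on a 4-byte boundary, so that an even 16bpp x coordinate is 32-bit aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of the (up to) three pieces of a 16bpp fill. */
typedef struct {
    int has_left;   /* 1 if an unaligned leftmost column must be filled at 16bpp */
    int left_x;     /* x of that column, in 16bpp pixels */
    int mid_x32;    /* start of the middle part, in 32bpp pixels */
    int mid_w32;    /* width of the middle part, in 32bpp pixels (may be 0) */
    int has_right;  /* 1 if an unaligned rightmost column must be filled at 16bpp */
    int right_x;    /* x of that column, in 16bpp pixels */
} fill16_split;

/* Split the horizontal span [x, x + w) of a 16bpp fill. Assumes each
 * scanline starts 4-byte aligned, so even x means 32-bit alignment. */
fill16_split split_fill16(int x, int w)
{
    fill16_split s = {0};
    assert(x >= 0 && w > 0);

    int end = x + w;               /* one past the last pixel */
    int mid_start = (x + 1) & ~1;  /* round x up to an even pixel */
    int mid_end = end & ~1;        /* round end down to an even pixel */

    s.has_left = x & 1;
    s.left_x = x;
    s.has_right = end & 1;
    s.right_x = end - 1;
    s.mid_x32 = mid_start / 2;
    s.mid_w32 = (mid_end - mid_start) / 2;
    return s;
}

/* A 32bpp fill value that writes the same 16-bit color into both halves. */
uint32_t fill16_color_as_32(uint16_t color)
{
    return ((uint32_t)color << 16) | color;
}
```

Each non-empty piece would cost one fill ioctl, so the count ranges from 1 (fully aligned) to 3 (both edges unaligned). For narrow rectangles the middle part is small or empty, which is where the extra heuristics mentioned above would come in.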

ssvb commented 11 years ago

I'm still not totally convinced with x11perf alone and would like to see a more realistic use case, justifying the optimization of fills with G2D.

The current code is more like a placeholder. We need a better kernel driver to make G2D really useful for more advanced things.

ssvb commented 11 years ago

Also it would be a good idea to have a more real time conversation on #linux-sunxi irc :)

hglm commented 11 years ago

> 16-bit fill operations can be partially emulated using 32-bit fills, however we might need to additionally separately process 1-pixel wide leftmost and 1-pixel wide rightmost columns in unaligned cases. Which makes up to 3 ioctls instead of 1, and introduces the hassle of adding extra heuristics to decide when this optimization is beneficial or not.

Interesting, I didn't realize this was possible. I might try it.

> I'm still not totally convinced with x11perf alone and would like to see a more realistic use case, justifying the optimization of fills with G2D.

Yeah, x11perf is extreme and not typical of normal usage. However, on a loaded system, G2D FillRect should be beneficial whichever way you look at it, reducing all kinds of kernel and CPU cache related penalties that come with running two CPU-burning processes at the same time. The actual number of FillRect calls made by applications may not be that high, and they are not as influential as blitting for scrolling or dragging windows, but it is an optimization. I'll try to think of a way to construct a realistic use case, but there are not many good X benchmarks.

Maybe it is possible to use the current CPU load as an extra heuristic for choosing G2D or CPU FillRect. A bit far-fetched, but possible.

> The current code is more like a placeholder. We need a better kernel driver to make G2D really useful for more advanced things.

I thought G2D can only blit and fill - what more advanced things are possible? The kernel driver is simple but it seems the kernel handles the sleep on IRQ fairly well.

hglm commented 11 years ago

I have released a patch for testing/evaluation. It is available at https://github.com/hglm/patches

1. Implement an area threshold for using G2D blits in sunxi_disp.c,
   currently set to 200.
2. In xCopyNToN, it is guaranteed that alu == GXcopy and planemask
   == FB_ALL_ONES, so no need to check for them.
3. Implement xPolyFillRect. Only the case of a drawable window, fill
   style of FillSolid, effective alu type of GXcopy, and plane mask
   of all ones is accelerated with G2D. Only uses G2D for fills of
   1000 pixels or larger in area (5000 for 16bpp). Two implementations
   of PolyFillRect are provided, one based on fbPolyFillRect and the
   other on exaPolyFillRect.
4. Add support for 16-bit fill to sunxi_disp.c and provide a 16-bit
   fill_in_three function to fill in three segments using 32-bit
   format for the middle segment.
5. Add the hook for xPolyFillRect to xCreateGC.
6. Add use_G2D flag to SunxiG2D struct that indicates whether G2D
   acceleration is enabled. This flag is used to determine whether
   G2D FillRect should be used.
7. Implement "double speed" 16bpp blits. When source and
   destination coordinates allow it, the blit is divided into up
   to three segments with the aligned middle segment being copied in
   32bpp mode.
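The area thresholds in item 3, together with the use_G2D flag from item 6, amount to a small decision function. The following C sketch shows one possible shape for it; `should_use_g2d_fill` and the constant names are hypothetical and not taken from the patch, and the other conditions listed in item 3 (window drawable, FillSolid, GXcopy, plane mask of all ones) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Thresholds as described in the list above; the names are made up. */
enum {
    G2D_FILL_MIN_AREA_32BPP = 1000,
    G2D_FILL_MIN_AREA_16BPP = 5000
};

/* Hypothetical check: is the rectangle big enough to be worth a G2D ioctl? */
bool should_use_g2d_fill(bool use_g2d, int bpp, int width, int height)
{
    assert(width > 0 && height > 0);

    if (!use_g2d)
        return false;   /* G2D acceleration is disabled or unavailable */

    long area = (long)width * height;
    long min_area = (bpp == 16) ? G2D_FILL_MIN_AREA_16BPP
                                : G2D_FILL_MIN_AREA_32BPP;
    return area >= min_area;
}
```

Small rectangles stay on the CPU because the fixed cost of the ioctl and of waiting for the G2D interrupt outweighs the fill itself; the higher 16bpp threshold reflects the lower G2D bandwidth at that depth.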

ssvb commented 11 years ago

> I thought G2D can only blit and fill - what more advanced things are possible?

Basically scaling, rotation, and conversion between formats are supported. Alpha blending is a bit of a challenge, because there is unfortunately no direct premultiplied alpha support for doing it in one pass. You can check the documentation (Mixer Processor section): http://free-electrons.com/~maxime/pub/datasheet/A10%20User%20manual%20V1.20%2020120409.pdf

ssvb commented 11 years ago

> I have released a patch for testing/evaluation. It is available at https://github.com/hglm/patches

Thanks. It would be best to actually fork the xf86-video-sunxifb repository, create a git branch for your changes, and split the big patch into smaller logically independent parts. Some good practices are described here: https://www.kernel.org/pub/software/scm/git/docs/user-manual.html#patch-series

Also I'm trying to organize the code in such a way that it works on any hardware (using acceleration based on what features are available). For example, Allwinner A13 does not have G2D, so only NEON works there. Moreover, the same driver works on the Samsung Exynos based ODROID-X board (with support for Mali DRI2 GLES acceleration, but no layers or hardware cursor). And nothing prevents it from running on x86 systems, where it works in exactly the same way as xf86-video-fbdev.

hglm commented 11 years ago

Thanks for the suggestion, I'll check out creating a git branch.

My code is not fully tested yet, and I've eliminated a few bugs along the way. Splitting the patch into logical parts should make it easier to manage.

I also noticed that the driver is in principle device-independent, which I sort-of skipped over in my patch. It should be possible to make the G2D functions optional/generalized and compile the core driver without needing sunxi_disp.

rzk commented 11 years ago

Some offtopic about Exynos:

> but no layers or hardware cursor

I believe in kernel 3.8 that became available for X2/U2 we can use the mainlined s5p-tv driver layering system. https://github.com/hardkernel/linux/blob/odroid-3.8.y/drivers/media/platform/s5p-tv/mixer_grp_layer.c#L234 https://github.com/hardkernel/linux/blob/odroid-3.8.y/drivers/media/platform/s5p-tv/mixer_vp_layer.c#L205

Even if the driver itself can't provide the layering, it uses the also-mainlined videobuf2 v4l2 system, which can report at least some of what is needed for direct rendering to the framebuffers. For example, @mdrjr from hardkernel added the ioctls needed for UMP to the vb2 framework.

ssvb commented 11 years ago

@rzk unfortunately there does not seem to be any public documentation about the layering system hardware in Exynos :(

It's probably more interesting to add some basic blit/fill acceleration for Raspberry Pi using DMA to fix "X11 struggles to get to 10fps just moving an unscaled, opaque, window" problem described in http://fooishbar.org/tell-me-about/wayland-on-raspberry-pi/ :) Or maybe just leave it alone and let them have their Wayland fun.

hglm commented 11 years ago

I've put up some patches at https://github.com/hglm/patches/sunxifb

The X driver FillRect patches are left out for now so that sunxi_x_g2d.c remains device independent: they are the only feature that requires changes to the device-independent structures.

I've separated the patches so that they only apply to either sunxi_disp or sunxi_x_g2d. For example, one patch extends the low level fill primitives in sunxi_disp.c, and another adds the double speed 16bpp blit. There's also a PutImage patch for sunxi_x_g2d.c that should be device independent.

ssvb commented 10 years ago

BTW, it appears that this kind of acceleration could actually be useful for the color key filling done in libvdpau-sunxi by @jemk. Though it would still be preferable to avoid wasting memory bandwidth by eliminating this operation altogether.

ssvb / xf86-video-fbturbo

FillRect could be accelerated #8