raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

dispmanx stability / apparent 64 element limit #84

Closed. hermanhermitage closed this issue 8 years ago.

hermanhermitage commented 12 years ago

Observed:

Latest firmware and foundation raspbian image.

Expected:

popcornmix commented 12 years ago

I've had a look. The limit looks like 127 to me. When that is reached, DISPMANX_INVALID = -1 is returned.

What I'm finding is that when all dispmanx elements are allocated (on the GPU), the element_remove call fails (because it allocates an element when removing...) and it's not possible to remove the existing elements.

I'll report this.

hermanhermitage commented 12 years ago

Confirmed: -1 is returned once the element list/table is full (128 items).

However, once the element list/table holds about 64+ items, the items remain on screen when I remove them. So with more than 64 items added, the system becomes unreliable (added elements cannot be removed).

Additional Q's:

The main purpose behind this madness is to make some simple demonstration programs for the younger generation who may have missed character-mapped, tiled, raster and sprite systems - e.g. think of a simple 2D game engine without any need for rasterization routines.

popcornmix commented 12 years ago

(I've got a fix for original bug reported - will be in next start.elf pushed out)

The 128 element limit is purely a define in software. However, the HVS (hardware video scaler) runs in an "on-the-fly" composition mode, and you will find that when the amount of data that has to be fetched on a scanline gets too high, you will break it (artifacts/loss of sync caused by FIFO underrun). The worst case is all elements intersecting the same horizontal scan line. It is also worse when resizing down (i.e. more source data needs to be read per destination line).

How many elements do you want? I could make a test build with a higher number set.

You may be better off rendering to an offscreen buffer, and adding the offscreen buffer to the display. I believe you have the API for that. That removes the "on-the-fly" realtime constraint and should work with more elements.
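
A minimal sketch of that offscreen approach (not from this thread - it just assumes the standard /opt/vc dispmanx calls; the 640x480 RGB565 buffer, layer number and build flags are only illustrative):

```c
/* Sketch only: compose everything into one offscreen resource, then put
 * that single resource on the display as one element.  Uses the standard
 * /opt/vc dispmanx calls; size, format and layer below are arbitrary.
 * Typical build: gcc -I/opt/vc/include -L/opt/vc/lib -lbcm_host ...
 */
#include "bcm_host.h"

int main(void)
{
    bcm_host_init();

    DISPMANX_DISPLAY_HANDLE_T display = vc_dispmanx_display_open(0);

    int width = 640, height = 480, pitch = width * 2;   /* RGB565 */
    uint32_t native_image_handle;
    DISPMANX_RESOURCE_HANDLE_T res =
        vc_dispmanx_resource_create(VC_IMAGE_RGB565, width, height,
                                    &native_image_handle);

    /* Draw all the "sprites"/tiles into this ARM-side buffer, then upload
     * it (or just the dirty rectangle) to the resource. */
    static uint16_t pixels[480][640];
    VC_RECT_T rect;
    vc_dispmanx_rect_set(&rect, 0, 0, width, height);
    vc_dispmanx_resource_write_data(res, VC_IMAGE_RGB565, pitch, pixels, &rect);

    /* One element shows the whole offscreen buffer.  Note the source
     * rectangle is in 16.16 fixed point. */
    VC_RECT_T src_rect, dst_rect;
    vc_dispmanx_rect_set(&src_rect, 0, 0, width << 16, height << 16);
    vc_dispmanx_rect_set(&dst_rect, 0, 0, width, height);

    DISPMANX_UPDATE_HANDLE_T update = vc_dispmanx_update_start(0);
    vc_dispmanx_element_add(update, display, 1 /* layer */, &dst_rect, res,
                            &src_rect, DISPMANX_PROTECTION_NONE,
                            NULL /* alpha */, NULL /* clamp */,
                            DISPMANX_NO_ROTATE);
    vc_dispmanx_update_submit_sync(update);
    return 0;
}
```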

Palettised formats are supported by the composition hardware (you can fbset -depth 8). It's not widely used, so you may not have the API to set palettes per resource.

I don't think you can adjust scaling per scanline in hardware. Certainly not through any exposed API.

hermanhermitage commented 12 years ago

Yes rendering offscreen is possible, but am interested in pushing the compositor to the limit for a bit of fun.

A caveat-emptor mode (where the user adheres to the scan line limits) allowing 1024-4096+ is where it gets interesting for simulating character/tiled mapped displays.

(Depending on how much overhead there is in updating element positions and associated resource).

hermanhermitage commented 12 years ago

So... the 1024+ request was serious. Would like to test it if possible. Doing a simple "raspberry pi" recipe on how to use dispmanx for 2d style video games - will post to forum & github.

popcornmix commented 12 years ago

Okay, not tested, but the #define is increased to 1024. https://dl.dropbox.com/u/3669512/temp/start_dispmanx1024.elf

This also has a fix from the dispmanx guy for the original problem reported. Although it fixed the problem reported I think something else was unhappy when the allocation failed. You may avoid the problem with the higher number of dispmanx elements in this build.

hermanhermitage commented 12 years ago

Ok thanks for that. Tested. Up to 128 elements works ok. Beyond that I think there is a software or hardware issue - definitely at least a software issue, because after I remove elements they can sometimes remain, even after my process has ended.

With more than 128 elements (I tested 255 and 511 - because the fb is taking 1) dispmanx reports adding the additional elements without error, but only displays the first 128.

On removing all elements, elements 1 to 128 disappear and the 129th to 256th elements appear to take their place (even though dispmanx reported removing them ok). This is similar to what was happening with the old version after adding the 65th element.

ie. Add elements e1, ..., e255 (all return codes ok). Elements e1...e128 appear. Remove elements e1, ..., e255 (all return codes ok). Elements e129...e255 appear.

I experimented with layer priorities in case there was a collision/hashing problem in the dispmanx VideoCore software.

The scene setup is 100x100 pixel elements (all sharing the same bitmap), spaced 10 pixels apart - so 32 elements per row across the screen. No overlap. No alpha.
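
For reference, the shape of the test is roughly this (a compressed sketch, not the actual test code; display and resource come from the usual bcm_host_init()/vc_dispmanx_display_open()/vc_dispmanx_resource_create() setup, and the sizes just mirror the description above):

```c
#include <stdio.h>
#include "bcm_host.h"

/* Sketch of the add-then-remove probe: add n elements that all share one
 * resource, submit, then remove them all and see what is still on screen. */
static void probe(DISPMANX_DISPLAY_HANDLE_T display,
                  DISPMANX_RESOURCE_HANDLE_T resource, int n)
{
    static DISPMANX_ELEMENT_HANDLE_T elems[1024];
    VC_RECT_T src, dst;

    DISPMANX_UPDATE_HANDLE_T update = vc_dispmanx_update_start(0);
    for (int i = 0; i < n; i++) {
        vc_dispmanx_rect_set(&src, 0, 0, 100 << 16, 100 << 16); /* 16.16 fixed point */
        vc_dispmanx_rect_set(&dst, (i % 32) * 110, (i / 32) * 110, 100, 100);
        elems[i] = vc_dispmanx_element_add(update, display, 0 /* layer */,
                                           &dst, resource, &src,
                                           DISPMANX_PROTECTION_NONE,
                                           NULL, NULL, DISPMANX_NO_ROTATE);
        if (!elems[i])
            printf("add failed at element %d\n", i);
    }
    vc_dispmanx_update_submit_sync(update);

    /* Now remove everything again.  With the old firmware the removals
     * report success, yet elements beyond a point stay on screen. */
    update = vc_dispmanx_update_start(0);
    for (int i = 0; i < n; i++)
        if (vc_dispmanx_element_remove(update, elems[i]) != 0)
            printf("remove failed at element %d\n", i);
    vc_dispmanx_update_submit_sync(update);
}
```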

I suspect there is another "128" entry limit not attached to the define you changed.

Thanks.

popcornmix commented 12 years ago

I've updated start_dispmanx1024.elf with a possible fix for 128 limit.

hermanhermitage commented 12 years ago

Ok thanks for that. Just tested it. It still only displays the first 128 - but the good news is the extra ones don't pop up. They get a valid id returned, but never appear.

I wonder if there is a 128 intrinsic limit elsewhere... either by accident or design... accidental signed char somewhere?

popcornmix commented 12 years ago

If you can make a test program available to me I can dig into what's happening. (Unfortunately the guy responsible for dispmanx left last Friday...)

hermanhermitage commented 12 years ago

Thanks. Check out https://gist.github.com/3660570 (a replacement for dispmanx.c in /opt/vc/src/...).

Edit: Also interested in whether the magnification filter can be set (it seems to use bilinear by default - is there a point-sampled version?). And dispmanx limitations aside, is there a list of color formats supported by the hardware scaler/compositor?

Edit: Ok, I have 4bpp and 8bpp working. I guess there is one shared palette? (Currently set by the /dev/fb stuff in Linux.)

popcornmix commented 12 years ago

I did fix some more limits in the code (in the actual HVS driver, rather than dispmanx). I could get 200 elements working, but with 250 I was running out of memory in the HVS itself (the HVS has its own block of RAM for control lists). The elements in the control list are variably sized (e.g. a unity-scaled element is smaller than a scaled one), so it's not immediately obvious what the limit is, but 128 seems to have been chosen as a safe limit.

Now I think this memory is partitioned per display (e.g. TV, LCD, and offscreen), so it is possible we can reclaim the LCD part, but I'll have to talk to someone about that. My guess is that 1024 may be impossible.

However, the GPU is quite capable of efficient sprite blits, including transparency/alpha and even arbitrary rotation/scaling. We are missing a nice API for that, though - adding one is a possibility.

hermanhermitage commented 12 years ago

Ok, interesting. Depending on the partitioning it sounds like up to 500-600 might be possible? So the hardware supports driving three independent outputs simultaneously?

The 2d could possibly be very useful for X11/EXA.

popcornmix commented 12 years ago

The hardware has 3 channels: normally an LCD, a TV (either HDMI or composite, not both) and a memory channel (for offscreen composition, or transposing). We have driven additional LCDs that are non-real-time (e.g. that have their own framestore, through SMI).

We'd love to use the GPU for X acceleration, but standard X requires very frequent syncs, which means latency is the most important thing - so it's not ideal for offloading to a distant processor.

Suppose the GPU can do an arbitrary amount of work for the ARM, but it always takes 1ms. Will that make X go faster than the ARM doing the work itself?

hermanhermitage commented 12 years ago

How distant? The fastest hardware designs I know would have the guest writing commands/cache-line-size bursts out directly to a queue on the host. The queue is in turn mapped as a GPU register bank (sliding window). So a command dispatch can be fewer than 10 cycles away.

1ms sounds like a message send to a software thread; that might work for a coarse-grained scene render call, but not for fine-grained primitives.

Is a fast path into the VideoCore possible?

(Edit: I'm not thinking of using the tile-based 3D hardware, because with Morton-order (or similar) texture formats and a deep pipeline it's really only suited to large batches.)

popcornmix commented 12 years ago

The blitting would be done by the vector core of the CPU (which can't be controlled by the ARM) so you are talking about the cost of a mailbox message, an interrupt on the GPU and the ARM and whatever task switches they invoke.

The mailbox/property interface takes about 350us to send a set_clock_rate message and get a response back. I believe the clock changing is a small fraction of that time. I believe the majority of the time is taken by ARM side context switches.

The GPU is only going to be useful if latency can be tolerated.

hermanhermitage commented 12 years ago

Well, I'm confident the current latency is one to two orders of magnitude above what is achievable with a different set of objectives from those that resulted in the current design.

The usual ways of tolerating latency in the driver should work: batching, parallel issue of primitives, and relying on a low frequency of sync operations (which require a round trip).

As a first step, until the mailbox latency is fixed, I'd set up a ring buffer in L2/SDRAM and only use mailbox messages to signal a few conditions (buffer empty/idle, buffer now half empty, buffer full, sync completed), spinning and busy-waiting on the ring buffer pointers as appropriate to trade cycles for better utilization and graphics primitive throughput.
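
To make the ring-buffer idea concrete, here is a purely hypothetical sketch of the ARM-side producer - nothing like this interface exists in the current firmware, and the structure, sizes and doorbell policy are all invented for illustration:

```c
#include <stdint.h>

#define RING_WORDS 4096                    /* power of two */

struct cmd_ring {
    volatile uint32_t head;                /* free-running, written by ARM */
    volatile uint32_t tail;                /* free-running, written by GPU */
    uint32_t          buf[RING_WORDS];     /* command words                */
};

/* Queue n command words; spin (busy-wait) when the ring is full, trading
 * ARM cycles for latency instead of taking a mailbox round trip. */
static void ring_put(struct cmd_ring *r, const uint32_t *words, uint32_t n)
{
    while (r->head - r->tail + n > RING_WORDS)
        ;                                   /* ring full: spin */

    for (uint32_t i = 0; i < n; i++)
        r->buf[(r->head + i) & (RING_WORDS - 1)] = words[i];

    __sync_synchronize();                   /* publish payload before the head */
    r->head += n;

    /* A mailbox message/doorbell would only be sent on the edge conditions
     * mentioned above (e.g. consumer gone idle), not per primitive. */
}
```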

I guess it comes down to how quickly the GPU can respond. Some contemporary designs have a dedicated hardware thread for handling the sync/async command queues from the host. There they pretty much have a loop:

next:
  sleep
  mov r0, Q_COMMAND_WORD ; read command
  tbw r0
...

fillrect: ;(r0:x, r1:y, r2:width, r3:height, r4:color)
  load-multiple r0-r5, Q_COMMAND_PARAMS
  ...fill rectangle code...
  bra next 

Ideally the ringbuffer would be mapped onto a shared fifo or dual ported memory as using L2/SDRAM for this is just so wrong! :)

I think it's eminently possible, but because of the architecture of X11 it's pretty much an all-or-nothing proposition.

hermanhermitage commented 12 years ago

Of course the madman in me says you could always run a full XServer on VideoCore and thus expose X11 acceleration to all operating systems... could be a reasonable project for an intern.

hermanhermitage commented 12 years ago

(Let me move latency to a separate issue - I have to dig into it myself as I don't have a handle on the Linux side.) Still open:

I wish I could help, if only it was all open source :-)

Thanks for your help, I will publish a dispmanx tutorial for beginners shortly.

popcornmix commented 12 years ago

When a userland process dies, all VCHIQ services receive a close message (on the GPU side) which allows them to release resources. Generally an exit without freeing resources, a Ctrl-C, or a seg fault doesn't leak resources.

Send me a program that doesn't shut down and I'll look into it.

The HVS supports nearest neighbour sampling. Not sure if that API is exposed. The HVS can support more than one palette, but each takes quite a bit of space in the limited HVS memory. I think the software only supports 1 palette currently. Not sure what you mean by "iterating current layers"?

hermanhermitage commented 12 years ago

Iterating... sorry, I should have said "enumerating", i.e. discovering the layers of other processes. Most likely it's by design: dispmanx is the lowest layer, so it would make sense to write a window manager on top of it which other processes call, rather than having them call dispmanx directly.

popcornmix commented 11 years ago

I've pushed a new firmware that makes dispmanx more robust against too-many-elements issues. It also adds an option, dispmanx_offline=1 (in config.txt), that allocates an offscreen buffer and automatically switches to it when the scene is too complex to handle with "on-the-fly" composition. This should allow more graceful degradation when you get too much complexity on a given scanline, or you exhaust the HVS's context memory.
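
i.e. one line in /boot/config.txt:

```
# /boot/config.txt
dispmanx_offline=1
```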

The official firmware still has 128 elements as the limit, but for you I've built a 1024 element version. https://dl.dropbox.com/u/3669512/temp/start_dispmanx.elf

Note: you are limited to 1024 element operations in an update, so you cannot remove all 1024 elements and re-add them in one update (you could do this with 512 of them). However, using vc_dispmanx_element_change_attributes should allow you to update all elements.

(and I think due to how the linked lists work, one of your elements is unavailable, so you may be limited to one less than you expect).
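
In case it helps, a sketch of what a change_attributes-based update loop might look like (elems[] and n as set up earlier; the ELEMENT_CHANGE_* flag value below is the one commonly copied into applications, since the public header has not always defined it, so treat it as an assumption and check your headers):

```c
/* Fallback definition: commonly used value, but verify against your headers. */
#ifndef ELEMENT_CHANGE_OPACITY
#define ELEMENT_CHANGE_OPACITY (1 << 1)
#endif

/* Change only the opacity of every element in one update, instead of
 * removing and re-adding them. */
DISPMANX_UPDATE_HANDLE_T update = vc_dispmanx_update_start(0);
for (int i = 0; i < n; i++) {
    uint8_t opacity = (uint8_t)((i * 255) / n);           /* illustrative value  */
    vc_dispmanx_element_change_attributes(update, elems[i],
                                          ELEMENT_CHANGE_OPACITY,
                                          0,              /* layer (ignored)     */
                                          opacity,
                                          NULL,           /* dest rect unchanged */
                                          NULL,           /* src rect unchanged  */
                                          0,              /* mask resource       */
                                          DISPMANX_NO_ROTATE);
}
vc_dispmanx_update_submit_sync(update);
```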

hermanhermitage commented 11 years ago

Thanks.

I tested the 1024 element version. It seems the time to perform an update (of even a single surface) depends on the total number of elements in the display list, roughly as:

time = 0.5ms * totalElementCount

I timed the following:

| Total elements | Time to update (seconds) |
| --- | --- |
| 26 | 0.016 |
| 52 | 0.033 |
| 100 | 0.050 |
| 200 | 0.100 |
| 300 | 0.150 |
| 400 | 0.200 |
| 500 | 0.250 |
| 600 | 0.300 |
| 700 | 0.350 |
| 800 | 0.417 |
| 900 | 0.467 |
| 1000 | 0.534 |

I was testing by just updating the opacity/transparency of the surfaces. I tried updating both all surfaces and just a handful (slightly lower latency).

I'm guessing vc_dispmanx_update_submit_sync() is causing a rebuild of an internal display list (like a copper list update) and for some reason this is O(n) with a high constant of 0.5ms per element.
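
The timing harness behind those numbers is nothing more than wall-clocking one update (a sketch; the change loop is whatever subset of elements you touch):

```c
#include <stdio.h>
#include <time.h>

struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);

DISPMANX_UPDATE_HANDLE_T update = vc_dispmanx_update_start(0);
/* ... vc_dispmanx_element_change_attributes() on a few (or all) elements ... */
vc_dispmanx_update_submit_sync(update);

clock_gettime(CLOCK_MONOTONIC, &t1);
double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
printf("one update took %.3f s\n", secs);
```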

That puts the kibosh on using the current dispmanx implementation for dynamically moving about a large number of surfaces. Not sure if this will be an issue for a 60Hz Wayland implementation with more than 26 surfaces on screen.

One anomaly - during the 900-element test, I hit Ctrl-C and the application hung (waiting on the VideoCore side?) and never continued.

popcornmix commented 11 years ago

Are these still the 50x50 squares from your example, or something much larger?

The HVS takes somewhere between 1 and 4 pixels per cycle (depending on format/scaling).

Note: if Wayland has 26 full screen surfaces on screen, it is expected to remove completely occluded (and possibly subdivide and remove partially occluded) surfaces. dispmanx/hvs doesn't do that automatically.

hermanhermitage commented 11 years ago

20x20. So it's not the per-pixel overhead, just the list management, that I was looking at. I was just running:

  1. Begin update
  2. change attributes
  3. end update
  4. Goto 1

26 is still more than useful for demonstrating scrolling surfaces, parallax and sprites. I was just surprised at the overhead - but then I'm sure it's mainly been used for subtitles, DVD menus, UI etc. to date. So the use case I was playing with is off...

popcornmix commented 11 years ago

It's probably down to the dispmanx calls being synchronous (due to returning values). I think if I added a void vc_dispmanx_element_change_attributes_async() call that didn't wait for a response, it would be substantially quicker.

hermanhermitage commented 11 years ago

From my measurements, at least 75% of the overhead is on the VideoCore side, rebuilding the list.

i.e. with 1000 display elements, whether I update 10 of them or all 1000 there is only a 25% difference. So I think the bulk of the overhead is a display list rebuild.

e.g. with 1000 elements, an update with 10 changes takes 0.4s, versus 0.5s updating all 1000. That's 12 API calls versus 1002.

  vc_dispmanx_update_start(...)
  vc_dispmanx_element_change_attributes(...)   ...n times...
  vc_dispmanx_update_submit_sync(...)

The other way to kick the overhead (if it were the API, which I don't think it is) would be a vc_dispmanx_element_change_attributesv() call taking an array of updates.

Ruffio commented 9 years ago

@hermanhermitage is this still an issue?

hermanhermitage commented 9 years ago

Good question. I will have to dust off my test program and try it again and then close the issue!

Ruffio commented 8 years ago

@hermanhermitage have you had the time to test it?

Ruffio commented 8 years ago

@popcornmix this issue seems to be stalled...

popcornmix commented 8 years ago

Closing. Feel free to reopen if you have time for more testing.