raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

OpenVG / videocore memory corruption #943

Closed gitbf closed 7 years ago

gitbf commented 9 years ago

Ever since I got my hands on the first R-Pi released, there have been issues with VC memory being corrupted when running OpenVG applications for a long time (hours/days). The application I run starts out by allocating a font (256 glyphs), and the corruption becomes apparent when, after some time, a few glyphs become corrupted ("funny" characters appear on the display).

I do have one Raspbian kernel, however (nightly build 40185a95ac04ffcec406d9e1ef934406d7221939 from a few weeks back, which I'm using with an R-Pi 2), that is dead solid. My theory is that the bug is caused by an uninitialized variable related to VC garbage collection and that this kernel incidentally got it right.

I would love to see someone take on the challenge to eradicate this long-term bug as it is a show-stopper (in terms of using OpenVG) for any non-trivial project.

See the last few posts in this thread for more detail: https://www.raspberrypi.org/forums/viewtopic.php?f=69&t=96899

gitbf commented 9 years ago

I just tried a clean install of the 2015-05-05 Raspbian and this version suffers from the same VC memory corruption bug as described above.

Is anyone more at home with VC debugging than me willing to share a few pointers? The nature of the bug suggests it is related to core VC memory management. Which sources would be candidates to look at in this case?

popcornmix commented 9 years ago

Unfortunately the people who wrote the OpenVG server are gone. As OpenVG is not widely used, it's lower priority than, say, OpenGL or video.

We do have access to the source code, so if there is a straightforward fix, then it may be possible. If you could find a trivial test app that fails in a short time (e.g. just doing the problematic operation in a tight loop) then it's more likely we could fix it.

As far as debugging goes, if you add "start_debug=1" to config.txt, the firmware build that runs will include assert logging. This may generate messages in the debug log (sudo vcdbg log assert) that, if we are lucky, may narrow down the problem.

gitbf commented 9 years ago

Thanks for commenting!

Compliance testing for OpenVG is quite strict (I have seen no issues with the API as such), so it may be that the issue is more generic and applies to other APIs as well. As for the user base, this API is supported by Qt (which is reasonably popular), which exhibits the same issue when run on the Pi. It is also the only HW-accelerated (and quite sophisticated, if I may say so) 2D API available, and it makes for an excellent scholarly introduction to computer graphics.

I believe the bug may have a trivial fix (finding it, however, is not trivial). Some nightly kernel builds (specifically 40185a95ac04ffcec406d9e1ef934406d7221939) work with no issues (one theory is an uninitialized variable that incidentally got the right value for this build).

The structure of a test application would involve creating a number of OpenVG objects (such as font glyphs) during program initialization and then verifying the integrity of these objects against the source data after a series of dynamic operations (such as create/destroy path). I am short on ideas, however, when it comes to verifying the objects (as they exist in VC memory) beyond visual inspection of characters as they appear on screen. A more indirect approach is to check for errors (vgGetError) after every API call, as this is likely to trigger once memory is corrupted. Going from an error return to “why” and “where”, however, may be tricky.
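To make that concrete, here is a minimal sketch of the kind of stress test I have in mind, written against the stock OpenVG 1.1 headers in /opt/vc/include. The dispmanx/EGL context setup is omitted: init_context() is a hypothetical placeholder for the init code from the hello_pi examples, and the glyph outlines are left empty, since only the allocate/churn/check pattern matters here:

#include <stdio.h>
#include <stdlib.h>
#include <VG/openvg.h>

/* Hypothetical placeholder: set up a dispmanx/EGL/OpenVG context,
 * e.g. by reusing the init code from the hello_pi examples. */
extern void init_context(void);

static void check(const char *where)
{
    VGErrorCode err = vgGetError();
    if (err != VG_NO_ERROR) {
        fprintf(stderr, "OpenVG error 0x%04x after %s\n", (unsigned)err, where);
        exit(1);
    }
}

int main(void)
{
    init_context();

    /* Static objects, created once at startup (as in the failing app). */
    VGFont font = vgCreateFont(256);
    check("vgCreateFont");

    VGfloat origin[2] = { 0.0f, 0.0f };
    VGfloat escapement[2] = { 10.0f, 0.0f };
    for (VGuint i = 0; i < 256; i++) {
        VGPath glyph = vgCreatePath(VG_PATH_FORMAT_STANDARD, VG_PATH_DATATYPE_F,
                                    1.0f, 0.0f, 0, 0, VG_PATH_CAPABILITY_ALL);
        /* ...append real segment data here... */
        vgSetGlyphToPath(font, i, glyph, VG_FALSE, origin, escapement);
        vgDestroyPath(glyph);   /* per the OpenVG 1.1 recommendation */
        check("glyph setup");
    }

    /* Dynamic churn in a tight loop; vgGetError() should flag the first
     * sign of trouble long before it becomes visible on screen. */
    for (unsigned long iter = 0; ; iter++) {
        VGPath p = vgCreatePath(VG_PATH_FORMAT_STANDARD, VG_PATH_DATATYPE_F,
                                1.0f, 0.0f, 0, 0, VG_PATH_CAPABILITY_ALL);
        vgDestroyPath(p);
        check("path churn");
        if (iter % 100000 == 0)
            printf("iteration %lu OK\n", iter);
    }
}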

I notice that a number of threads get started when launching OpenVG. The names are “VCHIQ completion”, “HDispmanx Notif”, “HTV Notify” and “HCEC Notify”. Is there a VC “garbage collector / memory manager” running in user space, or is memory management done on the VC side only? In short, I am struggling to get a handle on debugging this, and even when errors materialize, that doesn’t help me towards solving the bug without understanding the full architecture and having access to the relevant sources.

popcornmix commented 9 years ago

The OpenVG code on the ARM side (/opt/vc/lib) is a pretty thin layer that remotes function calls to the GPU. It's unlikely the bug lies there. Memory allocations occur on the GPU. You can see the allocations with "sudo vcdbg reloc" and "sudo vcdbg malloc"; the reloc output is likely to be more interesting.

Note the vcdbg commands read GPU memory in an unsafe way. Running "vcgencmd cache_flush && sudo vcdbg reloc" is a little safer, but you may still get spurious results if the GPU is busy. If you get the same output twice in a row then it's likely to be valid.

Memory corruption may be reported here (if guard words are trampled), but only believe it if you get it reported repeatedly (you may get a spurious corrupt result if an alloc/free occurs whilst the ARM is walking the heap).

gitbf commented 8 years ago

I’ve been running Jessie Lite (2015-11-21) since it was released. The application I’ve been using for the longevity test (run-forever is the goal) has a 10 Hz graphics refresh rate with a combination of text and simple graphics (lines, squares, circles, … instrument dials). Unfortunately, the issues are the same as on Wheezy, with corrupted glyphs after leaving it running for a few days (or sometimes a few hours).

Regular checking with “vcgencmd cache_flush && sudo vcdbg reloc stats” is pretty consistent (fairly high reloc/alloc activity), but there is always plenty of space left (compaction counts remain zero). The ARM memory footprint of the application is stable at around 1% of the total.

Glyphs get created at program start using the “vgSetGlyphToPath” function. They remain static for the lifetime of the running program. Paths get destroyed (vgDestroyPath) right after vgSetGlyphToPath in accordance with OpenVG 1.1 recommendations:

“Applications are responsible for destroying path or image objects they have assigned as font glyphs. It is recommended that applications destroy the path or image using vgDestroyPath or vgDestroyImage immediately after setting the object as a glyph.”

For lack of better ideas, I tried compiling a version that does not destroy the paths, and this turned out to make a difference. I’ve not observed glyph corruption with the vgDestroyPath calls (as used for glyphs) commented out, and it comes back when they are included. So it appears there may be an issue with memory allocation / reference counting related to the vgSetGlyphToPath function.
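For reference, the glyph setup in question follows the standard OpenVG pattern sketched below; the workaround is simply to skip the final vgDestroyPath() call, leaking one path handle per glyph in exchange for stability. The outline data (numSegments, segments, coords) and the font/origin/escapement variables are assumed to be defined elsewhere in the program:

VGPath path = vgCreatePath(VG_PATH_FORMAT_STANDARD, VG_PATH_DATATYPE_F,
                           1.0f, 0.0f, 0, 0, VG_PATH_CAPABILITY_ALL);
vgAppendPathData(path, numSegments, segments, coords);   /* glyph outline */
vgSetGlyphToPath(font, glyphIndex, path, VG_FALSE, origin, escapement);
vgDestroyPath(path);   /* commenting this out is the workaround */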

Any ideas on how to debug this?

gitbf commented 8 years ago

Using "vcdbg reloc small", I was able to check individual GPU memory fragments allocated for glyphs and it appears that reference counting is correct. That is GPU memory buffers get incremented/decremented as expected - so back to start again.

There are still no signs of glyph corruption (with the hack described in the previous post enabled), but I've since observed a few GPU dead-locks. My application is multi-threaded, but OpenVG is only used in a single thread, so a GPU dead-lock is not expected. A stack dump shows the following sequence leading up to the dead-lock:

eglSwapBuffers()
  _new_sem_wait()
    do_futex_wait()   <-- dead-locks here

Looking at the other running threads, I can see that ILCS_HOST, HCEC_Notify, HTV Notify and HDispmanx Notif are all waiting in a call to do_futex_wait(), while VCHIQ completion is waiting in a call to select(). The command "vcgencmd cache_flush && sudo vcdbg reloc stats" suggests I have plenty of GPU memory available.

Still at a loss, and I don't quite know where to go from here. Anyone feeling inspired to chip in with knowledge that could help narrow down this GPU memory corruption/lock issue?

pelwell commented 8 years ago

To get any kind of traction you'll need to provide a test application, and accompanying instructions, that demonstrates the problem in as little time as possible. If we know that running something overnight is almost guaranteed to show the problem then it may get some attention, but bear in mind that it will be competing with issues which affect more people and require less dedication to investigate.

SamuelBrucksch commented 8 years ago

Anything new here? Looks like other users have random crashes as well and it seems to be the same problem. I use the ajstarks lib as well...

gitbf commented 8 years ago

"Anything new here?"

Not much I'm afraid - the error (GPU dead-lock and/or memory corruption) is still there in the latest Raspbian release build (Jessie lite May 27th, 2016).

It really is a shame as it severely limits usage of the Pi beyond the academic scene.

6by9 commented 8 years ago

https://github.com/raspberrypi/linux/issues/943#issuecomment-183483302

To get any kind of traction you'll need to provide a test application, and accompanying instructions, that demonstrates the problem in as little time as possible.

No sign of that, therefore little (if any) investigation will have been done within the firmware.

SamuelBrucksch commented 8 years ago

I can provide my OSD code that causes the problem. The only issue with that is that sometimes it happens within 30 minutes or so and sometimes only after hours, so there really is no quick way to show it.

gitbf commented 8 years ago

Thought I would share an example image that shows GPU memory corruption. In this example, the 's' character has been corrupted.

The program this is taken from creates 256 vector characters at startup, and these remain static for the lifetime of the application. A number of other objects/shapes, however, are constantly allocated/destroyed. In OpenVG, vector fonts are created from path segments converted to glyphs. Rendering and memory management are entirely within the GPU, so an issue like this should not happen unless there is a bug in the underlying library (possibly in code running on the GPU).

As explained in the posts above, this may happen after leaving the application running for a couple of hours, a few days or even weeks. This makes it very hard to debug, as there is no (known) way to create a sample that will fail in a predictable/useful manner.

Knowing there are some smart people out there with much more knowledge of the Pi HW than myself - what other options exist for debugging this issue?

[image attachment: gpu-bug-clip, a screenshot showing the corrupted 's' glyph]

6by9 commented 8 years ago

Pictures don't help - there's no way to debug a picture.

@SamuelBrucksch if you want to share some simple code with build instructions, then please do.

SamuelBrucksch commented 8 years ago

You can find it here including installation instructions: https://github.com/SamuelBrucksch/wifibroadcast_osd

6by9 commented 8 years ago

Thank you. I will try to find time to set up a test rig and do a first level investigation.

SamuelBrucksch commented 8 years ago

People have reported that the more graphical elements there are, the higher the chances of a freeze. You can enable more elements in osdconfig.h. I will upload a telemetry file later that you can feed in via stdin, so you can actually see something.

SamuelBrucksch commented 8 years ago

I just checked and there is already a telemetry dump. So what you have to do is select LTM in the lower section of osdconfig.h, and then, once you have built the OSD, run this command:

while true; do cat raw_dump.txt; done | ./osd

Ruffio commented 8 years ago

@gitbf has your issue been resolved? If yes, then please close this issue.

gitbf commented 8 years ago

No, the issue remains unresolved. To the best of my knowledge this VC bug affects:

popcornmix commented 8 years ago

while true; do cat raw_dump.txt; done | ./osd

How long does it typically run before getting a problem?

gitbf commented 8 years ago

"How long does it typically run before getting a problem?"

That's part of the problem - there is nothing typical (hours/days).

Sometimes the application will deadlock:

eglSwapBuffers()
  _new_sem_wait()
    do_futex_wait()   <-- dead-locks here

... other times it's visual (corrupted glyphs):

[image attachment: screenshot of corrupted glyphs]

popcornmix commented 8 years ago

Can you try this test firmware: https://dl.dropboxusercontent.com/u/3669512/temp/firmware_vg.zip

gitbf commented 8 years ago

Certainly!

I now have a test setup running with the firmware - what should I expect?

popcornmix commented 8 years ago

Well I'm hoping no more corruption and hanging.

gitbf commented 8 years ago

Ok, let's see!

@SamuelBrucksch Will you try it as well?

SamuelBrucksch commented 8 years ago

Sure, but I can't try it within the next two weeks. However, some friends use my OSD and I will tell them that there might be a solution, so I think they will try it.

gitbf commented 8 years ago

Looking good so far. A high-load test has been running now for 4+ days without a glitch.

This makes a difference. A rig is being prepared this weekend that will feature at two exhibitions in September (5 Pis will display instrumentation on 15-inch monitors and a 6th will display a CCTV LAN feed).

@popcornmix thanks for your support!

popcornmix commented 8 years ago

Cool. The current fix is not quite in the right place, but we do know what is wrong and where we want to fix it. The test firmware (and the latest rpi-update firmware, which includes the current fix) should make your code reliable, but the problem could theoretically occur elsewhere, so we'd like to fix it at source.

I'll leave the issue open and will ping this issue when there is a final fix.

rodizio1 commented 8 years ago

I have also been testing the firmware for some days now with Samuel's OSD code, and it seems stable. Thanks a lot.

However, I noticed something different: somehow it's possible to "overload" the GPU. When the OSD is running and I start another modified hello_font.bin process, the HDMI output seems to stop completely for a second. The monitor then shows "no input signal".

It seems to be dependent on "load": it doesn't happen anymore when the GPU is overclocked. Alternatively, adding a usleep() line to the OSD code also helps.

popcornmix commented 8 years ago

Read https://github.com/raspberrypi/firmware/issues/407 - basically, there is a limit to the number/complexity of overlays that can be composited in real time. dispmanx_offline=1 switches to a non-realtime mode, but performance is reduced.

When debugging this issue, I was running 4 instances of your osd application. That caused the HVS output to underflow (and so caused the "no signal"). I used overclocking to avoid that:

core_freq=500
sdram_freq=550
over_voltage=4

which stabilised things, but overclocking is not guaranteed to work on all Pis.

rodizio1 commented 8 years ago

Thanks. After some optimizations (mainly adding usleep() calls to the loops that do the rendering) it seems to work stably now, even with standard clock settings.

Were the problems I had mainly from different applications trying to use the GPU at the same time, or is it "load" in general? In your testing, you needed 4 OSD processes to make the HDMI output stop. I assume that if one used a single application that draws 4 times as much, the problem wouldn't occur as easily?

The command to show the layers in use (from the thread you linked) gives:

root@wifibroadcast(rw):~# vcgencmd dispmanx_list
display:2 format:RGB565 transform:0 layer:-127 src:0,0,1920,1080 dst:0,0,1920,1080 cost:889 lbm:0
display:2 format:YUV_UV transform:0 layer:0 src:0,0,1280,720 dst:0,0,1920,1080 cost:1216 lbm:20480
display:2 format:RGBA32 transform:20000 layer:1 src:0,0,1920,1080 dst:0,0,1920,1080 cost:1156 lbm:0
display:2 format:RGBA32 transform:20000 layer:2 src:0,0,1920,1080 dst:0,0,1920,1080 cost:1156 lbm:0

Layer -127 seems to be the framebuffer console, 0 is hello_video.bin decoding/displaying a 720p 5 Mbit h264 stream, 1 is Samuel's OSD, and 2 is my modified hello_font.bin.

popcornmix commented 8 years ago

Makes no difference if it's a single app or several.

Just down to the number and cost of the layers. Larger source layers have higher cost (more data to fetch). Vertically resized layers have more cost (context memory).

If you are not using the default framebuffer console then disabling or making it smaller is beneficial.

rodizio1 commented 8 years ago

This is weird. It seems to happen again now (with standard clock settings). My modified hello_font.bin just displays a line of text fading in and then fading out again:

   int a;

   /* Fade in: step the text alpha up from 15 to 255. */
   for (a = 15; a <= 255; a = a + 30) {
      graphics_resource_fill(img, 0, 0, width, height, GRAPHICS_RGBA32(0, 0, 0, 0x00));
      render_subtitle(img, text, 0, text_size, y_offset + offset, a);
      graphics_update_displayed_resource(img, 0, 0, 0, 0);
      usleep(100000);   /* 100 ms per step */
   }

   usleep(1000000);     /* hold fully visible for 1 s */

   /* Fade out: step the alpha back down from 255 to 15. */
   for (a = 255; a >= 15; a = a - 15) {
      graphics_resource_fill(img, 0, 0, width, height, GRAPHICS_RGBA32(0, 0, 0, 0x00));
      render_subtitle(img, text, 0, text_size, y_offset + offset, a);
      graphics_update_displayed_resource(img, 0, 0, 0, 0);
      usleep(100000);
   }

But is this really "too much"? Samuel's OSD is just painting some lines and characters on the screen. I thought these GPUs were much more capable, considering that much more complex 3D games run on them. Or is hello_video.bin already causing that much load?

I also tried disabling the console; I think it helped (but it needs more testing).

popcornmix commented 8 years ago

Nothing to do with the complexity of each layer - just the number and size. Four 1080p layers is too much to fetch and composite in real time. Either reduce the number of layers or switch to offline composition (dispmanx_offline=1)

pelwell commented 8 years ago

Layers are composed of pixels, not 3D primitives. Fetching a blank screen takes as much time as a complex image.

rodizio1 commented 8 years ago

Ah, okay, thanks.

I guess that means the best thing would be to move the functionality of my hacked hello_font.bin into Samuel's OSD, so that there is no additional application creating an additional layer. Hmm, let's see if my non-existent C skills will allow for that ;)

rodizio1 commented 8 years ago

Hmm, now I just had a display freeze again. hello_video.bin and Samuel's OSD were running, but this time it was hello_video.bin that was in the [defunct] state. Running vcgencmd also doesn't work anymore; it just sits there and ctrl-c does nothing.

But the system is still working apart from that. Can I collect any other info you may need?

rodizio1 commented 8 years ago

The OSD crashed again, after just 10 minutes or so. But this time only the OSD froze; hello_video.bin is still displaying the video stream. Could it be that it is somehow related to the number of wifi sticks I'm using? USB or interrupt load or something? The last two times it crashed were with 4 and 5 wifi sticks for receiving.

rodizio1 commented 8 years ago

Now I sometimes see wrong renderings again. It just happens occasionally, and I'm not sure how often. It looks like it draws a big red triangle on the screen for a very short amount of time (probably one rendering; it redraws every 50 ms). It probably comes from the battery status element, as that is the only red thing in the OSD.

The difference is that it's not permanent. Before, with the older firmware, when the drawings got corrupted it looked very similar to the picture that gitbf posted, and it was permanent.

rodizio1 commented 8 years ago

I just booted up the system again; right before it would normally display the video image and OSD, the HDMI signal was briefly interrupted again. I can see that:

408 tty2 Sl+ 0:00 /usr/bin/vcgencmd get_camera
409 tty2 S+  0:00 grep -c detected=1

is running (or hanging) permanently (it's a line in my script that checks for the camera to determine whether the Pi is in the transmitter or the receiver role).

I tried running vcgencmd get_camera manually, already expecting that it would just sit and hang there, but it still works.

tty2, the terminal that the OSD would run on, shows "vchi_msg_dequeue -> -1(22)".

Edit: when I remove that vcgencmd get_camera line from my script it works; I tried 20 times in a row. When I insert the line again, all kinds of weird things happen, but not all the time, only about every 2nd or 3rd bootup. Sometimes the OSD doesn't start; sometimes it does start, but the screen goes black again after a few seconds. Just for clarity, the vcgencmd does not run concurrently with any other process that uses the GPU, like hello_video or the OSD.

rodizio1 commented 8 years ago

Have added "tvservice -o && tvservice -p" to a startup script to make sure it's not that layer thing, but it's still flaky, just bootet up, everything looks good, like after 30 seconds or so, the HDMI output suddenly drops again shortly :(

popcornmix commented 8 years ago

@gitbf I have now pushed a better fix for the issue; it is available with rpi-update. Can you update to that and confirm the issue is resolved?

gitbf commented 8 years ago

Thanks!

Just upgraded the test setup, which, by the way, had been running for 18+ days without a glitch prior to the reboot.

lucasvl commented 7 years ago

Thanks! I had the same problem and this solves it.

popcornmix commented 7 years ago

@gitbf okay to close?

gitbf commented 7 years ago

Yes, indeed.

Thanks for sorting this one out - case closed!