Repeatedly using EGL freezes VideoCore

yoth commented 11 years ago

I created an application that uses EGL and I wondered why it freezes. The problem lies with EGL. It can be reproduced with the following code:

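#include <bcm_host.h>   /* bcm_host_init, bcm_host_deinit */
#include <EGL/egl.h>    /* eglGetDisplay, eglInitialize, eglTerminate */

int main(int argc, char *argv[])
{
   EGLDisplay   display_ = EGL_NO_DISPLAY;

   /* Opens the VCHIQ services to the GPU; the hang shows up in this call. */
   bcm_host_init();

   display_ = eglGetDisplay(EGL_DEFAULT_DISPLAY);
   eglInitialize(display_, NULL, NULL);

   eglTerminate(display_);
   display_ = EGL_NO_DISPLAY;

   bcm_host_deinit();

   return 0;
}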

I just call eglInitialize and eglTerminate repeatedly. EGL itself works properly but when I start that application around 5 to 100 times it freezes randomly at startup. The process cannot be killed (even with kill -9) and never returns.

Debugging shows that it hangs at:

#0  0xb6d17b5c in ioctl () from /lib/libc.so.6
#1  0xb6f41908 in create_service.constprop.4 () from /usr/lib/libvchiq_arm.so
#2  0xb6f4404c in vchi_service_open () from /usr/lib/libvchiq_arm.so
#3  0xb6f53c84 in vc_vchi_gencmd_init () from /usr/lib/libbcm_host.so
#4  0xb6f5256c in bcm_host_init () from /usr/lib/libbcm_host.so
#5  0x00008660 in main (argc=, argv=) at egl_bug.cpp:17

popcornmix commented 11 years ago

I can reproduce your problem. I'm guessing there is some leak occurring on the GPU, meaning that eventually egl fails to start. (In fact all of vchiq is stuck). I'll attach a debugger when I get the chance and see if I can work out what's going on.

rec commented 11 years ago

Hello!

We're the pi3d project https://github.com/tipam/pi3d, and this seems also to be happening to us, fairly repeatably. If you needed a reproducible test case we could probably come up with one - in Python though...

We have a strong interest in getting this working - a lot of people are interested in kiosk-type applications where the program runs for a very long time, and this seems to be, as of now, a stumbling block to that.

Thanks in advance!

cleverca22 commented 11 years ago

it would help a lot to have a kernel backtrace, its easy to do

first, reproduce the problem, then run this command as root

echo L > /proc/sysrq-trigger

then you should have a kernel backtrace of EVERY process in dmesg, but it might overflow

echo W > /proc/sysrq-trigger

this one will only show the backtrace for blocked tasks, but i'm not sure if vchiq counts as blocking io

just dig thru dmesg for the pid for your hung process, and cut that section out, and paste it in here

rec commented 11 years ago

Thank you - we'll be on this in the next day or two.

paddywwoof commented 11 years ago

cleverca22, I have done what you suggest but running as root echo L > /proc/sysrq-trigger doesn't seem to generate useful info, just: SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E)... etc. Same when feeding in a W. If I try other keys I can get the same message followed by never-ending new prompts that have to be stopped by ^C.

Probably not useful but what I did find is that when I modified a program in geany in X to create and destroy the egl surface repeatedly it went wrong after 50 to 100 goes. Running the same program from command line without starting anything else I had to increase the loop to 5000. Also, for some reason the gpu crashing is now being caught by assert self.surface != EGL_NO_SURFACE i.e. it won't run but doesn't freeze, so maybe I'm not reproducing exactly the same error as before or maybe the general case stops working at some random opengles function where there is no error trapping (more likely).

Anyway, any pointers for SysRq appreciated and I will try and get something in dmesg for you.

cleverca22 commented 11 years ago

oops, its case sensitive, must be a lowercase L or W

l doesnt show up well in this font, so i tried typing L to make it more obvious, but that seems to defeat the entire purpose!

paddywwoof commented 11 years ago

I'm 99.9% sure I tried that as all the help I found online had lower case characters. And I've tried it now but no joy. The actual help text that gets bounced back (after an unrecognised char) is:

SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E) memory-full-oom-kill(F) debug(G) kill-all-tasks(I) thaw-filesystems(J) saK show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W)

(unhelpfully upper case, so you're following a fine tradition ;)) There doesn't look to be an 'l'. 'w' does work but I can't see anything helpful in dmesg. Mainly cfs_rq[0]:/autogroup-n as below.

However, as I mentioned above, when I reproduce the error by creating and destroying the display thousands of times I get python closing with the assert statement so the process is no longer there to watch. I will try to get a 'natural' freeze then run echo w > /proc/sysrq-trigger and see what I get.

[ 1085.111365] SysRq : Show Blocked State
[ 1085.115146] task PC stack pid father
[ 1085.115249] Sched Debug Version: v0.10, 3.6.11+ #456
[ 1085.115264] ktime : 1084986.777538
[ 1085.115274] sched_clk : 1085115.241000
[ 1085.115284] cpu_clk : 1085115.241000
[ 1085.115292] jiffies : 78498
[ 1085.115297]
[ 1085.115302] sysctl_sched
[ 1085.115312] .sysctl_sched_latency : 6.000000
[ 1085.115324] .sysctl_sched_min_granularity : 0.750000
[ 1085.115334] .sysctl_sched_wakeup_granularity : 1.000000
[ 1085.115343] .sysctl_sched_child_runs_first : 0
[ 1085.115352] .sysctl_sched_features : 24119
[ 1085.115363] .sysctl_sched_tunable_scaling : 1 (logaritmic)
[ 1085.115370]
[ 1085.115370] cpu#0
[ 1085.115379] .nr_running : 3
[ 1085.115387] .load : 3072
[ 1085.115394] .nr_switches : 1843928
[ 1085.115403] .nr_load_updates : 38960
[ 1085.115411] .nr_uninterruptible : 0
[ 1085.115420] .next_balance : 0.000000
[ 1085.115428] .curr->pid : 2539
[ 1085.115438] .clock : 1085110.604000
[ 1085.115446] .cpu_load[0] : 1024
[ 1085.115454] .cpu_load[1] : 512
[ 1085.115462] .cpu_load[2] : 256
[ 1085.115470] .cpu_load[3] : 128
[ 1085.115478] .cpu_load[4] : 64
[ 1085.115486] .yld_count : 0
[ 1085.115493] .sched_count : 1844498
[ 1085.115502] .sched_goidle : 440228
[ 1085.115510] .ttwu_count : 1357083
[ 1085.115517] .ttwu_local : 0
[ 1085.115531]

[ 1085.115544] .exec_clock : 214.815000
[ 1085.115557] .MIN_vruntime : 0.000001
[ 1085.115567] .min_vruntime : 210.604423
[ 1085.115578] .max_vruntime : 0.000001
[ 1085.115587] .spread : 0.000000
[ 1085.115598] .spread0 : -173767.254293
[ 1085.115605] .nr_spread_over : 1
[ 1085.115613] .nr_running : 1
[ 1085.115621] .load : 1024
[ 1085.115631] .se->exec_start : 1085110.604000
[ 1085.115641] .se->vruntime : 173974.929716
[ 1085.115651] .se->sum_exec_runtime : 214.886000
[ 1085.115660] .se->statistics.wait_start : 0.000000
[ 1085.115670] .se->statistics.sleep_start : 0.000000
[ 1085.115680] .se->statistics.block_start : 0.000000
[ 1085.115690] .se->statistics.sleep_max : 0.000000
[ 1085.115699] .se->statistics.block_max : 0.000000
[ 1085.115709] .se->statistics.exec_max : 27.599000
[ 1085.115719] .se->statistics.slice_max : 10.004000
[ 1085.115729] .se->statistics.wait_max : 30.136000
[ 1085.115738] .se->statistics.wait_sum : 297.548000
[ 1085.115747] .se->statistics.wait_count : 1338
[ 1085.115755] .se->load.weight : 1024
[ 1085.115764]

[ 1085.115776] .exec_clock : 1630.055000
[ 1085.115785] .MIN_vruntime : 0.000001
[ 1085.115795] .min_vruntime : 1396.418402
[ 1085.115804] .max_vruntime : 0.000001
[ 1085.115813] .spread : 0.000000
[ 1085.115822] .spread0 : -172581.440314
[ 1085.115830] .nr_spread_over : 10
[ 1085.115837] .nr_running : 0

cleverca22 commented 11 years ago

i was using my laptop for reference, L was show-backtrace-all-active-cpus(L) its the same as w, but it lists EVERY process, let me boot my pi up and see if i can reproduce anything

ah, i see the problem, it wasnt L to begin with, oops!

raspberrypi ~ # echo > /proc/sysrq-trigger t

and then checking dmesg for the frozen pid, i get this http://www.privatepaste.com/9698b2b593

https://github.com/raspberrypi/linux/blob/rpi-3.6.y/drivers/misc/vc04_services/interface/vchiq_arm/vchiq_core.c#L2628

this is the line where it hangs

it appears to be queuing a VCHIQ_MSG_OPEN msg to the gpu, then waiting for it to reply, and it never does

without the gpu source, the most i can do is make it die when you -9, but the gpu is likely to not recover

bluerobert commented 11 years ago

For your information only: When I start the omxplayer around 6 to 200 times it blocked randomly. omxplayer blocked : https://github.com/huceke/omxplayer/issues/178

cleverca22 commented 11 years ago

i did some quick tests with the sample program in the first post, if i run it in an infinite loop, it will crash within 120 runs but if i put in a 1 second delay, it takes 400 runs to crash

doesn't seem like a memory leak to me, seems more like random chance or a race condition

and now that i think of it, ive also sometimes had omxplayer lock the entire gpu up, but didnt think much of it, it reboots so fast, it wasnt really a bother!

rec commented 11 years ago

Oh, that's really distressing. It might not be a bother if you're an individual sitting in front of a computer - but honestly, if that were your target, why would you use a Raspberry Pi in the first place?

It seems to me that a big use for the Raspberry Pi is embedded systems - a kiosk, a console, a display somewhere. The RP makes all sorts of things possible, since the computer costs less than the display! But if the unit "randomly" locks up or crashes, all of these things are no longer possible.

My suggestion - we should work on resolving the OpenGL issue because that's easier to identify, and then once that's all solid and done, we should then turn our attention to omxplayer - with luck fixing OpenGL will fix that, else we have at least eliminated one issue.

Thanks to all for their work on this! I'll forward this to Patrick, who's been running these tests but doesn't seem to be CC'ed on this bug, and perhaps we can get you some better instrumentation...

paddywwoof commented 11 years ago

Tom, I do keep occasionally checking here! @cleverca22, when you say '..without the gpu source..' is that something that Broadcom have to fix or is it at github.com/raspberrypi/userland/ ? I haven't even attempted to get my head round any of the code there (or here).

It also feels more like a random event to me. Did you put the pause between the bcm_host_init, eglGetDisplay and eglInitialize? I might try that. [doesn't really make sense though because it has to wait for the functions to return something. My assertion fails after eglGetDisplay has returned EGL_NO_DISPLAY; will try a delay anyway]

paddywwoof commented 11 years ago

OK so I withdraw my opinion on randomness! After running with 0.25s pause between the three functions above (twice) and with 0.1s pause I get the crash after 1023 loops i.e. presumably on the 1024th attempt.

paddywwoof commented 11 years ago

I've now worked my way down delaying: 0.25, 0.1, 0.01, 0.001, zero and it always fails after 1023. The fact that it appeared random before must have been because I didn't run the script to cause the crash immediately after rebooting the pi. I will now try running a few other egl programs before running the crash generator and see what else might eat into the number of lives.

rec commented 11 years ago

As you're implying, the exact count of 1023 seems to prove that this isn't actually a memory leak but some other resource with 1023 entries, or an index that's being incremented until it reaches a limit.

This might make it easier to find in the source code. There might well be a constant set to be 1024 that's strongly connected to the Bad Thing.
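To illustrate the reasoning above: a leak against a fixed-size table fails after the same number of iterations however long you pause between them, which is a different signature from a race. The sketch below is only a toy model, not the firmware's code. `SlotTable`, `iterations_until_failure` and the capacity values are made up for illustration, and the model does not try to explain whether the observed count is 1023 or 1024.

```python
class SlotTable:
    """Hypothetical fixed-size resource table (not the real GPU structure)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0

    def acquire(self):
        # Fail once every slot is taken, like a hard-coded limit would.
        if self.in_use >= self.capacity:
            raise RuntimeError("no free slots")
        self.in_use += 1

    def release(self):
        self.in_use -= 1


def iterations_until_failure(capacity, leak, max_iterations=5000):
    """Return how many create/destroy cycles succeed before acquire fails.

    With leak=True the slot is never released, so the count is always
    exactly `capacity`. Returns None if no failure occurs within
    max_iterations.
    """
    table = SlotTable(capacity)
    for completed in range(max_iterations):
        try:
            table.acquire()
        except RuntimeError:
            return completed
        if not leak:
            table.release()
    return None
```

With a leak the result depends only on the table size, never on timing, which matches a failure count that stays the same across delays of 0.25s down to zero.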

/t

http://radio.swirly.com - art music radio 24/7 366/1000

rec commented 11 years ago

I checked in the test we're using here: https://github.com/tipam/pi3d/blob/develop/experiments/CreateManyDisplays.py

If you wanted to try it out you could download the "develop" branch of pi3d (https://github.com/tipam/pi3d/archive/develop.zip) and run the file experiments/CreateManyDisplays.py directly.

cleverca22 commented 11 years ago

it would be possible to modify the kernel to detect this problem (add a timeout?) and then auto-reboot, but fixing the root problem would likely only be possible for broadcom

rec commented 11 years ago

I'm a little unclear who's on this thread - do we need to/should we file another bug somewhere or email someone?

popcornmix commented 11 years ago

@rec I'm aware of the problem, but I have a long list of things to fix.

I'm hoping that a key bit of information will be posted in this thread that will make the solution become clear, as tracking down a race condition in the vchiq startup/shutdown that only occurs occasionally sounds like a particularly tricky problem to solve.

It will be solved when I get the chance.

rec commented 11 years ago

Yes, I can only imagine how much work there is there! Very sympathetic.

The most recent test case does however appear to be deterministic and fail after exactly 1023 loops, so it might be somewhat easier to track down.

Best!

cleverca22 commented 11 years ago

sounds like 2 related problems, one crashing it after ~100 opens, and another at 1024, the only way i can see to progress, would be to get a backtrace of the gpu core, and inspect the code in that path

it is easy to reproduce, just run the program from the 1st post inside this

for x in {0..200};do ./main;done

but depending on how the gpu debug tools work, a hung gpu may make it hard to debug?, jtag time?
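The shell loop above stalls silently when a run hangs. A small harness can instead report which run stopped returning. This is a rough sketch under assumptions: `runs_until_hang` and the deadline value are invented for illustration, and the hung child is deliberately left alone because the thread reports that kill -9 has no effect on it.

```python
import subprocess
import sys


def runs_until_hang(argv, max_runs, deadline_seconds):
    """Run argv repeatedly; return the 1-based run that missed the deadline.

    Returns None if every run finished in time. A run that misses the
    deadline is not killed here, since the hung process in this thread
    ignores signals anyway.
    """
    for run in range(1, max_runs + 1):
        proc = subprocess.Popen(argv)
        try:
            proc.wait(timeout=deadline_seconds)
        except subprocess.TimeoutExpired:
            print("run %d did not return within %.1fs" % (run, deadline_seconds),
                  file=sys.stderr)
            return run
    return None
```

For the test program from the first post this could be called as `runs_until_hang(["./main"], 200, 10.0)`. The 10 second deadline is a guess and would need tuning to the normal startup time.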

paddywwoof commented 11 years ago

It's slightly odd that running (effectively) the same routine but from python gives the 1024 limited error whereas the problem I first noticed was a random freezing after 5 to 100 egl startups, very much like @yoth's symptoms.

I have made a little script to open and shut the Minimal graphics demo until it freezes and put the processes dmesg here http://www.privatepaste.com/91f23f36c1 It's very similar to the one @cleverca22 generated but has various other python and terminal processes, also there are two python and VCHIQ completio blocks for some reason. The dump for the 1024 bug only has the initial VCHIQka-0 section.

paddywwoof commented 11 years ago

I've run RunMultipleMinimals 24 times now and, though it's not a thorough statistical analysis, it does look very like a random occurrence with a probability of about 3% that egl will freeze starting up. [image: gpu_crash]
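As a rough check on what a 3% figure would imply, here is a small calculation. It assumes every startup is an independent trial with the same freeze probability, which nothing in this thread establishes. The function names are illustrative only.

```python
def survival_probability(p_freeze, starts):
    """Chance that `starts` consecutive startups all complete without a freeze."""
    return (1.0 - p_freeze) ** starts


def expected_starts_until_freeze(p_freeze):
    """Mean number of startups up to and including the first freeze
    (mean of a geometric distribution)."""
    return 1.0 / p_freeze
```

Under that assumption a 3% chance gives about 33 startups on average before a freeze, roughly an 86% chance of getting through 5 startups and roughly a 5% chance of getting through 100, which is in line with the "5 to 100" range in the first post.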

cleverca22 commented 11 years ago

that dump appears to include do_exit, which i think is a secondary problem if you kill -9 while its hung on the first problem, it will then try to properly disconnect all vchiq stuff, and with the GPU hung, that then causes the process to lock up solid

AdrienSchwartzentruber commented 11 years ago

I have a similar issue on my device (I wrote a post on the Raspberry forum: http://www.raspberrypi.org/phpBB3/viewtopic.php?f=70&t=46873&p=368452#p368452).

In my case, I see the freeze more often, but I'm using higher layers (gstreamer-1.0). I see exactly the same stack as @cleverca22 in dmesg.

I will be happy to help, and I'm looking forward to progress on this issue.

hehe2 commented 11 years ago

Damned... we got a buggy GPU driver and it's closed source, what can you do ? Even with the best intentions and knowledge, you won't be able to fix this issue...

Should we use avaaz to make a global petition to get the full GPU source code disclosure ?

rec commented 11 years ago

It's a nice idea but I suspect it isn't going to happen. Someone has invested a lot of money in that driver and it's likely that they don't see the profit in open-sourcing it. Short-sighted, IMHO, but there we are.

I have to say that this has, for me, put RP graphics development somewhat on the back burner. The fact that this happens in ALL graphical applications, including omxplayer, makes it impossible to come up with a workaround. Most of my applications are installations and other things that work for a long time. I can't tell clients, "It will work for a while and then it will stop and someone has to go and reboot it." (Yes, I can do it with a cron job and might in a pinch - but how lame is that? And, what if the program gets a lot of use in a short period and then hangs long before the cron goes off?)

I recently received a Beagle Bone Black and I'm playing with that (though I suspect that's even less mature).

Regardless, for the moment, for graphics the Raspberry Pi is in the "toy, not yet ready for prime time" category, and will remain so until someone can demonstrate OpenGL programs that can run continuously without needing to reboot the machine periodically.

bluerobert commented 11 years ago

Hi Tom, 100% ACK.

cleverca22 commented 11 years ago

i could modify the kernel to detect this problem, and return an error code, or force a hard reset ASAP

then you wont need a cron job, it will fix itself

but it will also have a chance of just rebooting the instant a user tries to interact with it

rec commented 11 years ago

If the error isn't going to be fixed for a while, returning an error code might be the best of a bad job - you can detect it and reboot, or just ignore it.

But there aren't really any good choices that don't involve preventing the bug from happening in the first place.

cleverca22 commented 11 years ago

i'll also need to fix it ignoring kill -9, or a reboot will never finish

paddywwoof commented 11 years ago

Not that I want to downgrade the urgency of this fix but.. I think most situations where the RPi will be running as a console would create the display surface just once and leave it there, simply changing or moving the content, which seems to be fine. Where a new surface is being created repeatedly (as causes this error) it would normally have a user starting an app. I noticed the problem because of my hack coding: tweak a line, run it, find a typo, fix it, run it, doesn't do what I expected, fix it, run it, etc etc. even then it didn't happen often enough for me (or anyone else) to notice it.

If it's relevant I don't think it happened with earlier versions of the operating system, was there a gpu firmware upgrade 6 months ago?

Is this same chip/firmware used in 'proper' devices with the likes of apple or samsung ready to lean on broadcom?

popcornmix commented 11 years ago

This will be fixed. Hopefully I'll get some time to spend on it this week.

rec commented 11 years ago

Can we send you chocolate, beer or gourmet coffee to help you in your endeavour? :-)

Best luck to you, let us know if we can do anything at all to help.

popcornmix commented 11 years ago

Okay, I've managed to spend some time on this and found a possible deadlock case (using vchiq with a lock held, that a vchiq callback wanted to acquire). I fixed that and my crash test is still running after rather longer than usual.
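For readers unfamiliar with this kind of deadlock, the pattern described (calling into a service while holding a lock that the service's callback also needs) can be modelled in a few lines. This is a generic illustration with Python threads, not the vchiq code. `call_service` and its lock are hypothetical, and timeouts stand in for what would be an indefinite wait in the real case.

```python
import threading


def call_service(hold_lock_during_call, timeout=0.5):
    """Return True if the simulated call gets its reply, False if it stalls.

    The callback must take state_lock before it can signal the reply. If
    the caller still holds state_lock while waiting for that reply, neither
    side can make progress until a timeout breaks the cycle.
    """
    state_lock = threading.Lock()
    reply_ready = threading.Event()

    def callback():
        # Runs on the "service" thread and needs the caller's lock.
        if state_lock.acquire(timeout=timeout):
            try:
                reply_ready.set()
            finally:
                state_lock.release()

    worker = threading.Thread(target=callback)
    if hold_lock_during_call:
        with state_lock:
            worker.start()
            # Waiting for the reply while holding the lock: the stall.
            completed = reply_ready.wait(timeout=timeout / 2)
    else:
        with state_lock:
            pass  # update shared state, then drop the lock before calling out
        worker.start()
        completed = reply_ready.wait(timeout=timeout)
    worker.join()
    return completed
```

Without the timeouts the first branch would wait forever, which would look like the hang in the original report; releasing the lock before making the call is one common way out.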

Try rpi-update.

rec commented 11 years ago

Oh, well done!

I'll be on that soonest.

Deadlocks, sigh. The amount of time I've spent on them in my life...

rec commented 11 years ago

I went through rpi-update and the reboot.

I have one test available to me right now, this one (https://github.com/tipam/pi3d/blob/develop/experiments/CreateManyDisplays.py), but unfortunately that still had the same problem of running 1022 times and then failing on the 1023rd time... :-(

Patrick's more the expert on this problem, let's see what happens when he weighs in...

paddywwoof commented 11 years ago

Just a quick look in, I'm not going to get chance to do anything until Monday unfortunately. Maybe there are two issues: a) deadlock with x% chance of happening giving the random failure over a few hundred initialisations and b) some resource allocation that isn't released properly and crashes at 1024. If the C crash test doesn't hit this specific limit then it's likely that the issue is in the python wrapper, if it does then it sounds like two graphics issues.

There is a question as to why the CreateManyDisplays didn't get tripped up by the deadlock issue!!

Tom, there is another test routine: https://github.com/tipam/pi3d/blob/develop/experiments/RunMultipleMinimals.py that gave the random failure - you don't need the 7s delay, it doesn't seem to make any difference.

@popcornmix thanks for your help on this.

popcornmix commented 11 years ago

The issue I've fixed is the one reported in the first post of this thread. It sounds like there is a different, unrelated bug with 1024 "somethings" running out.

popcornmix commented 11 years ago

I've dug in a bit. Seems to be two issues. The main one is a bug in pi3d. To remove an element, you need an update. It should look like:

    self.dispman_update = bcm.vc_dispmanx_update_start(0)
    bcm.vc_dispmanx_element_remove(self.dispman_update, self.dispman_element)
    bcm.vc_dispmanx_update_submit_sync(self.dispman_update)

and that should be done before bcm.vc_dispmanx_display_close. You are removing the element from the display.

Currently bcm.vc_dispmanx_element_remove is returning an error (which should be checked) and not removing the element. The failure occurs because we have 1024 elements "active", which is a hard-coded limit.

I believe this fixes the reported problem. You now won't fail after 1024 iterations.
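A minimal sketch of what checking that return value could look like. The three vc_dispmanx_* names are the ones used in this thread. Everything else is an assumption: `remove_element` and `try_remove` are hypothetical helpers, a non-zero return is assumed to mean failure, and `FakeBcm` is a stand-in so the sketch can be run without a Raspberry Pi.

```python
def remove_element(bcm, element, priority=0):
    """Remove a dispmanx element inside an update and raise on failure.

    `bcm` is the object that exposes the vc_dispmanx_* calls. Treating a
    non-zero return code as an error is an assumption.
    """
    update = bcm.vc_dispmanx_update_start(priority)
    result = bcm.vc_dispmanx_element_remove(update, element)
    bcm.vc_dispmanx_update_submit_sync(update)
    if result != 0:
        raise RuntimeError("vc_dispmanx_element_remove failed: %d" % result)


class FakeBcm:
    """Stand-in for the real library, used only to exercise the helper."""

    def __init__(self, remove_result=0):
        self.remove_result = remove_result
        self.calls = []

    def vc_dispmanx_update_start(self, priority):
        self.calls.append("update_start")
        return 1  # arbitrary fake update handle

    def vc_dispmanx_element_remove(self, update, element):
        self.calls.append("element_remove")
        return self.remove_result

    def vc_dispmanx_update_submit_sync(self, update):
        self.calls.append("update_submit_sync")
        return 0


def try_remove(bcm, element):
    """Return True if the removal succeeded, False if it reported an error."""
    try:
        remove_element(bcm, element)
    except RuntimeError:
        return False
    return True
```

Raising instead of ignoring the return code means a leaked element shows up at the first failed removal rather than a thousand iterations later.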

The second issue is we don't clean up correctly when the python process quits. Normally resources should be freed when the process finishes. (GPU side) dispmanx keeps track of active elements so it can free them on process exit. It sees the vc_dispmanx_element_remove call and removes the element from its clean-up list. Unfortunately it doesn't check if it failed to remove it. I'll look into that.

popcornmix commented 11 years ago

Firmware is now updated to correctly free elements/resources that have previously returned an error when attempting to free them.

CreateManyDisplays.py will now correctly fail after 1024 updates, but after killing python and restarting it should run for another 1024 iterations.

The bug in pi3d needs a patch like: http://pastebin.com/JfDuU6sj

rec commented 11 years ago

Well, this is super-excellent.

I'll have a chance to put this in in just a few minutes... I'll let you know how it goes!

rec commented 11 years ago

So this overdelivers!

I checked in the patch to pi3d, and did the whole update and reboot action again - now the CreateManyDisplays runs without stopping for at least 5000 iterations. Not quite sure how that would really work...??

I'm now running RunMultipleMinimals.py a bunch of times to see what's happening there.

paddywwoof commented 11 years ago

@popcornmix I haven't managed to hit a limit opening and closing EGL displays so I believe the issue is (both issues are) fixed!

Thanks again for all your help.

rec commented 11 years ago

As far as I can tell this has fixed it. Thanks to popcornmix! There aren't many circumstances where this issue would actually cause a problem but if you run into it you will need to have the latest version of pi3d/util/DisplayOpenGL.py and run rpi-update (apt-get update/upgrade doesn't work, presumably it will once the fixes move into the official release)

popcornmix commented 11 years ago

@yoth can you test and close?

yoth commented 11 years ago

I'm currently on vacation and don't have a Pi with me. Will test this in about 2 weeks.

popcornmix commented 11 years ago

Okay. There's enough evidence that the problem is fixed, so I'll close it. Please reopen if your testing fails.

rec commented 11 years ago

Thanks for all your work!!!

yoth commented 11 years ago

I'm back from vacation. The problem is fixed - I can no longer reproduce it. Also tested similar problems that occurred with omxplayer and gstreamer. They are gone too. Good work!

raspberrypi / firmware

Repeatedly using EGL freezes VideoCore #185
