p-e-w / finalterm

At last – a modern terminal emulator (NO LONGER MAINTAINED)
http://finalterm.org
GNU General Public License v3.0
3.84k stars 179 forks source link

Fix Ctrl+D Crashes and Freezes (#296) #311

Closed onli closed 10 years ago

onli commented 10 years ago

Cleaned version of https://github.com/p-e-w/finalterm/pull/306

p-e-w commented 10 years ago

Seeing the cleaned code, I have at last understood how your approach works (please correct me if I'm wrong): Every second, it checks whether posixCloseEvent is set, and if yes, closes the active pane. posixCloseEvent, on the other hand, is set when a shell process exits but we are not within the 2 second closingProcessRunning time window, the assumption being that in this case the shell must have exited directly.

Regrettably, now that I grasp the idea, I have been able to break it quite easily:

Clearly, lowering the loop timeout would not fix this, only lowering the kill timeout could – but this is likely to reintroduce crashes again :unamused:

Another issue, at least in theory, is that it is possible to terminate a shell, then quickly switch to another tab/pane so that the wrong TerminalWidget is active when the loop realizes the shell was closed, resulting in the wrong widget getting closed. This again could be mitigated by lowering the timeout, but that would get arbitrarily ugly while never being completely safe. More issues might exist, and by rapidly closing panes and switching around I have actually been able to produce all kinds of weird and unwanted behavior.

We should probably look at the threading/signalling angle again, and abandon timeouts altogether. The fact that, as you wrote, threads_add_idle causes crashes again seems to indicate that the actual cause might be a different one anyway.

muchweb commented 10 years ago

:+1: Application crashes when I open two panes, and type ^D in one. This does not depend on time (crashes even when used for some time).

onli commented 10 years ago

Ok, I see.

Clearly, lowering the loop timeout would not fix this, only lowering the kill timeout could – but this is likely to reintroduce crashes again :unamused:

Yep. So I retried threads_add_idle. The interesting thing is that it is not the same crash - but rather it looks like the exact same lock, but this time already when calling Gdk.threads_add_idle(()and way less often - so something (I still suspect something animation related) seems to lock the application when calling Gdk-functions/Gdk-thread related stuff (and that is beyond my level so far).

So, workaround is to not do this, to find a good way to react onto the signal from the main thread. g_io_channel maybe, but so far I'm trying to query this in threads_add_idle (which has the risk of higher cpu usage), and I think that is a working solution. Will test that a bit more and push the result (in any case, it's better than the current state). Just as an update.

@muchweb: Can't confirm this, sorry.

onli commented 10 years ago

Ok, pushed. I even got the g_io_channel working, so this is now almost instant, using no timeouts, and I saw no issues with panes or timings.

Maybe Gdk.threads_add_idle(() is simply invalid to call outside the main thread as well, the documentation uses it rather for widgets who could later be already finalized, calling it inside the main thread. And trying the clutter-locks and thread-calls made no difference at all, so my hunch with animations was probably wrong.

Edit: Introduced a really dumb bug when cleaning up, fixed that in the follow-up commit.

p-e-w commented 10 years ago

I just pushed a commit to master making two changes:

Please try if you can reproduce crashes with these changes. If no, I guess we can consider the crash/segfault issues solved, leaving the (reduced) hangs.

onli commented 10 years ago

If I see that right, to wrap it in Gdk.threads_add_idle will greatly reduce the hangs on ctrl+d, but it will still freeze sometimes (about 1 in 50 times in my tests with my commits, felt like more if closing the tabs very fast). To get around this issue, the "signal the main-thread"-approach seems to be needed.

But I will test it the next days. Maybe the two can be combined and thus the timeout replaced. Nice catch.

onli commented 10 years ago

Ok, had an earlier chance to test this then I thought.

The crashes are gone, that is nice. But all in all, it is not working properly for me. Ctrl+D still freezes very frequently, and tab closes freeze as well sometimes.

If you want, I could fix the bug with the zombie tabs in the last commit and send this again. I could probably keep both your fixes and thus get rid of the timeout as well (and of course fix my typo ;) ).

I saw your comments and realize one doesn't have to like the approach with the fifo - and that my first approaches were all a bit flawed probably doesn't help - but it is working perfectly now; I had not one crash and not one freeze, and this includes my tests to toggle timing issues and panes -> we could completely fix the bug. And if there are really (small) new issues, or stuff to be improved like another way for the thread-communication, couldn't this be better done later (after more people could play with a more stable finalterm)?

p-e-w commented 10 years ago

And if there are really (small) new issues, or stuff to be improved like another way for the thread-communication, couldn't this be better done later (after more people could play with a more stable finalterm)?

As the crashes are already gone completely and the hangs are gone almost completely I don't think it's necessary to introduce any hacks at this point anymore. Let's fix this issue properly. The only task that remains after this change is to fix the hangs. I am now almost certain that these are caused by an entirely different problem. You had the crucial insight that the crashes were caused by GTK actions being performed from outside of the main thread. The mitigation for this is the threads_add_idle fix, which on my system works perfectly (I still have not produced a single crash with it). Indeed, any issue that is caused by GTK thread safety limitations should be completely eliminated by the use of threads_add_idle if the docs are to be believed, and I do believe them.

The fact that hangs still occur indicates that they are not caused by GTK threading issues at all. The fact that (at least on my system), the number of hangs was significantly reduced after the command_channel fix further corroborates that idea.

I therefore believe that your pull request fixing not only the crashes, but the hangs as well has nothing to do with threading at all. Instead, I suspect it to be related to the 2000 ms timeout, which basically gives the shell time to terminate before further action can occur. The hangs must be caused by a deadlock, as the process does not consume CPU time when it hangs. The timeout probably allows the lock to be released by preventing a race condition.

Alas, all this is becoming very difficult to test on my system as the command_channel fix has almost eliminated hangs for me. My testing procedure is as follows:

  1. Start Final Term
  2. Open 4 additional tabs (,tab 4)
  3. Press Ctrl+D to close them one by one

In the past 20 trials, closing 100 widgets in total, I have seen only a single hang (and no crashes). Do you know of a better way to reproduce hangs?

In any case, I don't believe it makes sense to pursue the "signal the main thread" angle any further – because we are signalling the main thread now. If main thread race conditions somehow still were the issue, this would be a bug in GTK, and while there of course are bugs in GTK, I don't think this is one. Instead, we should try to find the cause of the hangs (lock?). I'm going to fire up Nemiver now and try to understand what causes the lock. You seem to have done some research in this direction on your blog already. When it freezes now, does it still get stuck in the same lock?

onli commented 10 years ago

I don't think to use a fifo for thread communication is a hack - especially since that seems to run way smoother, at least on my machine (I mean not in regards to the hangs, but responsiveness).

But I see you point. Indeed, the documentation states that threads_add_idle is a safe way to signal the main thread what to do, so either we are triggering a gdk-bug, or I was just covering the real bug by accident.

I thought different because of this stacktrace (I used gdb): https://gist.github.com/onli/47dbca50bb83d4e2a31b For me this looked like threads_add_idle itself is running into the mutex deadlock, but sure, it is quite possible this is not correct, this is not my strong-suit.

When it freezes now, does it still get stuck in the same lock?

This (the stracktrace above) is where it hangs now, and this is where it hanged before when I tried to use threads_add_idle.

Do you know of a better way to reproduce hangs?

Same procedure, just opening 30 tabs instead (even in the beginning, I used at least 10)

I just got a hang caused by command_channel.read_unichar (In Terminal.vala:282), when opening the 4 tabs for the seconds time and trying to close them! Oh, and seems to be reproducible. I'm inclined to take this as a hint that you are correct, this may be channel related (and a bit strange, didn't see this before).

p-e-w commented 10 years ago

I must admit I'm a bit out of my depth here. I see the stacktrace alright, but what does it really mean? The thread seems to be stuck in a mutex lock, eventually caused by g_source_attach (the culprit seems to be this line: https://github.com/GNOME/glib/blob/master/glib/gmain.c#L1142). I'm not sure how to proceed from there. Will investigate further on the weekend.

onli commented 10 years ago

Did you have any luck? Or need some help (if I can provide)?

I am still irritated that my workaround seemed to work stable on my system but not on yours, maybe that says us something.

p-e-w commented 10 years ago

I am still irritated that my workaround seemed to work stable on my system but not on yours, maybe that says us something.

Indeed. All signs seem to point towards a race condition which might be affected by system speed and other external parameters.

To my shame I must confess that I did not find time to investigate during the past weekend as I had promised. I really do want to see this issue closed for good though so I will try my best to do it the upcoming weekend instead. :8ball:

p-e-w commented 10 years ago

I did a bit more debugging work this weekend but it feels like trying to get a picture of a dark room by repeatedly poking a stick into it.

I can now produce a hang only by opening and closing at least 50 tabs. I have never seen another crash. So while a tiny issue remains, and remains mysterious, #296 has been reduced from a show-stopper to a minor nuisance.

I am not going to merge this PR because the "sleep" approach feels unclean and creates other issues as described above. However, your key insight that there is a GTK thread safety issue underneath all this was the foundation for the simple and effective fix https://github.com/p-e-w/finalterm/commit/6ec39f8215b0732f52f74b0d96ed48abbffc1d9f. I don't think I could have guessed this problem source even if I had spent a year working on the bug. I therefore would like to award you the bounty I placed on #296, and invite you to claim it on Bountysource so I can do just that. :smiley: :moneybag:

onli commented 10 years ago

I'm really happy I could help a bit.

I therefore would like to award you the bounty I placed on #296

I happily accept, thank you :) I tried to claim it there.

p-e-w commented 10 years ago

I closed #296 per the instructions on Bountysource and you should now be able to claim the bounty.