p-e-w / finalterm

At last – a modern terminal emulator (NO LONGER MAINTAINED)
http://finalterm.org
GNU General Public License v3.0
3.84k stars 179 forks source link

wait after kill, seems to fix one cause of UI hang (#296) #306

Closed onli closed 10 years ago

onli commented 10 years ago

I don't claim this is a perfect fix. It is a workaround that seems to work on my machine, but you know how hairy those hard to reproduce issues are.

Everything following might be wrong: Posix.kill is async, and seems to be the issue of the ui hang after tab close. Waiting a bit gives the xserver time to send the signal, and the process to actually terminate, before finalterm continues with the close procedure. If that is not done, to my understanding, it might happen that the pid is already used for another process, which could be harmful. It could also be that finalterm happily continues with an unclosed shell-process, which could also be harmful.

My testing procedure was to repeatedly open 10 tabs and to close all of them in a row. Without the sleep, this almost never worked, with the sleep, it always worked.

It might be better to use waitpid instead; I didn't succeed with that.

p-e-w commented 10 years ago

I think you might be on to something here, but I actually get crashes and hangs also (mostly, in fact) when exiting the shell directly, in which case terminate_shell is never called.

To reproduce, spawn a lot of tabs (,tab 10), then repeatedly press Ctrl+D to close them one by one. Observe that terminate_shell is never called, yet there is usually a crash or hang.

In total, I have observed three different behaviors in #296, all undesirable:

At least the first two seem to occur independently from terminate_shell.

onli commented 10 years ago

Yep, definitely agree. There is more than one cause for crashes and hangs.

I will see if I can reproduce the Ctrl+D crashes, I used the x in the tab so far.

edit: yes, that hangs immediately or crashes, but also without the timeout.

onli commented 10 years ago

Short update: I reorganised the code a little bit, and in my VM at least (before it behaved a little bit different on my real system, more memory related crashes, we will see if that is all) hunted it down to this stacktrace: https://gist.github.com/onli/6e34b3d543c1dbbd524e. That is the notebook.page_remove in nested_container, and looks - at least with my limited knowledge in that area - like a bug in clutter/gtk.

Don't know what is the specific cause triggering this. I'm trying to rewrite that part to workaround the issue.

I had a hunch first that some other crashes had to do with the removal of the child from children when that is the main reference, but that lead nowhere so far. Might come next, but should have been present then in the tab removal anyway.

Still, the tab-close was stable for me all that time with the patch, I think you could merge that :)

onli commented 10 years ago

With the new fix, Ctrl + D seems to work stable (I had one random crash from the output-stream, but it worked many times without crash). Waiting for feedback :)

p-e-w commented 10 years ago

The good news is that with your changes, I have thus far been unable to produce any crash :fireworks:

The bad news is that this approach introduces a "wait time" between exiting a shell and the tab being closed, which is not really what we want from a UX perspective. Also, I have found that this leaves defunct shell processes around, which is not surprising since it removes the call to waitpid. On a side note, much of the changeset seems to be nonfunctional and should be removed; I have annotated the appropriate portions of the diff.

Overall, you are clearly on the right path as the crashes are gone, but given the new problems I'm afraid it's not quite there yet. Please don't give up though, I want to see a proper fix more than anything! My feeling is that without a full understanding of what really causes the problem, such a fix will be very hard to pull off. I agree with you that the cause might well lie in GTK+ (or Clutter, though I think that less likely). One hint could be that in older commits such as https://github.com/p-e-w/finalterm/commit/b34c51b739588bd0523b2f2379b1d96edf587038, the issue does not arise.

It's cool that you blogged about your efforts on https://www.onli-blogging.de/, BTW :smiley:

onli commented 10 years ago

My feeling is that without a full understanding of what really causes the problem, such a fix will be very hard to pull off

The working fix is based upon a theory, though I did not went far enough into the C-Code or the assembled product to check if it is correct: The Posix-Signals are probably not part of the GTK-Thread/are not catched inside. But it is invalid to manipulate GTK-objects from outside the main thread (and the functions which could word around that, gdk_threads_init, are deprecated, and didn't work anyway). The symptoms fit exactly to what I found described to what could happen.

But it could of course also be a side-effect of not removing the shell process (but then it should freeze as well when closing tabs), or a timing thing.

The bad news is that this approach introduces a "wait time" between exiting a shell and the tab being closed, which is not really what we want from a UX perspective.

The timer-waittime could of course be reduced, though that would rise cpu-usage a bit. Maybe there is a good way to instantly react on the change without polling explicitly? I didn't want to make the fix more complicated with stuff like that, but that sure should be something to consider in the future.

Also, I have found that this leaves defunct shell processes around, which is not surprising since it removes the call to waitpid.

The nice thing about this is that the waitpid-approach is not necessarily needed anymore, if I understand the architecture correctly. I will add a function to close the command_channel in Terminal, which was called using the waitpid-pid, called now from the close event in TerminalWidget.

On a side note, much of the changeset seems to be nonfunctional and should be removed

Sorry about that, I will cleanup the commit. How does that work properly - it is probably easiest to just send another, clean pull request?

It's cool that you blogged about your efforts on https://www.onli-blogging.de/, BTW :smiley:

Thanks :)

p-e-w commented 10 years ago

Maybe there is a good way to instantly react on the change without polling explicitly?

Have a look at http://valadoc.org/#!api=gdk-3.0/Gdk.threads_add_idle: _"This variant of add_full calls function with the GDK lock held. It can be thought of a MT-safe version for GTK+ widgets for the following use case, where you have to worry about idlecallback running in thread A and accessing self after it has been finalized in thread B:"

Maybe this is already the answer and makes polling obsolete. If your theory is true and the underlying cause is a threading problem, this should fix it. It's also not deprecated, unlike threads_init.

The nice thing about this is that the waitpid-approach is not necessarily needed anymore, if I understand the architecture correctly. I will add a function to close the command_channel in Terminal, which was called using the waitpid-pid, called now from the close event in TerminalWidget.

The main purpose of waitpid is not to provide a callback supplying the PID but to prevent the process from turning defunct ("zombie"). The docs at http://linux.die.net/man/2/waitpid indicate that "if a wait is not performed, then the terminated child remains in a "zombie" state", so there seems to be no alternative to it.

How does that work properly - it is probably easiest to just send another, clean pull request?

Yep, this would be best. You can also rebase on your machine and then force-push into the existing branch which should update this pull request; I see this happening on GitHub a lot.

onli commented 10 years ago

Maybe this is already the answer and makes polling obsolete.

I played a bit with it. It has two drawbacks:

  1. It leads to crashes (maybe my theory is wrong, or I'm calling it wrong)
  2. It still feels sloppy in the UI, probably because doing that when idle is exactly what one normally does not want -> it should rather have priority

The main purpose of waitpid is not to provide a callback supplying the PID but to prevent the process from turning defunct ("zombie")

Oh. And yes, they definitely stayed defunct. I integrated this into the cleaned pull request: https://github.com/p-e-w/finalterm/pull/311

p-e-w commented 10 years ago

Superseded by https://github.com/p-e-w/finalterm/pull/311.