gtkd-developers / GtkD

GtkD is a D binding and OO wrapper of GTK+ originally created by Antonio Monteiro
http://gtkd.org
Other
322 stars 71 forks source link

Memory leak with widgets that have event handlers #254

Closed gnunn1 closed 6 years ago

gnunn1 commented 6 years ago

I had a Tilix user report a pretty serious memory leak with the Tilix sidebar here: https://github.com/gnunn1/tilix/issues/1558

After debugging it I've determined that the issue happens when a widget has an event handler associated with it. This seems to prevent the D GC from reclaiming the object and freeing it. If there is no event handlers in the widget everything works fine.

I've created a simple reproducer for you so you don't have to deal with the large Tilix code base to track this down. The reproducer is located here:

https://github.com/gnunn1/gctest

To reproduce the issue, compile the application using the following command:

dub build --config=trace

Notice in the destructor of the ListBoxRow I have added a line to log when the D destructor is called here:

https://github.com/gnunn1/gctest/blob/master/source/gctest.d#L129

After compiling the program run it in a terminal so you can see the output. The application shows a window with a headerbar and a listbox with some rows in it. In the headerbar are two buttons: "Recreate Rows" and "GC". If you click the "Recreate Rows" button it will remove the existing rows and replace them with a new set. Clicking the "GC" button invokes D's GC.

Press the "Recreate Rows" once and then press the "GC" button a couple of times, notice there is no output from the destructor.

Next, comment out the event handler for the button in the GCListBoxRow at line 129 (https://github.com/gnunn1/gctest/blob/master/source/gctest.d#L129) and re-compile it and then run it again. Repeat the same exercise as above and you will see that the destructor is now being invoked.

If you need any additional information just let mw know.

MikeWey commented 6 years ago

It looks like the GC.addRoot in the DClosure that is keeping the delegate alive for as long as the CLosure is alive, is causing the issue.

gnunn1 commented 6 years ago

Thanks, I can see that now. As a workaround if I explicitly call the destroy method that will cause it to be GC'ed.

One thing I was wondering about, rather then doing GC.addRoot, would it be possible for ObjectG to manage references to the DClosure itself and thus tie it's lifecycle to the object? For example, I was thinking something along these lines pseudo code wise:

class ObjectG {
  private DClosure closures[];

  ...
}

And then in Signals.connect:

public static gulong connect(T)(ObjectG instance, string detailedSignal, T callback, ConnectFlags connectFlags = cast(ConnectFlags)0)
    if ( isCallable!T && !is(T == GCallback) )
    {
        bool after = (connectFlags & ConnectFlags.AFTER) != false;
        bool swap = (connectFlags & ConnectFlags.SWAPPED) != false;
                DClosure closure = new DClosure(callback, swap);
                instance.closures ~=closure;

        return Signals.connectClosure(instance, detailedSignal, closure, after);
    }

Obviously signals disconnect would need to be updated to remove the closure from the object instance as well.

In this way the lifecycle of the closures are tied to the object instance instead of separate instance entities. I'm not knowledgeable about this as you but does that make any kind of sense?

MikeWey commented 6 years ago

Commit c1ed569 does exactly that, it fixes the memory issue for the small test case. Does it work for Tilix?

gnunn1 commented 6 years ago

I've tried the change, while tilix initially runs unfortunately I am getting segfaults in other event handlers. Specifically if I run Tilix and then show the Preferences dialog it triggers a FocusOut event which is handled by the following delegate:

https://github.com/gnunn1/tilix/blob/master/source/gx/tilix/appwindow.d#L1732

The backtracebelow from gdb shows it failing on the check for gsSettings. If I insert some debug code into the delgate I can see the delegate itself seems to be there but the reference to gsSettings is bad. A check for null on it passes but then then fails at line 1733 where it tries to load the boolean value.

gsSettings is a member of the AppWindow class and it shouldn't be getting GC'ed since the AppWindow instance is holding a reference to it.

Thread 1 "tilix" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in  ()
#1  0x000055555592aff4 in _D2gx5tilix9appwindow9AppWindow6__ctorMFC3gtk11ApplicationQnbZ12__dgliteral7MFC3gdk5EventQgCQBz6WidgetQhZb
    (e=0x7ffff76243a0, widget=0x7ffff7584400) at /home/gnunn/Development/dlang/tilix/source/gx/tilix/appwindow.d:1733
#2  0x0000555555b6a41a in _D7gobject8DClosureQj__T17d_closure_marshalTDFC3gdk5EventQgC3gtk6WidgetQhZbZQCaUPSQDc1c5types8GClosurePSQDyQwQw6GValuekQrPvQcZv
    (closure=0x555556739210, return_value=0x7fffffffda10, n_param_values=2, param_values=0x7ffff75dca40, invocation_hint=0x7fffffffd9f0 "J", marshal_data=0x0) at DClosure.d:114
#3  0x00007ffff6c263d5 in g_closure_invoke () at /usr/lib/libgobject-2.0.so.0
#4  0x00007ffff6c12c78 in  () at /usr/lib/libgobject-2.0.so.0
#5  0x00007ffff6c165ed in g_signal_emit_valist () at /usr/lib/libgobject-2.0.so.0
#6  0x00007ffff6c17a80 in g_signal_emit () at /usr/lib/libgobject-2.0.so.0
#7  0x00007ffff59431f5 in  () at /usr/lib/libgtk-3.so.0
#8  0x00007ffff57fd0d8 in gtk_main_do_event () at /usr/lib/libgtk-3.so.0
#9  0x00007ffff54faa06 in  () at /usr/lib/libgdk-3.so.0
#10 0x00007ffff552c755 in  () at /usr/lib/libgdk-3.so.0
#11 0x00007ffff6b3f3cf in g_main_context_dispatch () at /usr/lib/libglib-2.0.so.0
#12 0x00007ffff6b40f89 in  () at /usr/lib/libglib-2.0.so.0
#13 0x00007ffff6b40fce in g_main_context_iteration () at /usr/lib/libglib-2.0.so.0
#14 0x00007ffff6ce57ee in g_application_run () at /usr/lib/libgio-2.0.so.0
#15 0x0000555555a92f96 in _D3gio11ApplicationQn3runMFAAyaZi (this=0x7ffff7583200, argv=...) at Application.d:931
#16 0x000055555593f6b5 in D main (args=...) at /home/gnunn/Development/dlang/tilix/source/app.d:152

Also, when I compile Tilix as a debug build, I insert a menu entry that calls the D garbage collector:

core.memory.GC.collect();

This generates the following backtrace in gdb:

Thread 1 "tilix" received signal SIGSEGV, Segmentation fault.
0x000055555592af48 in _D2gx5tilix9appwindow9AppWindow6__ctorMFC3gtk11ApplicationQnbZ__T12__dgliteral6TCQBo6WidgetQhZQBdMFPS3gdk1c5types12GdkRectangleQBvZv (rect=0x7fffffffd510, Widget=0x7ffff7584400) at /home/gnunn/Development/dlang/tilix/source/gx/tilix/appwindow.d:1724
1724                        isBGImage.destroy();
(gdb) bt
#0  0x000055555592af48 in _D2gx5tilix9appwindow9AppWindow6__ctorMFC3gtk11ApplicationQnbZ__T12__dgliteral6TCQBo6WidgetQhZQBdMFPS3gdk1c5types12GdkRectangleQBvZv (rect=0x7fffffffd510, Widget=0x7ffff7584400) at /home/gnunn/Development/dlang/tilix/source/gx/tilix/appwindow.d:1724
#1  0x0000555555b79e3d in _D7gobject8DClosureQj__T17d_closure_marshalTDFPS3gdk1c5types12GdkRectangleC3gtk6WidgetQhZvZQCpUPSQDrQBwQBx8GClosurePSQElQCqQCr6GValuekQtPvQcZv
    (closure=0x5555567389e0, return_value=0x0, n_param_values=2, param_values=0x7ffff757d0c0, invocation_hint=0x7fffffffd1a0 "*", marshal_data=0x0)
    at DClosure.d:114
#2  0x00007ffff6c263d5 in g_closure_invoke () at /usr/lib/libgobject-2.0.so.0
#3  0x00007ffff6c12c78 in  () at /usr/lib/libgobject-2.0.so.0
#4  0x00007ffff6c1701e in g_signal_emit_valist () at /usr/lib/libgobject-2.0.so.0
#5  0x00007ffff6c17a80 in g_signal_emit () at /usr/lib/libgobject-2.0.so.0
#6  0x00007ffff594d740 in gtk_widget_size_allocate_with_baseline () at /usr/lib/libgtk-3.so.0
#7  0x00007ffff595e6a7 in  () at /usr/lib/libgtk-3.so.0
#8  0x00007ffff6c1716e in g_signal_emit_valist () at /usr/lib/libgobject-2.0.so.0
#9  0x00007ffff6c17a80 in g_signal_emit () at /usr/lib/libgobject-2.0.so.0
#10 0x00007ffff57289d1 in  () at /usr/lib/libgtk-3.so.0
#11 0x00007ffff6c263d5 in g_closure_invoke () at /usr/lib/libgobject-2.0.so.0
#12 0x00007ffff6c13195 in  () at /usr/lib/libgobject-2.0.so.0
#13 0x00007ffff6c1701e in g_signal_emit_valist () at /usr/lib/libgobject-2.0.so.0
#14 0x00007ffff6c17a80 in g_signal_emit () at /usr/lib/libgobject-2.0.so.0
#15 0x00007ffff550431b in  () at /usr/lib/libgdk-3.so.0
#16 0x00007ffff54eeb2b in  () at /usr/lib/libgdk-3.so.0
#17 0x00007ffff6b3eb63 in  () at /usr/lib/libglib-2.0.so.0
#18 0x00007ffff6b3f271 in g_main_context_dispatch () at /usr/lib/libglib-2.0.so.0
#19 0x00007ffff6b40f89 in  () at /usr/lib/libglib-2.0.so.0
#20 0x00007ffff6b40fce in g_main_context_iteration () at /usr/lib/libglib-2.0.so.0
#21 0x00007ffff6ce57ee in g_application_run () at /usr/lib/libgio-2.0.so.0
#22 0x0000555555a92f96 in _D3gio11ApplicationQn3runMFAAyaZi (this=0x7ffff7583200, argv=...) at Application.d:931
#23 0x000055555593f6b5 in D main (args=...) at /home/gnunn/Development/dlang/tilix/source/app.d:152

Similar to the onFocusOut handler, it looks like it is segfaulting a referencing a variable that is scoped outside the delegate.

gnunn1 commented 6 years ago

I've updated the gctest repo to reproduce the problem, simply trying to reference lblText in the button click handler causes the same issue. If you press the GC button a couple of times and then click the button in one of the rows you will see it segfault.

https://github.com/gnunn1/gctest/blob/master/source/gctest.d#L121

MikeWey commented 6 years ago

I've added addRange and removeRange to the DClosure so the D GC will scan the struct that was allocated by GLib, this fixes the segfaults but the original issue returns.

addRange is a slight improvement over the previous way that addRoot was used since we now the GC will only scan the DClosure when it's referenced from D.

For the original issue: GtkD hold a C reference on the object and adds it as a GC root when the refcount is grater or equal to 2. When the test adds the new rows the refcount of the old rows go's down to 1, removing it as a GC root and the GC should than be free to collect it. Only the ListBoxRow still holds a reference to the added Button, the refCount for the button is now 2 and it's still a GC root. The Button also holds the DClosure that is used for the onClicked signal, the DClosure holds the delegate used, and the delegate references the ListBoxRow keeping it alive.

gnunn1 commented 6 years ago

I'm not sure I'm quite following, why is the button ref count 2?

Is it because the ListBoxRow holds a reference to it as it's parent or because the closure does? If it's a case of the ListBoxRow holding a reference to the button, why isn't there the same issue when no signals are involved? If it's a case of the closure holding a reference to it, why is the closure incrementing the reference of the button, does it need to or should it be using something like a weak reference?

Also, looking at the source code at the line below, why do we need to remove root from the GClosure, I'm not seeing where it gets added to root? It doesn't look like there is a corresponding ObjectG for it where the root would be added and doing a search of the code the only addRoot I can see is in ObjectG though I may be missing something.

https://github.com/gtkd-developers/GtkD/blob/master/generated/gtkd/gobject/DClosure.d#L101

DClosure does wrap GClosure but it doesn't descend from ObjectG as far as I can tell and thus I'm not seeing how it gets added to root.

MikeWey commented 6 years ago

For the button, both the listboxrow and gtkd are holding a reference. The issue only heapens with signals because the bottom has a DClosure which hold a delegate which references the listbox row. And since the button is a GC root when its reference count is grater then 1, we have a cycle with one GC root in it.

Without the signal the gc would collect the listboxrow, bringing the ref count for the button down to 1, removing it as a root and then it can be collected.

We need to either store a reference to a child widget in its parent and not add it as a GC root. Or we need some way to break the cycle when the ref count of the container go's down to 1.

The remove root in the destructor should have been removeRange.

gnunn1 commented 6 years ago

I'm still a little confused (not unusual!), I'm not seeing why adding the event (signal) makes a difference to it being GC'ed. I may not be following what you are saying above, but I'm not seeing how the reference count on the button being held by the listboxrow is a problem since that should be the same regardless of whether we have the event or not. I'm also not seeing the issue as being the delegate holding a reference to the lisboxrow, the GC should be able to collect a graph of objects so once the listboxrow goes out of scope everything associated with it should be collected.

Instead I think the issue is the addRange in DClosure preventing the GC and I'm not sure why this is needed. It looks like you are preventing the memory associated with gGlosure from being GC'ed by using addRange however the base Closure class already has a reference to the pointer. I guess that's not enough to prevent the memory from being GC'ed and if that's the case is there a way to simply maintain a reference to the block of memory as an instance variable instead of using addRange?

The whole idea I was trying to get at in my original comment is to not use addRoot/addRange for DClosure but rather tie it to the ObjectG lifecycle via instance variables. I was thinking this would allow the GC to naturally collect the graph of objects once the ObjectG expires via it's own ref counting and the removeRoot that your are managing.

gnunn1 commented 6 years ago

And looking at the code some more for DClosure, at this line here a GClosure is allocated:

https://github.com/gtkd-developers/GtkD/blob/master/generated/gtkd/gobject/DClosure.d#L85

Is the variable here shadowing the gClosure field in the base Closure class?

https://github.com/gtkd-developers/GtkD/blob/master/generated/gtkd/gobject/Closure.d#L84

gnunn1 commented 6 years ago

Ignore that last comment, I didn't follow it all the way through where it was getting passed in the super and set it in the base class.

I also tried removing the addRange and removeRange, worked fine in gctest but immediately started segfaulting in tilix. Looking into it a bit more.

MikeWey commented 6 years ago

I'm still a little confused (not unusual!), I'm not seeing why adding the event (signal) makes a difference to it being GC'ed. I may not be following what you are saying above, but I'm not seeing how the reference count on the button being held by the listboxrow is a problem since that should be the same regardless of whether we have the event or not. I'm also not seeing the issue as being the delegate holding a reference to the lisboxrow, the GC should be able to collect a graph of objects so once the listboxrow goes out of scope everything associated with it should be collected.

The biggest issue is that GtkD will make the button a GC root when it's ref count is grater than 2, this is to make sure the Button class isn't collected while it's still used by GTK and it's signal handlers might be called. The GC won't collect a root or anything referenced by a root.

Instead I think the issue is the addRange in DClosure preventing the GC and I'm not sure why this is needed. It looks like you are preventing the memory associated with gGlosure from being GC'ed by using addRange however the base Closure class already has a reference to the pointer. I guess that's not enough to prevent the memory from being GC'ed and if that's the case is there a way to simply maintain a reference to the block of memory as an instance variable instead of using addRange?

Without the addRange the delegate was collected while it was still in use, because the GC doesn't scan memory allocated by C / GTK. My understanding that a range wasn't treated as a root does seem to be incorrect looking at the GC implementation. Only removing the add/remove range and storing a reference to the delegate directly in the class doesn't fix the issue.

The whole idea I was trying to get at in my original comment is to not use addRoot/addRange for DClosure but rather tie it to the ObjectG lifecycle via instance variables. I was thinking this would allow the GC to naturally collect the graph of objects once the ObjectG expires via it's own ref counting and the removeRoot that your are managing.

The idea of the add/remove root is to tell the GC to scan the struct allocated by GTK for as long as the DClosure is alive hence removing it in the destructor. The lifecycle of the delegate is tied to the lifecycle of DClosure (using addRange), and the lifecycle of the DClosure is tied to the object it's connected to. (the button in this case.)

gnunn1 commented 6 years ago

Thanks for the explanation. When I comment out the addRange and removeRange, in gctest everything gets GC'ed correctly. Now obviously in tilix it's doing GC too aggressively and I suspect it's GC'ing the anonymous delegate. But the point is if the button being ref counted to 2 and having an addRoot was the issue, commenting out addRange in DClosure would not have any effect at on the GC. That's my thinking anyway.

If I understand correctly, the addRange is protecting the anonymous delegate which gets GC'ed and hence the segfault when the callback is triggered. The callback is added to this line:

auto dClosure = cast(DGClosure!(T)*)gClosure;
dClosure.callback = callback;

However dClosure is scoped to the method, so once it goes out of scope so does the reference to the anonymous delegate. Would adding an instance variable for the delegate to the DClosure class potentially help with this? I'm fooling around with it but struggling a bit with trying to create a generic delegate reference variable as void* dlgRef won't work. Sometimes being a Java developer is tough :)

Also, if I'm going way off a tangent and it's getting annoying just let me know, no worries :)

MikeWey commented 6 years ago

If I understand correctly, the addRange is protecting the anonymous delegate which gets GC'ed and hence the segfault when the callback is triggered.

Correct :)

Would adding an instance variable for the delegate to the DClosure class potentially help with this? I'm fooling around with it but struggling a bit with trying to create a generic delegate reference variable as void* dlgRef won't work. Sometimes being a Java developer is tough :)

private void delegate() dlg;

...

dlg = cast(void delegate())callback;

Only that doesn't fix the issue, the ListBoxRow and the buttons are still not collected.

gnunn1 commented 6 years ago

Yeah, was on IRC asking about this and someone pointed me to saving the point in void* via callback.ptr, I see the same behavior as you do though, it works exactly like addRange and nothing is GC'ed. It surprises me because I would have thought the reference in the DClosure would allow the GC to collect everything as a graph, I'm a bit stumped why that's not working?

I'm assuming we can't use a GClosure finalizer or notification of destruction because it's refcount never changes in this situation correct? This is why ObjectG.destroy works since it cleans up all the event handlers and gclosures by artificially setting the refcounts to 0.

I'm out of ideas at the moment, maybe if I let it percolate a bit something will come up.

MikeWey commented 6 years ago

If i diagnosed things correctly we currently have this cycle:

ListBoxRow -> Button -> DClosure -> Delegate -> ListBoxRow

Where the Button is currently a GC root.

gnunn1 commented 6 years ago

I guess that's where I'm confused, why does the DClosure make a difference, isn't the Button always have addRoot or is DClosure bumping the buttons reference count? I'm not seeing the connection between the two.

i.e. Whether it is this:

ListBoxRow -> Button -> DClosure -> Delegate -> ListBoxRow

Or this:

ListBoxRow -> Button

The reference count on the button is the same either way, what am I missing?

MikeWey commented 6 years ago

The refCount of the ListBoxRow is 1, so the GC should be able to collect it, when that happens, the reference it holds on the button will be removed, so the GC should be able to collect the Button in the next cycle.

My thought was that the hidden this pointer of the delegate references the ListBoxRow keeping the GC from collecting it, and when the ListBoxRow isn't collected the Button will say a GC root.

but i'm not sure anymore, i've changed to code so that widgets that are added to a container are no longer a GC root, and i changed DClosure to store the reference to the delegate in a void* field instead of using add/remove range. And the issue of not collecting the ListBoxRow persists.

MikeWey commented 6 years ago

With these changes: https://gist.github.com/MikeWey/db74966fd4293b5ab7a72047d97a934a The test case seems to working properly.

The first halve of this blog post about GJS also gives a good explanation on how we currently handle the GC and the refCouts in GtkD: https://ptomato.wordpress.com/2018/11/06/taking-out-the-garbage/

gnunn1 commented 6 years ago

I tried your changes in gctest and it worked well there, however in Tilix while there is no segfault the listboxrow destructor never seems to get called even when I explicitly call destroy. Having said that I'm running out of steam tonite and I may be missing something with all the other fooling around I've been doing in this area, I'll look into it in more detail in the next day or two. Thanks for all your effort on this.

gnunn1 commented 6 years ago

It looks like your code does indeed fix the problem. The issue in Tilix is my SideBarListRow holds a reference to the SideBar which various delegates use to call methods on the sidebar. If I comment out the references to the sidebar it works fine, seems like this is preventing the GC from doing it's work.

I'll look at changing my code to fix this. If you merge your change into GtkD master I'll build an updated version against it and test it out for a week or so. If everything looks good I'll ping you to release a new proper version of GtkD into Dub. Does that make sense to you?

MikeWey commented 6 years ago

Sounds good.

MikeWey commented 6 years ago

The changes are in GtkD 3.8.4, hopefully you can find whats going on with the SideBarListRow.