square / leakcanary

A memory leak detection library for Android.
https://square.github.io/leakcanary
Apache License 2.0
29.41k stars 3.97k forks source link

LeakCanary should surface distinct leak paths to the same instance as separate leaks #2502

Open bubenheimer opened 1 year ago

bubenheimer commented 1 year ago

Problem description

I was tracking down an Android Service leak, but the leak trace that LeakCanary gave me was not particularly helpful. It was only when I finally added my own instanceFieldLeak descriptors that I found the actual leak. (And I had to exclude not just one but two other paths before LeakCanary finally turned up the meaningful one.). There was a total of 4 distinct reference paths that I had to turn up one by one by adding instanceFieldLeak descriptors one at a time. 3 of them did not matter quite so much, as they simply caused delayed finalization. However, at heap dump time, all 4 were still present, and LeakCanary somewhat arbitrarily picked one that was not helpful.

Potential solutions

Add an option to display all leak reference paths, at the cost of log clutter.

bubenheimer commented 1 year ago

Or did LeakCanary deprioritize the most relevant leak path because it misinterpreted the situation? In this case the path included a WeakHashMap, and the leaked value was inside one of the map values. Only the keys of a WeakHashMap are weak, not the values. Does LeakCanary deprioritize a specific path just because it goes through a WeakHashMap in some way?

pyricau commented 1 year ago

Could you share your specific example, what the leaky code looked like, what paths LeakCanary was showing you and which path turned out to be useful?

LeakCanary shows a single path because, if an instance is held due to a single bug where we forgot to clear a field then you're generally guaranteed that all possible paths will go through that field. So we just pick one, and we pick the shortest to remove noise. Now, if there are multiple fields causing the leak, then you get only one path, fix it, find the leak again with another path, fix it, etc.

LeakCanary ignores WeakReferences because WeakReferences do not prevent garbage collection, they don't cause leaks, unless you constantly call get().

bubenheimer commented 1 year ago

The leaky code was in androidx.core.hardware.display.DisplayManagerCompat: https://github.com/aosp-mirror/platform_frameworks_support/blob/a9ac247af2afd4115c3eb6d16c05bc92737d6305/compat/src/main/java/androidx/core/hardware/display/DisplayManagerCompat.java#L34-L35

The problem here was that the values of a WeakHashMap transitively referenced the keys (android.content.Context), causing the Contexts to leak.

I was using DisplayManagerCompat from a MediaBrowserServiceCompat instance, causing the Service/Context to leak. However, the actual leak only appeared after excluding two other patterns that were not real leaks. Once I threw out DisplayManagerCompat there was no more leak.

I created an inaccurate bug report against MediaBrowserServiceCompat years ago because of this. It was ignored, possibly because of the invalid leak trace. I was only aware of the first trace below at the time.

Here are the LeakCanary logs, in order of successive pattern exclusions:

Pseudo-leak no. 1:

====================================
HEAP ANALYSIS RESULT
====================================
1 APPLICATION LEAKS

References underlined with "~~~" are likely causes.
Learn more at https://squ.re/leaks.

Signature: 881c4404fc048402f456764f134ed07390863849
┬───
│ GC Root: Global variable in native code
│
├─ android.os.Handler$MessengerImpl instance
│    Leaking: UNKNOWN
│    ↓ Handler$MessengerImpl.this$0
│                            ~~~~~~
├─ androidx.media.MediaBrowserServiceCompat$ServiceHandler instance
│    Leaking: UNKNOWN
│    this$0 instance of com.bubenheimer.rucksack.d.CW
│    ↓ MediaBrowserServiceCompat$ServiceHandler.this$0
│                                               ~~~~~~
╰→ com.bubenheimer.rucksack.d.CW instance
​     Leaking: YES (ObjectWatcher was watching this because com.bubenheimer.rucksack.d.CW received Service#onDestroy()
​     callback and Service not held by ActivityThread)
​     key = 0cda35fa-fa79-4035-89ca-6ab496bd1ddc
​     watchDurationMillis = 5400
​     retainedDurationMillis = 392
​     mApplication instance of com.bubenheimer.rucksack.d.D
​     mBase instance of android.app.ContextImpl
====================================
0 LIBRARY LEAKS

A Library Leak is a leak caused by a known bug in 3rd party code that you do not have control over.
See https://square.github.io/leakcanary/fundamentals-how-leakcanary-works/#4-categorizing-leaks
====================================
0 UNREACHABLE OBJECTS

An unreachable object is still in memory but LeakCanary could not find a strong reference path
from GC roots.
====================================
METADATA

Please include this in bug reports and Stack Overflow questions.

Build.VERSION.SDK_INT: 33
Build.MANUFACTURER: Google
LeakCanary version: 2.10
App process name: com.bubenheimer.rucksack
Class count: 28324
Instance count: 179717
Primitive array count: 132390
Object array count: 25025
Thread count: 39
Heap total bytes: 27055417
Bitmap count: 0
Bitmap total bytes: 0
Large bitmap count: 0
Large bitmap total bytes: 0
Db 1: open /data/user/0/com.bubenheimer.rucksack/databases/com.google.android.datatransport.events
Db 2: open /data/user/0/com.bubenheimer.rucksack/databases/rucksack-5031356898991293088.db
Stats: LruCache[maxSize=3000,hits=32348,misses=41728,hitRate=43%]
RandomAccess[bytes=1991236,reads=41728,travel=35597905162,range=32212305,size=40115227]
Heap dump reason: 1 retained objects, app is visible
Analysis duration: 16991 ms
Heap dump file path: /storage/emulated/0/Download/leakcanary-com.bubenheimer.rucksack/2023-04-25_11-53-35_598.hprof
Heap dump timestamp: 1682438042701
Heap dump duration: 4546 ms
====================================

After excluding this pattern I saw pseudo-leak no. 2:

====================================
HEAP ANALYSIS RESULT
====================================
1 APPLICATION LEAKS

References underlined with "~~~" are likely causes.
Learn more at https://squ.re/leaks.

Signature: bd2544d6cfbec1945913f9cc40eef66629d55e5c
┬───
│ GC Root: Global variable in native code
│
├─ android.os.Handler$MessengerImpl instance
│    Leaking: UNKNOWN
│    ↓ Handler$MessengerImpl.this$0
│                            ~~~~~~
├─ androidx.media.MediaBrowserServiceCompat$ServiceHandler instance
│    Leaking: UNKNOWN
│    this$0 instance of com.bubenheimer.rucksack.d.CW
│    ↓ MediaBrowserServiceCompat$ServiceHandler.mServiceBinderImpl
│                                               ~~~~~~~~~~~~~~~~~~
├─ androidx.media.MediaBrowserServiceCompat$ServiceBinderImpl instance
│    Leaking: UNKNOWN
│    this$0 instance of com.bubenheimer.rucksack.d.CW
│    ↓ MediaBrowserServiceCompat$ServiceBinderImpl.this$0
│                                                  ~~~~~~
╰→ com.bubenheimer.rucksack.d.CW instance
​     Leaking: YES (ObjectWatcher was watching this because com.bubenheimer.rucksack.d.CW received Service#onDestroy()
​     callback and Service not held by ActivityThread)
​     key = 94bd2860-a839-43cc-a6b4-a39048dec104
​     watchDurationMillis = 5325
​     retainedDurationMillis = 311
​     mApplication instance of com.bubenheimer.rucksack.d.D
​     mBase instance of android.app.ContextImpl
====================================
0 LIBRARY LEAKS

A Library Leak is a leak caused by a known bug in 3rd party code that you do not have control over.
See https://square.github.io/leakcanary/fundamentals-how-leakcanary-works/#4-categorizing-leaks
====================================
0 UNREACHABLE OBJECTS

An unreachable object is still in memory but LeakCanary could not find a strong reference path
from GC roots.
====================================
METADATA

Please include this in bug reports and Stack Overflow questions.

Build.VERSION.SDK_INT: 33
Build.MANUFACTURER: Google
LeakCanary version: 2.10
App process name: com.bubenheimer.rucksack
Class count: 28230
Instance count: 179565
Primitive array count: 131995
Object array count: 24963
Thread count: 38
Heap total bytes: 26999036
Bitmap count: 0
Bitmap total bytes: 0
Large bitmap count: 0
Large bitmap total bytes: 0
Db 1: open /data/user/0/com.bubenheimer.rucksack/databases/com.google.android.datatransport.events
Db 2: open /data/user/0/com.bubenheimer.rucksack/databases/rucksack-5031356898991293088.db
Stats: LruCache[maxSize=3000,hits=43969,misses=59257,hitRate=42%]
RandomAccess[bytes=2851906,reads=59257,travel=40817397543,range=32118446,size=40035004]
Heap dump reason: 1 retained objects, app is visible
Analysis duration: 9524 ms
Heap dump file path: /storage/emulated/0/Download/leakcanary-com.bubenheimer.rucksack/2023-04-25_13-18-11_366.hprof
Heap dump timestamp: 1682443116729
Heap dump duration: 3372 ms
====================================

After excluding this pattern I saw the actual leak:

====================================
HEAP ANALYSIS RESULT
====================================
1 APPLICATION LEAKS

References underlined with "~~~" are likely causes.
Learn more at https://squ.re/leaks.

Signature: b87288fdb06a64ce8ae25a6e2ee72be916be84a3
┬───
│ GC Root: Thread object
│
├─ android.net.ConnectivityThread instance
│    Leaking: NO (PathClassLoader↓ is not leaking)
│    Thread name: 'ConnectivityThread'
│    ↓ Thread.contextClassLoader
├─ dalvik.system.PathClassLoader instance
│    Leaking: NO (DisplayManagerCompat↓ is not leaking and A ClassLoader is never leaking)
│    ↓ ClassLoader.runtimeInternalObjects
├─ java.lang.Object[] array
│    Leaking: NO (DisplayManagerCompat↓ is not leaking)
│    ↓ Object[955]
├─ androidx.core.hardware.display.DisplayManagerCompat class
│    Leaking: NO (a class is never leaking)
│    ↓ static DisplayManagerCompat.sInstances
│                                  ~~~~~~~~~~
├─ java.util.WeakHashMap instance
│    Leaking: UNKNOWN
│    ↓ WeakHashMap.table
│                  ~~~~~
├─ java.util.WeakHashMap$Entry[] array
│    Leaking: UNKNOWN
│    ↓ WeakHashMap$Entry[0]
│                       ~~~
├─ java.util.WeakHashMap$Entry instance
│    Leaking: UNKNOWN
│    referent instance of com.bubenheimer.rucksack.d.CW
│    ↓ WeakHashMap$Entry.value
│                        ~~~~~
├─ androidx.core.hardware.display.DisplayManagerCompat instance
│    Leaking: UNKNOWN
│    mContext instance of com.bubenheimer.rucksack.d.CW
│    ↓ DisplayManagerCompat.mContext
│                           ~~~~~~~~
╰→ com.bubenheimer.rucksack.d.CW instance
​     Leaking: YES (ObjectWatcher was watching this because com.bubenheimer.rucksack.d.CW received Service#onDestroy()
​     callback and Service not held by ActivityThread)
​     key = a4f52566-50ae-40ca-a824-5088d4f860ff
​     watchDurationMillis = 5383
​     retainedDurationMillis = 373
​     mApplication instance of com.bubenheimer.rucksack.d.D
​     mBase instance of android.app.ContextImpl
====================================
0 LIBRARY LEAKS

A Library Leak is a leak caused by a known bug in 3rd party code that you do not have control over.
See https://square.github.io/leakcanary/fundamentals-how-leakcanary-works/#4-categorizing-leaks
====================================
0 UNREACHABLE OBJECTS

An unreachable object is still in memory but LeakCanary could not find a strong reference path
from GC roots.
====================================
METADATA

Please include this in bug reports and Stack Overflow questions.

Build.VERSION.SDK_INT: 33
Build.MANUFACTURER: Google
LeakCanary version: 2.10
App process name: com.bubenheimer.rucksack
Class count: 28324
Instance count: 179379
Primitive array count: 132254
Object array count: 24990
Thread count: 40
Heap total bytes: 27086589
Bitmap count: 0
Bitmap total bytes: 0
Large bitmap count: 0
Large bitmap total bytes: 0
Db 1: open /data/user/0/com.bubenheimer.rucksack/databases/com.google.android.datatransport.events
Db 2: open /data/user/0/com.bubenheimer.rucksack/databases/rucksack-11196846344382804225.db
Stats: LruCache[maxSize=3000,hits=59221,misses=110080,hitRate=34%]
RandomAccess[bytes=5208357,reads=110080,travel=53383919647,range=32215420,size=40138057]
Heap dump reason: 1 retained objects, app is visible
Analysis duration: 12077 ms
Heap dump file path: /storage/emulated/0/Download/leakcanary-com.bubenheimer.rucksack/2023-04-25_13-28-14_274.hprof
Heap dump timestamp: 1682443720195
Heap dump duration: 6902 ms
====================================

After excluding this pattern, I saw yet another pseudo-leak:

====================================
HEAP ANALYSIS RESULT
====================================
1 APPLICATION LEAKS

References underlined with "~~~" are likely causes.
Learn more at https://squ.re/leaks.

Signature: 80f093d135adc2b278b02ca2f66fbd46aaa36bea
┬───
│ GC Root: Global variable in native code
│
├─ android.support.v4.media.session.MediaSessionCompat$MediaSessionImplApi21$ExtraSession instance
│    Leaking: UNKNOWN
│    ↓ MediaSessionCompat$MediaSessionImplApi21$ExtraSession.this$0
│                                                            ~~~~~~
├─ android.support.v4.media.session.MediaSessionCompat$MediaSessionImplApi29 instance
│    Leaking: UNKNOWN
│    ↓ MediaSessionCompat$MediaSessionImplApi21.mSessionFwk
│                                               ~~~~~~~~~~~
├─ android.media.session.MediaSession instance
│    Leaking: UNKNOWN
│    mContext instance of com.bubenheimer.rucksack.d.CW
│    ↓ MediaSession.mContext
│                   ~~~~~~~~
╰→ com.bubenheimer.rucksack.d.CW instance
​     Leaking: YES (ObjectWatcher was watching this because com.bubenheimer.rucksack.d.CW received Service#onDestroy()
​     callback and Service not held by ActivityThread)
​     key = 1579d3c8-c5eb-48af-8fd2-c2f56aafeee0
​     watchDurationMillis = 5296
​     retainedDurationMillis = 291
​     mApplication instance of com.bubenheimer.rucksack.d.D
​     mBase instance of android.app.ContextImpl
====================================
0 LIBRARY LEAKS

A Library Leak is a leak caused by a known bug in 3rd party code that you do not have control over.
See https://square.github.io/leakcanary/fundamentals-how-leakcanary-works/#4-categorizing-leaks
====================================
0 UNREACHABLE OBJECTS

An unreachable object is still in memory but LeakCanary could not find a strong reference path
from GC roots.
====================================
METADATA

Please include this in bug reports and Stack Overflow questions.

Build.VERSION.SDK_INT: 33
Build.MANUFACTURER: Google
LeakCanary version: 2.10
App process name: com.bubenheimer.rucksack
Class count: 28160
Instance count: 179043
Primitive array count: 131777
Object array count: 24898
Thread count: 38
Heap total bytes: 26946918
Bitmap count: 0
Bitmap total bytes: 0
Large bitmap count: 0
Large bitmap total bytes: 0
Db 1: open /data/user/0/com.bubenheimer.rucksack/databases/com.google.android.datatransport.events
Db 2: open /data/user/0/com.bubenheimer.rucksack/databases/rucksack-11196846344382804225.db
Stats: LruCache[maxSize=3000,hits=43936,misses=59219,hitRate=42%]
RandomAccess[bytes=2850070,reads=59219,travel=40826114938,range=32048130,size=39959274]
Heap dump reason: 1 retained objects, app is visible
Analysis duration: 9645 ms
Heap dump file path: /storage/emulated/0/Download/leakcanary-com.bubenheimer.rucksack/2023-04-25_13-53-41_173.hprof
Heap dump timestamp: 1682445242318
Heap dump duration: 3480 ms
====================================
bubenheimer commented 1 year ago

Oh, and actually, those pseudo-leaks did not actually cause delayed finalization, finalization behavior was normal after addressing the real leak. I don't know what caused the pseudo-leaks to show up, maybe some special ART machinery for Services/Binders/Messengers/Handlers? Or whatever strange behavior might be tied to the "global variable in native code" instances.

pyricau commented 1 year ago

Thanks, that's super helpful.

Let me reason through this, I'll try to restate what you wrote, let me know if I got this wrong.

Here's the reasoning:

Implementation note: The value objects in a WeakHashMap are held by ordinary strong references. Thus care should be taken to ensure that value objects do not strongly refer to their own keys, either directly or indirectly, since that will prevent the keys from being discarded.

This is actually a common leak pattern in the Android framework: binders are held until the other side runs garbage collection and finalizes the binder on its end, no matter what we're doing with service lifecycle (that's because binders are used beyond services, the only way to be certain the other side won't call is when it doesn't have that pointer). The proper way to handler this is for the service code to finish its handler in onDestroy, or set the reference from the handler to the service to null in onDestroy.

The first and second leak traces are identical, the problem is that MediaBrowserServiceCompat$ServiceHandler has several direct and indirect references to the service.

The last leak is the exact same issue elsewhere, ExtraSession is a stub that won't be GCed until the other side has run its own GC https://github.com/aosp-mirror/platform_frameworks_support/blob/a9ac247af2afd4115c3eb6d16c05bc92737d6305/media/src/main/java/android/support/v4/media/session/MediaSessionCompat.java#L3413

So these 2 leaks are binder related leaks, and the leaks stay in place until a GC runs in the calling process, and you have no control over that. They really should be fixed.

Summary

You actually found 3 leaks in the framework code, 2 of which are transient and depend on GCs in other processes. You should file issues with Google.

Based on that, the problem statement is slightly different: you don't actually want to see all leak paths as there as a lot more than just the 5 you found (you only found 5 by excluding specific references).

What you really need is to see all distinct leak paths, where "distinct" only applies to a unique sub path that's made of suspect references.

In other words, ideally LeakCanary would have presented you with exactly 3 distinct leaks, since there are 3 logic bugs, even though a single service instance was leaking in the end.

Thinking about multi leaks

If there are N logic errors causing N bad references but all N end up leaking the same instance, we ideally should see N leaks.

The core problem is we don't know which references in a path are bad. We already have logic to figure which references are definitely not bad, from which we deduce the maybe bad. We use this once we've found a path, not as we traverse the graph.

So this would require a new decision point: currently, when we find an object that we've already visited, we ignore it, as we know we've already computed a better / shortest path to it. Here we'd need to approach this slightly differently and consider whether we just found a distinct / valuable new path to that object, or not. Keeping in my that "object" here isn't just instances we're tracking as leaking but really any object in the heap. Or maybe any object marked as leaking. Which requires evaluating whether objects are leaking during graph traversal, which could be a bit more expensive, even with some caching.

=> if I run into an object that I've already ran into, and that object is known as "leaking", then I compare my current path with the last path, but only the "unknown" parts of it (ignoring the "non leaking parts at the root). If the 2 paths are entirely distinct then we know there are distinct leaks and the new path should be stored as one of the alternative leaky paths.

This isn't all good yet because we don't actually keep track of the leaky paths while they're building, they'll all just stored in the priority queue. So you can't easily store this, but you now sorta need to keep track of the wonky objects that have multiple possible leaky paths.

Conclusion

This would likely require deep changes to the path finding algorithms, but it could yield a significant improvement in that LeakCanary could now surface all actual leak paths.

bubenheimer commented 1 year ago

Thank you for the elaboration. Sorry for my delayed response, I got swamped. Your restated reasoning looks sound.

I did create an issue for the DisplayManagerCompat problem a month ago; I assume the cache is supposed to offer a performance boost: https://issuetracker.google.com/issues/279625765

I want to create issues for the other leaks, too, but I will need to look at your analysis more closely. I'd say the issues that I file against the framework are commonly ignored and closed, so I usually regret spending the time.

I do think it's worth investing in the more advanced leak analysis that you are proposing: I only found the DisplayManagerCompat problem 6 years after I was first affected by it, and really only because I decided to allocate a boatload of time on leak analysis in this app; and in the end I still only found it because I got lucky with somewhat randomly going down the road of excluding leak paths.

bubenheimer commented 1 year ago

I've added your analysis to the Google issue I already had open for this, fingers crossed for a fix: https://issuetracker.google.com/issues/37137738

pyricau commented 1 year ago

ooo looks like there's traction that's great!

bubenheimer commented 1 year ago

@pyricau any bandwidth for chiming in on Google's fix?

https://issuetracker.google.com/issues/37137738#comment22

kelmer44 commented 7 months ago

After excluding this pattern I saw pseudo-leak no. 2:

I happened into this thread and was wondering how does one "exclude a pattern"?

Im having a similar issue and suspect maybe the leak report is choosing the wrong leak reference path

see thread https://github.com/androidx/media/issues/1059#issuecomment-2031345189