During the developer meeting it was suggested that I profile using Hooke's Law 1.0, since @pixelzoom feels that it was tested for memory leaks before release. I profiled this sim by first loading it and taking a heap snapshot (all snapshots are being done using Chrome developer tools, by the way), and then reloading it with `fuzzMouse=100` set as a query parameter and taking snapshots every 30 seconds. The testing showed that memory usage does indeed increase in this case. Here are the results:
Time (s) | Size (MB) |
---|---|
0 | 25.1 |
30 | 39.6 |
60 | 52.3 |
90 | 61.2 |
120 | 72.0 |
150 | 81.1 |
180 | 91.9 |
Here's a screenshot of the developer tools at the end of this testing:
For reference, the developer tools web site says that the number below the snapshot name is the “total size of the reachable JavaScript objects”, see https://developer.chrome.com/devtools/docs/heap-profiling.
From this, it would appear that memory allocation is indeed increasing.
@samreid may have found a smoking gun. He ran these tests on the 'blast' sim, which doesn't have much going on, and found that memory was increasing there too. He commented out the phet-menu, and found that memory allocation no longer increased. Nice detective work!
I just tested the current master version of Hooke's Law with a line of code added to PhetButton.js to make it unpickable so that the PhetMenu won't be created during fuzz testing. The memory allocation still goes up, but much more slowly than it did before, and it stays pretty close to 30MB total. Seems like PhetMenu may be responsible for the bulk of the issue. Here is a screenshot of the data from this test:
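For reference, here's a sketch of the kind of one-line change used for that test. This is illustrative only, not the actual PhetButton.js code, though `pickable` is a standard scenery Node property:

```js
// Somewhere in the PhetButton constructor (illustrative placement): making
// the button unpickable means the fuzzer's synthetic pointer events never
// reach it, so the PhetMenu is never created during the test.
this.pickable = false;
```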
AccessiblePeer looks suspect. Its constructor has `@param {DOMElement} domElement`, to which it adds listeners. It has no `dispose` function, so it never removes listeners. I don't know if that's significant or related to this, but AccessiblePeer is created by PhetMenu. And as a common-code component, AccessiblePeer should definitely have a `dispose` function.
Edit: AccessiblePeer does have a `dispose` function. But it doesn't call `removeEventListener`.
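To illustrate the concern, here's a minimal sketch of the listener-cleanup pattern being asked for. The names below are made up for illustration and are not the actual AccessiblePeer implementation:

```js
// Illustrative only - not the real scenery code. The key point is that the
// same function reference passed to addEventListener must be kept around so
// that dispose() can pass it to removeEventListener.
function PeerSketch( domElement ) {
  this.domElement = domElement;
  this.clickListener = function( event ) {
    // handle the event
  };
  this.domElement.addEventListener( 'click', this.clickListener );
}

PeerSketch.prototype.dispose = function() {
  // Without this call, the DOM element keeps a reference to the listener
  // (and, through its closure, to this peer), so neither can be collected.
  this.domElement.removeEventListener( 'click', this.clickListener );
};
```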
It may be related to AccessiblePeer, but I think something more basic may be going on too. I just ran a test where I put a `debugger` statement in the code just before `phetMenu.dispose()` is called in `PhetButton`, and this line of code is never hit if I click on the button and then click outside of the menu. If I do this several times, then take a heap snapshot and look at the count of PhetMenu objects, the number corresponds to the number of times I've clicked on the button. So it seems pretty clear that there is a path where PhetMenu is allocated and never disposed.
I also tried the same test but clicking on items in the menu instead of outside of it. In this case, the `dispose` function is called, but the count of PhetMenu objects still continues to increase according to the profiler. Maybe this is where `AccessiblePeer` comes into play.
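Another quick way to confirm the imbalance described in the two tests above, without stepping through in the debugger, would be to add temporary counters. This is purely illustrative instrumentation, not code that exists in joist:

```js
// Temporary instrumentation, illustrative only.
window.phetMenuCreated = 0;   // increment at the end of the PhetMenu constructor
window.phetMenuDisposed = 0;  // increment at the top of PhetMenu.dispose()

// After opening and dismissing the menu several times, compare the counters
// in the console; in the failure paths described above they won't match.
console.log( window.phetMenuCreated, window.phetMenuDisposed );
```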
Assigning to @samreid initially to have a look, since I believe he did a lot of this code originally. He can reassign as appropriate.
As for the priority of this issue, it seems like this is unlikely to be a problem for our customers, since they'd have to open the menu a lot of times in order to run into any memory issue, but it does make it hard to test sims for memory leaks since this leak would likely overshadow any others.
Marking as medium priority
Agreed with @jbphet that this is unlikely to be affecting users, but it does seem important to address so that we can test for memory leaks.
@samreid @jbphet should this issue also now be moved to joist?
I was going to let @samreid make that determination. Seems like a joist issue to me, but @pixelzoom mentioned `AccessiblePeer`, which is in Scenery, so I thought a little more investigation was in order before moving the issue.
`AccessiblePeer` should definitely have a completed dispose function, to be tracked in https://github.com/phetsims/scenery-phet/issues/226. But it is not a culprit for these memory leaks. `AccessiblePeer` is still hidden behind the `accessibility` query parameter and should never be touched in the tests mentioned in this issue.
I'll create a Joist issue for the AboutDialog memory, but let's leave this issue open until we test for other systemic issues.
We discussed the importance of tracking down all the memory leaks during the 4/12/2016 developer meeting and decided that we should determine whether the sims actually exhibit problems in a memory-constrained environment, such as an iPad 2. To test this, I fuzz tested (`fuzzMouse=100`) Acid Base Solutions v1.2.4 on an iPad 2 model A1395 with 16 GB of storage running iOS 9.3.1 (Witten). I ran two trials, and in both cases mobile Safari crashed (disappeared from the screen) at somewhere around the 25 minute mark. So I think the answer is probably yes - we should fix the leaks.
By the way, there is no special reason for choosing Acid Base Solutions. It's the first HTML5 sim in alphabetical order, and I profiled it using Chrome dev tools and found that it was leaking memory at around the same rate as the other sims. The version that I tested is one that is published as of this writing, and has none of the recent fixes to the framework that are meant to address the memory leaks.
We should test how long `master` of the same sim runs. Perhaps the joist memory fixes have helped?
Assigning to @jbphet.
I built a version of Acid Base Solutions off of master as of today and re-ran the fuzz test on it using the same iPad 2 referenced in one of the comments above. The results indicate that things are better, but the testing was ultimately a bit inconclusive because the home screen icons eventually scroll out of view due to https://github.com/phetsims/joist/issues/331, and as a result fuzz testing gets stuck on the home screen. This didn't happen until roughly 50 minutes of test time, so things are definitely better (since previously Safari would crash after about 25 minutes), but this test should be re-run once the issue with the moving home screen icons is fixed.
> The results indicate that things are better, but the testing was ultimately a bit inconclusive because the home screen icons eventually scroll out of view due to phetsims/joist#331, and as a result fuzz testing gets stuck on the home screen
I put a bandaid on the joist 331 issue, so you should be able to proceed with testing here.
Cool, thanks. I'll re-run the test and report the results today.
I re-ran the test on a built version of Acid Base Solutions that incorporated both the fixes for the memory leaks in common code and the 'bandaid' fix for the drifting icons on the home screen. This test was still running after 20 hours, which is a huge improvement.
Given the rate of memory leak (as measured in Chrome Dev Tools) and the fact that it took 20 minutes to crash an iPad, you should be able to extrapolate from the new rate of memory leak and project the crash time for the fixed app:
time to crash the fixed app in minutes = memory leak rate of broken app / memory leak rate of fixed app * 20
For instance, if you estimate the broken app leak rate is 1MB/min and the fixed app leak rate is 0.1MB/min, then we could extrapolate that the fixed app would crash after 1/0.1*20 = 200 minutes.
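Here's that extrapolation written out as a quick sketch, using the illustrative numbers above (not measured values):

```js
// All values are the illustrative estimates from the comment above.
var brokenLeakRate = 1.0;   // MB/min, estimated leak rate of the broken app
var fixedLeakRate = 0.1;    // MB/min, estimated leak rate of the fixed app
var brokenCrashTime = 20;   // minutes until the broken app crashed the iPad

// Assuming the crash happens once roughly the same amount of memory has
// leaked, the fixed app should survive proportionally longer.
var projectedCrashTime = ( brokenLeakRate / fixedLeakRate ) * brokenCrashTime;
console.log( projectedCrashTime ); // 200 minutes
```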
I'm somewhat interested in the numbers above, so I'll take a look.
Running 1.2.4 with `?fuzzMouse=100`: initial memory 34.5 MB, after 30 seconds 59 MB (average over two runs).
For a fresh build of master with `fuzzMouse=100`: initial memory 34.5 MB, after 30 seconds 35.7 MB.
Hence the "fixed" app still has a slow leak that would crash the iPad after roughly 24.5/1.2*20 = 408 minutes, which is roughly 7 hours.
Oh wait, @jbphet reported that the fixed app was running over 20 hours, so there must be something else at play here.
It's not safe to assume that if there is more memory allocated after 30 seconds there is a memory leak. My experience has been that the simulations continue to allocate memory for a while during fuzz testing, which makes sense, since there are new portions of the code being run that may allocate memory. The sim's memory allocation tends to stabilize after a few minutes once more of the code has been encountered during the testing. A better test would be to start it, let it run for two minutes, then take several memory snapshots at, say, 30 second intervals.
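Here's a small sketch of that suggested procedure - skip the warm-up period, then compute an average rate from evenly spaced snapshots. The snapshot sizes below are placeholders, not measurements:

```js
// Placeholder data: heap snapshot sizes in MB, taken every 30 seconds
// *after* letting the sim fuzz for ~2 minutes, so that start-up allocations
// (lazily loaded code paths, caches, etc.) are mostly out of the picture.
var snapshotIntervalSeconds = 30;
var snapshotSizesMB = [ 26.1, 26.2, 26.2, 26.3, 26.4 ];

// Average leak rate over the measured window, in kB per second.
var elapsedSeconds = ( snapshotSizesMB.length - 1 ) * snapshotIntervalSeconds;
var lastSize = snapshotSizesMB[ snapshotSizesMB.length - 1 ];
var leakRateKBPerSecond = ( lastSize - snapshotSizesMB[ 0 ] ) * 1000 / elapsedSeconds;
console.log( leakRateKBPerSecond ); // ~2.5 kB/sec for the placeholder data
```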
Fuzz test on iPad 2 on Acid Base Solutions with mem leak fix and icon shift bandaid (details of test described above) still running after 24 hours with no discernible slowdown. I'm calling the test a success and shutting it down.
> still running after 24 hours with no discernible slowdown
I didn't realize that the performance also changed.
How was the memory profile after 24 hours? Leaking or not?
> I didn't realize that the performance also changed.
I entered the bit about "no discernible slowdown" because during the previous test - the one where mobile Safari crashed after ~25 minutes - the sim also looked a bit slower (@jonathanolson was in the room at the time and thought so too). I wanted to make it clear for the record that, as far as I could tell, the sim was behaving normally after 24 hours.
> How was the memory profile after 24 hours? Leaking or not?
The test was run on an iPad, so I wasn't collecting memory profile data. It may be possible to obtain this by tethering to a Mac, but I didn't do so. The point of the test was to make sure that memory leakage isn't bad enough to crash the browser or cause other problems when running on the most memory limited device that we support.
I re-ran the profiling test from https://github.com/phetsims/tasks/issues/546#issuecomment-198136011 above on the current master version of Hooke's Law. This was a built version. Below are the results, with the initial results included:
Time (s) | Size (MB) Before Fixes | Size (MB) After Fixes |
---|---|---|
0 | 25.1 | 24.4 |
30 | 39.6 | 25.2 |
60 | 52.3 | 25.6 |
90 | 61.2 | 25.7 |
120 | 72.0 | 25.9 |
150 | 81.1 | 26.1 |
180 | 91.9 | 26.1 |
If I assume that memory usage stabilizes at the 1 minute mark, the average leak rate before the changes was ( 91.9 - 52.3 ) / 120 = 330 kB/sec. The leak rate after the fixes was ( 26.1 - 25.6 ) / 120 = 4 kB/sec. There still appears to be a slow leak, but it is almost two orders of magnitude less than it was before the fixes.
I tried gathering data at the 10 minute mark, but when this test was run on the pre-fixes version (v1.0.4), the dev tools crashed when trying to take a heap snapshot. For the post-fixes version, the total memory size consumed at 10 minutes was 26.5 MB, which means it was leaking at around ( 26.5 - 26.1 ) / 420 = 952 bytes/sec from the 3 minute mark to the 10 minute mark.
Marking for developer meeting to decide whether to close this, do a bit more investigation on a simple sim, or unassign.
My current recommendation (I reserve the right to change my mind after discussion):
As part of development of new sims or when old sims are brought up to standards, this code review will help us discover and address leaks in simulations and common code.
Discussed in the 4/21/2016 developer meeting and we decided to add a screen to the "blast" sim and run this memory testing on that, then determine next steps based on the results. I will add the screen and will work with @phet-steele to gather the testing data.
I've added the screen and published the new sim to spot, see http://www.colorado.edu/physics/phet/dev/html/blast/0.0.0-dev.5/blast_en.html. I've discussed the data gathering with @phet-steele and he's all over it. Assigning to him.
Win 10 Chrome:
Time (min) | Size (MB) |
---|---|
0:00 | 16.5 |
0:30 | 17.2 |
1:00 | 17.5 |
1:30 | 17.9 |
2:00 | 18.2 |
2:30 | 17.9 |
3:00 | 18.1 |
3:30 | 18.3 |
4:00 | 18.0 |
4:30 | 18.1 |
5:00 | 18.4 |
5:30 | 18.5 |
6:00 | 18.2 |
6:30 | 18.6 |
7:00 | 18.5 |
7:30 | 18.5 |
8:00 | 18.6 |
8:30 | 18.6 |
9:00 | 18.3 |
9:30 | 18.6 |
10:00 | 18.3 |
I know everyone was yearning for a graphical representation:
@jbphet did a snapshot at the time of this writing out of curiosity (18 minutes after the test) and it had climbed to 19.2 MB. So we can say ~28 minutes after startup the sim had not quite plateaued.
I talked with @samreid about re-running this test with some Chrome flags set that might help us reduce the impact of memory allocations made by the browser and thus distinguish them from memory allocations made by the sim code. We decided to try setting the `v8-cache-options` value to `Disabled`. These flags can be set in Chrome using `chrome://flags` in the address bar. Here are the results on the 2-screen Blast test:
Snapshot | Time (min) | Size (MB) |
---|---|---|
1 | 0:00 | 8.5 |
2 | 0:30 | 8.8 |
3 | 1:00 | 9.3 |
4 | 1:30 | 9.5 |
5 | 2:00 | 9.7 |
6 | 2:30 | 9.8 |
7 | 3:00 | 9.5 |
8 | 3:30 | 9.6 |
9 | 4:00 | 9.9 |
10 | 4:30 | 9.9 |
11 | 5:00 | 9.9 |
12 | 5:30 | 9.6 |
13 | 6:00 | 9.6 |
14 | 6:30 | 9.6 |
15 | 7:00 | 9.7 |
16 | 7:30 | 9.9 |
17 | 8:00 | 9.7 |
18 | 8:30 | 9.7 |
19 | 9:00 | 9.9 |
20 | 9:30 | 10.0 |
21 | 10:00 | 9.9 |
The total memory allocation is less than it was when the flag was set to `Default`, but the amount of variation between the snapshots is pretty similar.
Here's a graph of the data in the previous comment:
We discussed this issue in the 5/12/2016 developer meeting and decided that the current state of affairs is acceptable. The big memory leaks have been fixed, and we have several tests that demonstrate that there are no major leaks in most sims. While there is some "noise" in the measurement (probably due to the way the V8 engine works, but possibly just because of where the snapshots occur during the operation of the sim), this doesn't prevent us from determining whether significant memory leaks are occurring. Closing.
While preparing a release candidate for Gravity Force Labs, @aadish and I ran a test where we loaded the sim with the `fuzzMouse=100` query param and periodically took heap snapshots using the Chrome developer tools. I expected to see the memory increase to some value and then stabilize, figuring that garbage collection would kick in and limit memory consumption. What we saw instead is that the memory kept increasing and, after several minutes, got to the point where trying to take a heap snapshot resulted in the dev tools disconnecting, requiring a reload of the sim to gather any more data. This was when the memory allocation was around 100 MB, which seems like a lot. The developers discussed this in the 3/17/2016 developer meeting, and we're not sure whether it means that sims are leaking memory, but we agreed that it should probably be investigated. I'm on the hook to start looking at this.