samreid opened this issue 1 month ago (status: Open)
@samreid In FEL & GasProps, we had to detect mobile Safari and set webgl=false. There is a common-code issue tracking a more general fix in https://github.com/phetsims/scenery/issues/1649. Unfortunately, it doesn't appear to be prioritized, and JO doesn't have access to the tools he needs to make progress on it.
Edit: I asked for the underlying scenery issue to be prioritized in the #planning Slack channel.
This looks like a slightly different cause than in FEL / GasProps, since we're using three.js. This issue is my highest priority; I just got back from vacation, and the FEL crash was very tricky to reproduce.
I believe this is likely related to the memory consumption of having several ThreeStages (in WebGL, each one needs its own memory and resources). Launching with ?screens=1 allows Buoyancy to launch on my iPhone, where it fails to launch without that query parameter.
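To make the per-stage cost more concrete, here is a rough back-of-the-envelope model of the drawing-buffer memory each WebGL renderer (one per ThreeStage) might need. The byte sizes are assumptions (RGBA8 color plus a 24-bit depth / 8-bit stencil buffer per sample, and an extra resolve buffer when multisampling), not measurements, and real drivers and compositors add more on top.

```typescript
// Rough, hypothetical model of drawing-buffer memory for one WebGL renderer.
// Assumed sizes, not measurements:
//   - 4 bytes/pixel RGBA8 color + 4 bytes/pixel depth24/stencil8, per sample
//   - with MSAA, one extra single-sample color buffer for the resolve step
// Textures, geometry, and compositor copies are NOT included.

interface BufferEstimateInput {
  cssWidth: number;    // canvas width in CSS pixels
  cssHeight: number;   // canvas height in CSS pixels
  pixelRatio: number;  // renderer pixel ratio, typically devicePixelRatio
  msaaSamples: number; // 1 with antialias off; e.g. 4 with antialias on
}

const BYTES_PER_SAMPLE = 4 + 4; // assumed color + depth/stencil bytes

function estimateStageBufferBytes(input: BufferEstimateInput): number {
  const width = Math.round(input.cssWidth * input.pixelRatio);
  const height = Math.round(input.cssHeight * input.pixelRatio);
  const pixels = width * height;
  const multisampled = pixels * BYTES_PER_SAMPLE * input.msaaSamples;
  // Assumed resolve target when multisampling is on.
  const resolve = input.msaaSamples > 1 ? pixels * 4 : 0;
  return multisampled + resolve;
}

// Each screen with its own ThreeStage pays this cost separately.
function estimateTotalBufferBytes(stageCount: number, input: BufferEstimateInput): number {
  return stageCount * estimateStageBufferBytes(input);
}
```

Under this simplified model, buffer memory grows linearly with the number of stages, with the square of the pixel ratio, and roughly with the MSAA sample count, so sharing one renderer, lowering the pixel ratio, and disabling antialiasing are all plausible levers.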
I believe Buoyancy might be using more textures and data than Density.
I think it will be necessary to create just a single ThreeIsometricNode and pass it back and forth between screens. We would then store each screen's three.js content in a THREE.Object3D and add/remove it so that only the active screen's content is present. Additionally, the camera and projection matrices would presumably need to be updated.
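The shared-renderer idea can be sketched as follows. This is a hypothetical illustration, not mobius code: `Container` stands in for `THREE.Object3D` (so the sketch runs without three.js), and `SharedStage` and `CameraSettings` are invented names. With three.js, the add/remove calls would be `scene.add(...)` / `scene.remove(...)`, followed by setting the camera and calling `camera.updateProjectionMatrix()`.

```typescript
// Hypothetical sketch: one shared scene, with each screen's 3D content kept in
// its own container that is swapped in when that screen becomes active.
// `Container` is a stand-in for THREE.Object3D.

class Container {
  readonly label: string;
  readonly children: Container[] = [];
  parent: Container | null = null;

  constructor(label: string) {
    this.label = label;
  }

  add(child: Container): void {
    child.parent?.remove(child); // an object can only have one parent
    child.parent = this;
    this.children.push(child);
  }

  remove(child: Container): void {
    const index = this.children.indexOf(child);
    if (index >= 0) {
      this.children.splice(index, 1);
      child.parent = null;
    }
  }
}

// Per-screen camera settings (hypothetical shape).
interface CameraSettings {
  position: [number, number, number];
  zoom: number;
}

class SharedStage {
  readonly scene = new Container('scene');
  camera: CameraSettings = { position: [0, 0, 10], zoom: 1 };
  private activeContent: Container | null = null;

  // Show only the given screen's content and apply that screen's camera.
  activate(content: Container, camera: CameraSettings): void {
    if (this.activeContent !== content) {
      if (this.activeContent) {
        this.scene.remove(this.activeContent);
      }
      this.scene.add(content);
      this.activeContent = content;
    }
    // With three.js, this is where camera.position and camera.zoom would be
    // set and camera.updateProjectionMatrix() called.
    const [x, y, z] = camera.position;
    this.camera = { position: [x, y, z], zoom: camera.zoom };
  }
}
```

The presumed benefit is that all geometries, materials, and textures live in one WebGL context, so switching screens detaches objects instead of keeping several contexts alive at once.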
Alternatively, it would be possible to explore reducing the size of included resources (but that probably wouldn't help as much).
@samreid thoughts? Should we collaborate on the suggested approach?
Testing on iPad 7 running iPadOS 17.1.2, on a built version of buoyancy with ?fuzz&screens=1, it crashed at around:
Testing published density (all 3 screens) with fuzz, it crashes in:
So if a 2m 30s fuzz crash is on production, perhaps we just need to do as well or better?
UPDATE: while tethered to Safari, published density with all 3 screens fuzzed for > 10 minutes with no crash.
When it crashes in a WebView with Xcode tethered, I get these errors:
0x10f000a00 - [PID=601] WebProcessProxy::didClose: (web process 601 crash)
0x10f000a00 - [PID=601] WebProcessProxy::processDidTerminateOrFailedToLaunch: reason=Crash
0x10e01c720 - ProcessAssertion: Failed to acquire RBS Background assertion 'XPCConnectionTerminationWatchdog' for process because PID 0 is invalid
0x10e01c720 - ProcessAssertion::acquireSync Failed to acquire RBS assertion 'XPCConnectionTerminationWatchdog' for process with PID=0, error: (null)
0x10202e018 - [pageProxyID=6, webPageID=7, PID=601] WebPageProxy::processDidTerminate: (pid 601), reason=Crash
0x10202e018 - [pageProxyID=6, webPageID=7, PID=601] WebPageProxy::dispatchProcessDidTerminate: reason=Crash
Attaching the memory profiler in the Xcode debugger shows that memory keeps climbing. I wonder if we should fix the memory leaks before testing this further?
After fixing #168, I built buoyancy_en.html and tested it on my iPad 7. I used it for a full 6 minutes, interacting with every component ("manual fuzzing" as best I could), and it did not crash. We also saw that the behavior on the iPhone was corrected after fixing the memory leaks. However, fuzzing crashed it after 24 seconds in the 1st run and 25 seconds in the 2nd run.
Tethering to Safari is very flaky, even when fuzzing a single screen, but I managed to get this timeline memory snapshot:
It looks like every now and then there is a large memory spike and a long GC time:
Oh, it's just every 10 seconds when it takes a snapshot.
With Xcode tethered and all screens of built buoyancy fuzzing at fuzzRate=5, there seems to be a leak in the "other processes", which gained 400MB over 30 seconds:
Using Instruments, we see that VM: IOSurface is leaking rapidly:
Here is the caller of those allocations:
Here is the stack:
In text
IOSurfaceClientLookupFromMachPort
-[IOSurface initWithMachPort:]
WebCore::IOSurface::createFromSendRight(WTF::MachSendRight const&&)
decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul>::__dispatch[abi:v160006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<WTF::Visitor<WebKit::RemoteLayerBackingStoreProperties::layerContentsBufferFromBackendHandle(std::__1::variant<WebKit::ShareableBitmapHandle, WTF::MachSendRight>&&, WebKit::RemoteLayerBackingStoreProperties::LayerContentsType)::$_1, WebKit::RemoteLayerBackingStoreProperties::layerContentsBufferFromBackendHandle(std::__1::variant<WebKit::ShareableBitmapHandle, WTF::MachSendRight>&&, WebKit::RemoteLayerBackingStoreProperties::LayerContentsType)::$_2>>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, WebKit::ShareableBitmapHandle, WTF::MachSendRight>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<WTF::Visitor<WebKit::RemoteLayerBackingStoreProperties::layerContentsBufferFromBackendHandle(std::__1::variant<WebKit::ShareableBitmapHandle, WTF::MachSendRight>&&, WebKit::RemoteLayerBackingStoreProperties::LayerContentsType)::$_1, WebKit::RemoteLayerBackingStoreProperties::layerContentsBufferFromBackendHandle(std::__1::variant<WebKit::ShareableBitmapHandle, WTF::MachSendRight>&&, WebKit::RemoteLayerBackingStoreProperties::LayerContentsType)::$_2>>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, WebKit::ShareableBitmapHandle, WTF::MachSendRight>&)
WebKit::RemoteLayerBackingStoreProperties::layerContentsBufferFromBackendHandle(std::__1::variant<WebKit::ShareableBitmapHandle, WTF::MachSendRight>&&, WebKit::RemoteLayerBackingStoreProperties::LayerContentsType)
WebKit::RemoteLayerTreePropertyApplier::applyPropertiesToLayer(CALayer*, WebKit::RemoteLayerTreeNode*, WebKit::RemoteLayerTreeHost*, WebKit::RemoteLayerTreeTransaction::LayerProperties const&, WebKit::RemoteLayerBackingStoreProperties::LayerContentsType)
WebKit::RemoteLayerTreePropertyApplier::applyProperties(WebKit::RemoteLayerTreeNode&, WebKit::RemoteLayerTreeHost*, WebKit::RemoteLayerTreeTransaction::LayerProperties const&, WTF::HashMap<WebCore::ProcessQualified<WTF::ObjectIdentifierGeneric<WebCore::PlatformLayerIdentifierType, WTF::ObjectIdentifierMainThreadAccessTraits>>, std::__1::unique_ptr<WebKit::RemoteLayerTreeNode, std::__1::default_delete<WebKit::RemoteLayerTreeNode>>, WTF::DefaultHash<WebCore::ProcessQualified<WTF::ObjectIdentifierGeneric<WebCore::PlatformLayerIdentifierType, WTF::ObjectIdentifierMainThreadAccessTraits>>>, WTF::HashTraits<WebCore::ProcessQualified<WTF::ObjectIdentifierGeneric<WebCore::PlatformLayerIdentifierType, WTF::ObjectIdentifierMainThreadAccessTraits>>>, WTF::HashTraits<std::__1::unique_ptr<WebKit::RemoteLayerTreeNode, std::__1::default_delete<WebKit::RemoteLayerTreeNode>>>, WTF::HashTableTraits> const&, WebKit::RemoteLayerBackingStoreProperties::LayerContentsType)
WebKit::RemoteLayerTreeHost::updateLayerTree(WebKit::RemoteLayerTreeTransaction const&, float)
WebKit::RemoteLayerTreeDrawingAreaProxy::commitLayerTree(IPC::Connection&, WTF::Vector<std::__1::pair<WebKit::RemoteLayerTreeTransaction, WebKit::RemoteScrollingCoordinatorTransaction>, 0ul, WTF::CrashOnOverflow, 16ul, WTF::FastMalloc> const&)
WebKit::RemoteLayerTreeDrawingAreaProxy::didReceiveMessage(IPC::Connection&, IPC::Decoder&)
IPC::MessageReceiverMap::dispatchMessage(IPC::Connection&, IPC::Decoder&)
WebKit::WebProcessProxy::didReceiveMessage(IPC::Connection&, IPC::Decoder&)
IPC::Connection::dispatchMessage(std::__1::unique_ptr<IPC::Decoder, std::__1::default_delete<IPC::Decoder>>)
IPC::Connection::dispatchIncomingMessages()
WTF::RunLoop::performWork()
WTF::RunLoop::performWork(void*)
__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__
__CFRunLoopDoSource0
__CFRunLoopDoSources0
__CFRunLoopRun
CFRunLoopRunSpecific
GSEventRunModal
-[UIApplication _run]
UIApplicationMain
0x1c6f81760
0x1c6f81610
0x1c6c7fae8
0x1042b553f
main
start
Running with ?screens=1&fuzz&fuzzEvents=5 still shows the same leak, so reusing the same three.js canvas likely won't solve the problem:
More discoveries:
Surprisingly, even with all Sprites removed, FEL still shows a lot of churn in Total Bytes, but Persistent Bytes stabilize at 182MB:
Harness:
Leak in VM: IOSurface
UPDATE: Some of my conclusions above may be wrong since I have been looking at total MB instead of persistent MB.
Disabling the cache in WKWebView:
Here is a minimal patch showing that changing screen.view.setVisible( visible ); to screen.view.setVisible( true ); in Sim.ts avoids the VM: IOSurface leak in FEL, on my tethered iPad in Instruments. This means that toggling ScreenView.setVisible(true)/setVisible(false) leads to the leak.
By the way, this is FEL in main right now, which does exhibit the leak (even though there is no WebGL):
Another minimal patch with no leak, by showing all screens at once:
Running buoyancy with ?webgl=false shows only 5 VM IOSurfaces. Surprisingly, changing the warning message to render with canvas starts leaking IOSurfaces:
If the DOM element is made small enough, IOSurfaces do not leak:
A self-contained example leaks 311 IOSurfaces within 30 seconds:
UPDATE: The same test shows no leak on the iOS simulator.
Buoyancy leaked 20GB in 34 minutes. When will it crash?
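The two leak measurements in this thread can be turned into rates for comparison. This is plain arithmetic on the reported numbers, assuming 1GB = 1024MB and a constant leak rate (a simplification: the rate likely depends on fuzzRate and on which screens are active).

```typescript
// Back-of-the-envelope leak rates from two observations in this thread:
// 400MB over 30 seconds (Xcode, "other processes", fuzzRate=5) and
// 20GB over 34 minutes (Instruments, buoyancy). Assumes 1GB = 1024MB.

const MB_PER_GB = 1024;

function leakRateMBPerSecond(leakedMB: number, seconds: number): number {
  if (seconds <= 0) throw new Error('duration must be positive');
  return leakedMB / seconds;
}

// Seconds until the leaked total reaches `targetMB`, assuming a constant rate.
function secondsToReach(targetMB: number, rateMBPerSecond: number): number {
  return targetMB / rateMBPerSecond;
}

const xcodeRate = leakRateMBPerSecond(400, 30);                       // ≈ 13.3 MB/s
const instrumentsRate = leakRateMBPerSecond(20 * MB_PER_GB, 34 * 60); // ≈ 10.0 MB/s

console.log(xcodeRate.toFixed(1), instrumentsRate.toFixed(1));
console.log((secondsToReach(4 * MB_PER_GB, instrumentsRate) / 60).toFixed(1), 'minutes to 4GB');
```

At the Instruments rate, reaching 4GB takes about 6.8 minutes; the rate alone does not say when the process will crash, since that depends on the device's memory limit.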
It could be that there were crashes during the above trace and the instrumentation just kept going. However, I carefully watched a new run for 5 minutes, saw it go above 4GB, and it did not crash.
I used the built buoyancy sim for several minutes and it did not crash. I'm not sure how we will confidently proceed here.
Let's talk at standup about what else should be done before the QA test.
At today's standup, we discussed that we have been able to run on iPad pretty well and aren't too concerned about more work on it before QA. We agreed that the proposed refactoring to use a single WebGL context sounds very difficult, and we don't want to do it. When we QA test, we would like to request testing on iPhones. Closing.
I have an iPad 9th generation with iOS 17.6.1. I started testing Buoyancy in https://github.com/phetsims/qa/issues/1136 and experienced 2 crashes within 5 minutes of testing. Unfortunately, it happens randomly (once while switching screens, another time while on the Compare screen), so there aren't steps I can give you to reproduce it, but interactive highlights were on in both cases. I have noticed that the crash happens pretty easily when adding a query parameter (e.g. ?stringTest=double): the sim crashes before it even finishes loading.
On my iPhone 12 Pro with iOS 17.5.1, the sim immediately crashes and I get this message:
Another easy way to get my iPad to crash is to go to the Compare, Lab, or Shapes screen and try to take a screenshot using the tool in the PhET Menu. EDIT: Turn on interactive highlights first.
@Nancy-Salpepi can you please see how much it crashes if running with ?screens=1 and ?screens=1,2
Can it be crashed without interactive highlights?
can you please see how much it crashes if running with ?screens=1 and ?screens=1,2
--I didn't get a crash without interactive highlights on with the above query parameters. With ?screens=4 and a styrofoam duck, taking a screenshot with the tool in the PhET menu causes a crash every time. Aside from that, I didn't get a crash.
--Sometimes the sim crashes when I am modifying the url. https://github.com/user-attachments/assets/fb97732a-219c-4337-ab93-40b23e0ed780
Can it be crashed without interactive highlights?
--It repeatedly crashes on my iPhone; it never even opens. On the iPad, it has only crashed twice without interactive highlights on, over a long span of time.
I did tether the iPad to my Mac, and WebGL warning messages pop up when the sim loads (but nothing further appears when it crashes):
I also asked @Matthew-Moore240 and @Ashton-Morris to take a look on their iPhones since they have a different model.
Let me know if there is anything else I can do.
No crashes for me, but the sim does have various UI issues. Pictures are under details.
iPhone 15 Pro Max.
Latest iOS
At first it didn't crash and the UI looked fine. But after reloading it a few times, it did crash with the same message Nancy reported above.
At one point it asked me whether I wanted to lower privacy restrictions in Safari to improve performance while using the sim (this was both before and after the crash).
It worked well for a long time and then crashed again after reloading it. I tried it in both regular and private browsing in mobile Safari.
I also experienced a crash after reloading the page a couple of times. I had forgotten to turn off an app that controls dark mode in the browser, which was causing the weird graphical issues. It still crashes after a few reloads but the sim looks normal now. Apologies!
I didn't experience any crashes with ?webgl=false, including 10 minutes of fuzzing.
Nice! Can you please run another test? Remove ?webgl=false (so WebGL is back on and we will see the blocks) and try a test with supportsInteractiveDescription=false instead. Thanks!!
With supportsInteractiveDescription=false, the sim crashed twice in 5 minutes while fuzzing. Aside from that, I was only able to get it to crash when taking screenshots during ~25 minutes of testing.
@Nancy-Salpepi volunteered to test supportsInteractiveDescription=false on the iPhone, thanks!
Should we test https://phet.colorado.edu/sims/html/density/latest/density_all.html (density 1.1) for crashing behavior, so we can narrow down when it was introduced?
OK! Here is the latest update from me:
Adding ?supportsInteractiveDescription=false to the Buoyancy URL, it still crashes repeatedly.
@Nancy-Salpepi these tests are very helpful. Let's test the hypothesis that the complex geometries (boat, bottle, duck, etc.) may be related to the problem. Can you test this next?
On my iPhone: Buoyancy opens with ?screens=1,2,3, ?screens=1,2,3,4, ?screens=1,2,3,5, and ?screens=4,5
10 minutes of fuzz testing with ?screens=1,2,3 produced no crashes.
10 minutes of fuzz testing with ?screens=4,5 produced no crashes.
10 minutes of fuzz testing with ?screens=1,2,4 produced no crashes.
I added a script so I can tell how many times the sim has crashed; otherwise you have to watch very carefully to see when the crash occurs.
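A crash counter along these lines can be sketched as follows. This is a hypothetical reconstruction of the idea, not the script used here: each page load increments a number persisted in storage, so a jump in the run number means the page was reloaded (for example, after the WebContent process crashed). `recordRun`, `MemoryStore`, and the storage key are invented names; in a browser you would pass `window.localStorage`, which matches the `KeyValueStore` shape.

```typescript
// Hypothetical crash counter: every load increments a persisted run number.
// In a browser, pass window.localStorage; MemoryStore is a stand-in so the
// sketch runs anywhere.

interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const RUN_COUNT_KEY = 'crashCounter.runNumber'; // hypothetical key name

// Returns the run number for this page load (1 on the very first load).
function recordRun(store: KeyValueStore): number {
  const previous = Number.parseInt(store.getItem(RUN_COUNT_KEY) ?? '0', 10);
  const run = (Number.isFinite(previous) ? previous : 0) + 1;
  store.setItem(RUN_COUNT_KEY, String(run));
  return run;
}

class MemoryStore implements KeyValueStore {
  private readonly values = new Map<string, string>();
  getItem(key: string): string | null {
    return this.values.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.values.set(key, value);
  }
}
```

In a page, something like `console.log( 'run ' + recordRun( window.localStorage ) )` (or drawing the number on screen) would show the current run. Manual reloads also increment the counter, so clear the key before starting a fuzz session.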
With this patch, I saw many crashes on my iPad and iPhone. Here is a table of the run number as a function of time; each increment in the run number indicates a crash. This is testing a new build from main (it includes the recent grab drag handler keys, which have a memory leak):
iPad, fuzzing the built version:
6:43pm run 1 starting; crashed at 6:45pm and correctly says "run 2"
6:45pm run 2
6:50pm run 3
6:52pm run 4
6:56pm run 5
6:58pm run 6 - NOTE: it is crashing around once every 3 minutes
7:00pm run 7
7:03pm run 8
7:07pm run 9
7:14pm run 10
7:17pm run 11
iPhone 15 Pro Max: the full built version just crashes on startup.
With screens=1,2,3 it can launch.
screens=1,2,3,4 crashes on startup.
screens=1,2,3,5 fuzzes.
screens=4 (Shapes):
6:56pm run 3
6:58pm run 4 - NOTE: crashed with only one screen, but it was the Shapes screen
7:00pm run 4
7:03pm run 5
7:07pm run 7
7:10pm run 8
7:17pm run 9
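The note that the iPad is crashing "around once every 3 minutes" can be checked against the iPad log above. Each new run number corresponds to one crash and reload, so the mean interval is the span from the first to the last run start divided by the number of crashes in between. The helper names are illustrative.

```typescript
// Mean time between crashes from the iPad run log above.
// All entries are pm, so times are counted in minutes after noon.

function toMinutes(clock: string): number {
  const [hours, minutes] = clock.split(':').map(Number);
  return (hours % 12) * 60 + minutes;
}

// Mean of consecutive intervals; for sorted times this telescopes to
// (last - first) / (count - 1).
function meanIntervalMinutes(times: string[]): number {
  if (times.length < 2) throw new Error('need at least two run starts');
  const minutes = times.map(toMinutes);
  return (minutes[minutes.length - 1] - minutes[0]) / (minutes.length - 1);
}

const iPadRunStarts = ['6:43', '6:45', '6:50', '6:52', '6:56', '6:58', '7:00', '7:03', '7:07', '7:14', '7:17'];
console.log(meanIntervalMinutes(iPadRunStarts)); // 3.4
```

That gives 10 crashes in 34 minutes, about one every 3.4 minutes, consistent with the note.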
However, in Chrome, it just looks like a standard memory leak:
Let's make sure the memory leak is stopped before running more tests here. My results here are all for a build on main, not for the dev test.
@Nancy-Salpepi I wanted to double check that the tests you reported above are for the built dev test https://phet-dev.colorado.edu/html/buoyancy/1.2.0-dev.4/phet/buoyancy_all_phet.html?fuzz (or maybe _en?) and not on phettest. Let me know either way.
Yes, they were for the dev version. Let me know if there is something else you need me to check or recheck, or if you want me to do some memory testing with the dev versions.
Patch that adds a query parameter ?threeRendererPixelRatio:
Patch that adds an option to antialias the three renderer:
The iPhone 15 Pro Max crashes repeatedly on startup with antialias=true, but both the iPad 7 and the iPhone have been fuzzing with antialias=false for more than 15 minutes with no crashes.
I'll request review from @zepumph, particularly on whether we want to use the nested options pattern. After that's committed, we could request testing (on a built version) from QA, or do that in the next dev or RC test. If we want more tuning parameters, we could add threeRendererAntialias and threeRendererPixelScale to the query parameters.
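For context on the review question, here is a hypothetical sketch of what the nested options pattern could look like for renderer settings: options meant for the three.js renderer travel together in one sub-object and are merged with their defaults separately, so overriding `antialias` does not discard the other renderer defaults. The names and defaults are invented for illustration; PhET code would typically express this with `optionize`, but plain object spreads are used here so the sketch runs standalone.

```typescript
// Hypothetical nested-options sketch; names and defaults are illustrative,
// not the actual mobius API.

interface RendererOptions {
  antialias: boolean;
  alpha: boolean;
}

interface StageOptions {
  backgroundColor: string;
  threeRendererOptions: RendererOptions; // nested: forwarded to the renderer
}

type StageProvidedOptions = Partial<Omit<StageOptions, 'threeRendererOptions'>> & {
  threeRendererOptions?: Partial<RendererOptions>;
};

const DEFAULT_STAGE_OPTIONS: StageOptions = {
  backgroundColor: 'black',
  threeRendererOptions: { antialias: true, alpha: true },
};

function resolveStageOptions(provided: StageProvidedOptions = {}): StageOptions {
  return {
    ...DEFAULT_STAGE_OPTIONS,
    ...provided,
    // Merge the nested object separately so that providing { antialias: false }
    // keeps the other renderer defaults.
    threeRendererOptions: {
      ...DEFAULT_STAGE_OPTIONS.threeRendererOptions,
      ...provided.threeRendererOptions,
    },
  };
}
```

The appeal of nesting is that the resolved sub-object can be handed unchanged to `new THREE.WebGLRenderer( ... )`, whose constructor accepts `antialias` and `alpha`; the cost is the extra merge step shown above.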
I sent this to @jonathanolson and @zepumph in Slack:
We found that by disabling antialiasing on mobile Safari, we were able to fuzz for 15+ minutes with no crashes on the iPad and iPhone, so that seemed good. We also saw that the iPhone 15 Pro Max has a pixel density of 3, so on mobile Safari we also reduced the pixel ratio to 0.9 * devicePixelRatio. It still looks good enough on iPad and iPhone. We additionally created query parameters in MobiusQueryParameters. Now it has:
I also took several (10+) screenshots on the iPad and it did not crash.
We think we are ready to see how it fares on QA devices. If QA shows consistent crashing, we can request additional testing with the query parameters to see how it affects things.
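The renderer policy described in the Slack message can be sketched as follows. This is a hypothetical illustration, not the committed code: on mobile Safari it disables antialiasing and uses 0.9x the device pixel ratio; elsewhere it keeps antialiasing and the full ratio; optional query parameters override both. The parameter names reuse the ones floated above (threeRendererAntialias, threeRendererPixelScale), but the actual names and types in MobiusQueryParameters may differ, and `URLSearchParams` stands in for PhET's query-parameter machinery.

```typescript
// Hypothetical renderer policy with query-parameter overrides.

interface RendererSettings {
  antialias: boolean;
  pixelRatio: number;
}

function readBoolean(params: URLSearchParams, key: string): boolean | null {
  const raw = params.get(key);
  if (raw === 'true') return true;
  if (raw === 'false') return false;
  return null; // missing or malformed: use the platform default
}

function readPositiveNumber(params: URLSearchParams, key: string): number | null {
  const raw = params.get(key);
  if (raw === null) return null;
  const value = Number(raw);
  return Number.isFinite(value) && value > 0 ? value : null;
}

// `isMobileSafari` would come from platform detection (e.g. phet-core's
// platform.mobileSafari); it is a plain argument so the sketch runs anywhere.
function chooseRendererSettings(isMobileSafari: boolean, deviceRatio: number, search: string): RendererSettings {
  const params = new URLSearchParams(search);
  const antialias = readBoolean(params, 'threeRendererAntialias') ?? !isMobileSafari;
  const scale = readPositiveNumber(params, 'threeRendererPixelScale') ?? (isMobileSafari ? 0.9 : 1);
  return { antialias, pixelRatio: scale * deviceRatio };
}
```

The result would feed `new THREE.WebGLRenderer( { antialias } )` and `renderer.setPixelRatio( pixelRatio )`. Antialiasing can only be chosen when the WebGL context is created, whereas the pixel ratio can be changed later.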
I also hypothesized that mobile Safari sometimes has a hidden memory state, and that one crash somehow "clears" it, so that once it crashes (perhaps on first launch), it auto-reloads and has plenty of memory to continue.
I’m convinced that using a single WebGL canvas would be a good memory saving, but I’m also concerned that it could be time-consuming and/or a source of complexity or bugs. But that’s probably just uncertainty, since I don’t know how it would be implemented or what the tradeoffs are.
The next time we begin a new WebGL-intensive simulation, I recommend we investigate memory limits and crashing behavior on devices early on. Developing a new simulation with a shared canvas may be easier than retrofitting one?
Here's a screenshot from the iPad 7 with antialiasing disabled and a 0.9x pixel ratio:
By the way, we also observed that with these changes, the frame rate when dragging a block on the iPad is significantly higher. It went from roughly 25fps to roughly 60fps and feels much snappier. Dragging the boat or bottle remains slow, though.
In Faraday's Electromagnetic Lab, WebGL had to be disabled on mobile Safari to prevent crashing, see https://github.com/phetsims/faradays-electromagnetic-lab/issues/182#issuecomment-2168713207
In Gas Properties, the same thing happened, see https://github.com/phetsims/gas-properties/issues/289#issuecomment-2246277100.
Today I observed that neither unbuilt nor built buoyancy launched on the iPhone 15 Pro Max, see https://github.com/phetsims/density-buoyancy-common/issues/315
@KatieWoe also identified rendering issues on the iPad, which is puzzling, since it means it must have run on the iPad at least a bit. https://github.com/phetsims/density-buoyancy-common/issues/230
@arouinfar can you please advise?