shorebirdtech / shorebird

Code Push for Flutter and other tools for Flutter businesses.
https://shorebird.dev
Other
2.34k stars 142 forks source link

fix: low link percentages on iOS #1892

Closed eseidel closed 3 months ago

eseidel commented 7 months ago

Creating a meta-bug to track our various reports of low-link percentages on iOS (low % on CPU).

Ones we're tracking:

We're also working on https://github.com/shorebirdtech/shorebird/issues/1820 so it's easier for devs to help us see what specific cause their app might be hitting.

eseidel commented 7 months ago

Update: shorebird patch records the link percentage it sees in our database (along with the xcode version and whether the force flags were passed). When we last looked at the number (on Monday April 7th), right before 1.0 release) we saw about 1/3rd of patch commands getting 99% linking (expected), about 1/3rd getting 20-99% (not great, but possible), and 1/3rd getting <20% (not expected or good). We'll pull the numbers again on Monday and see if that has held through the 1.0 release. Seeing this we immediately began working on link percentage improvements after 1.0 launched. We were aware of https://github.com/shorebirdtech/shorebird/issues/1825 at the time. When we went to investigate https://github.com/shorebirdtech/shorebird/issues/1825 we built a bunch of linker tests and found that removing a constant from the object pool (where dart stores the constant values in the snapshot) caused linking to drop precipitously, so we went after that first. It was a one line change to fix (which we had in about an hour). However it caused our VM tests (about 10,000 tests) to link differently (better) and revealed a couple more crashers in the runtime which we spent the last 3 days fixing. The work was pretty much done as of Friday night, just needed one more round of testing before releasing (was too late to go out Friday) so will go out Monday. That should improve linking across the board, and resolve 2 other issues we found (one crash when building with--dwarf-stack-traces, and one issue where catch clauses would sometimes use the wrong constant values).

After that goes out on Monday, I expect https://github.com/shorebirdtech/shorebird/issues/1825 will take us most of this coming week to resolve, hopefully that will be out by ~April 20th and I would then expect patches to see link rates above 90% about 90% of the time (back of the napkin).

eseidel commented 7 months ago

The 10-15% link improvement patch didn't go out today (some last minute further fixes were needed). But is planned to go out tomorrow (with a bunch of other nice fixes) in 1.0.1. https://github.com/shorebirdtech/shorebird/pull/1901

Once that's out, we'll start working on https://github.com/shorebirdtech/shorebird/issues/1825 which is our last known linker issue which we believe to be responsible for the vast majority of any remaining incidence of low link percentage (which results in slowness after patching on iOS).

eseidel commented 7 months ago

We found another couple bugs as part of our 1.0.1 work which delayed things further and we didn't get the release out yesterday. In the process of releasing now. https://github.com/shorebirdtech/shorebird/issues/1916

eseidel commented 7 months ago

We shipped 1.0.1 yesterday which included several fixes for iOS, including a linker fix which should improve linking by 10-15% for most patches.

Felix also investigated our remaining cases with low link percentages and found that they do in-fact seem to be caused by changes to the class table, and we'll begin working on a solution tomorrow. It will probably take us a couple weeks to fix, but should be similar to the fix we had to do for the ObjectPool last month.

Erick and Felix also completed a change today which adds an option to shorebird patch to collect linker debugging information. That will probably go out next week sometime and will help us when working with customers who see any unexpected drops in link percentage. After this upcoming class table fix, I expect that 90% of customers will see link percentages above 90% and we will probably remove displaying of the link percentage at all during linking (at least when the value is above 90% as expected).

eseidel commented 6 months ago

Update: Felix has identified that there are actually two related, but separable fixes remaining. One is sorting the ClassTable, a second is sorting the DispatchTable. He's working on the ClassTable sorting now, we expect we'll ship it either this week or early next. The dispatch table one will be another couple weeks. The ClassTable fix should improve linking by ~25 percentage points in affected snapshots. The DispatchTable fix will be the rest of the change and should get most patches well into 90% linking.

eseidel commented 6 months ago

Update: Felix got a new linker pipeline wired up to sort the ClassTable today. It seems to mostly work but had at least one bug which stopped us from landing the patch today. I suspect it's only another couple days and we'll have a fix for sorting the class table. I'm hopeful that the DispatchTable sorting after that will go much faster since we've (again) learned so much as part of this ClassTable work and the DispatchTable is highly related. We'll see. Expect an update again from me around mid week next week.

eseidel commented 6 months ago

Update: Felix has a working solution for ClassTable sorting. (We still have to do DispatchTable sorting after this.) We expect to release the ClassTable sorting today (or possibly tomorrow), which should cause link percentages to jump as much as 50 percentage points for affected patches.

eseidel commented 6 months ago

ClassTable sorting is done and landed. Much of the team is traveling on Monday so we're unlikely to get a release out, but we will definitely get it shipped by Tues.

eseidel commented 6 months ago

1.0.5 contains the ClassTable sorting and will be released in a few minutes: https://github.com/shorebirdtech/shorebird/pull/2022

We will then re-assess to see what sort of link-percentages are being reported by users and then likely fix the GlobalDispatchTable sorting (which is the last known reason for link percentages to be low).

felangel commented 6 months ago

1.0.5 is now out and includes the class table sorting 🎉 Let us know if you encounter any issues.

eseidel commented 6 months ago

Update: Last week the entire team gathered in person and we intentionally spent that time fixing other (smaller) issues together (also in part on-boarding Erick onto the team). Felix resumed work on this issue earlier this week and has made good progress (he's built a bunch of tooling around debugging the Global Dispatch Table, but not yet built the code to sort it). We're hopeful we can ship a fix for the Global Dispatch Table sorting (the last known remaining issue causing low link percentages) in the next week or two.

eseidel commented 5 months ago

Update: Felix and I have stared at this for a few more days. It's quite complicated. Our current understanding:

The Dispatch Table is described a bit here: https://github.com/dart-lang/sdk/blob/9077bf991f3471d61c8c3b3584c8ac11a7a40bcd/runtime/docs/README.md#global-dispatch-table-gdt

It's basically a 2 dimensional array, keyed by cid and selector_id. I initially thought that it was primarily grouped by cid, but it turns out it's grouped first by selector_id. (e.g. all of toString ends up in a contiguous block, and then it's just the cid offset within that block to figure out where your classes toString jumps to). There is also some fancy compression in that they don't use global selector ids, rather they will re-use selector ids, to allow sharing space between classes.

In any case, this makes the Dispatch Table very sensitive to the size of the class table. Which is why our current testing "delete class" is linking at 99% where as "add class" is linking at 70%. This is because in the "delete class" case, we end up sorting the class table (what we did last month) to end up with a place-holder for the deleted class (to keep the rest of the class ids the same) and thus the dispatch table is largely unaltered. In the add class case, we end up with the patch's class table larger than the base's. which ends up with a dispatch table with a different "stride" or row-size.

The way that dispatch table lookups work is that there is a "selector offset" (e.g. row-start) compiled into the instruction stream as a literal, and then the lookup is done as selector_offset + cid. The selector offsets are relative to a bias which is there just to take advantage of the extra one bit provided by signed math (since the instructions are signed).

In any case, we're making progress. This stuff is hard (harder than I thought), but we're trying a theory now and hope to have something shipping to fix this in another week or so.

eseidel commented 5 months ago

Update: Felix now has a proof-of-concept fix, which moves our "add a class" benchmark from 60% linking to 100% linking and our "add an empty class" benchmark from 30% linking to 80% linking. He's still debugging why the empty variant is getting confused (it has something to do with triggering different code-paths in the type-checking logic). His current proof-of-concept is likely to increase snapshot sizes and memory usage, so it may be a few days yet of tuning before we release it, but good to know that we're nearly done fixing this last big iOS issue. 😮‍💨

eseidel commented 5 months ago

We made a lot of progress yesterday. Today is Friday, so we'll probably not release a fix today, but I think we might have at least a small bump to this next Monday or Tuesday. We're close now, we think we see all the issues we need to fix, we just need to write production-ready fixes for such.

molestein commented 5 months ago

We made a lot of progress yesterday. Today is Friday, so we'll probably not release a fix today, but I think we might have at least a small bump to this next Monday or Tuesday. We're close now, we think we see all the issues we need to fix, we just need to write production-ready fixes for such.


Thank you guys for all the hard work! we truly appreciate your dedication!

eseidel commented 5 months ago

Planned with Felix again this morning. We have 3 ideas for what to improve, we're tackling moving our "object pool" mapper to a less-rigid system today and hope to ship that. Our fixes to the "global dispatch table" will likely have to wait until next week. We also have a 3rd plan to potentially pad the class table to make changes to the global dispatch table a bit less disruptive, but that we will also likely explore later this week early/next. I'm hoping we can get the object pool mapping changes polished up to ship today. Those should change our linker to be less "brittle" (it will do slightly less well on our small unit tests, but hopefully better in the wild based on what data we've seen so far).

eseidel commented 5 months ago

Felix landed a small fix today which will go out in 1.1.9 tomorrow morning. It improves link percentages a small amount across the board, particularly for larger changes. He did more prototyping today and has a set of 3 small fixes he's pursuing which we believe that when combined may solve this issue. We're very close, just a few days of work I think. 🤞

eseidel commented 5 months ago

Update: Felix landed anther fix on Friday (tweaking the dispatch table sort to use large row lengths) which improved linking in Flutter examples by ~20%. We took a break from linking on Friday to work on https://github.com/shorebirdtech/shorebird/issues/1986, which we've had two reports of from the wild, and were able to reproduce in our tests on Friday. I expect we'll spend Monday fixing that and then the rest of next week shipping Felix's Dispatch Table fixes. Hopefully this bug will be "done" as soon as end of next week. 🤞

eseidel commented 5 months ago

The fix that we shipped this week for the dispatch table, made the rows regularly spaced. Dart's "dispatch table" is used for deciding what function implementation to run when encountering a non-nullable, non-dynamic function dispatch. (dynamic and nullable dispatches are handled differently). The choice is "for given class A, what subclass implementation of should we use".

Take for example "toString", it's implemented on Object, so whenever toString is called the VM has to decide which classes implementation of toString to use (since if you have A : B : Object, either A's, B's or Object's toString could be the right answer). The naive implementation of such a table requires a lookup in a table which is number_classes long, and number_unique_selectors (method names) wide. That would use a lot of space for large apps, so they have a fancy compression scheme where instead of looking up a row (which selector) and then the cell within that row (which class), they pre-compute row offsets (essentially row starts) and do so in a manner whereby they interlace the actual storage of the rows to avoid wasting space for classes which do not actually ever use that selector (or even have it implemented).

The problem for that with Shorebird, is that small changes to a dart program can cause large changes to these offsets, causing large sections of the resulting compiled program to be unsharable (low link percentage). This is the root cause of all of our current linking troubles (this bug).

The partial solution that Felix implemented last week was to teach the "dispatch table builder" to hand out regular offsets (always rounding to a power of two), basically disabling some of the compression logic. This made it much less sensitive to small changes in the Dart program, since there was already some "padding" built into the table now to handle small perturbations. However that's not a full fix.

The full fix will involve us actually sorting these rows into a consistent order (or importantly making sure that when building a "patch" we sort into the same row order that was used when building the "release"). We know exactly how we're going to do that, it just will require us to plumb some extra information around through the Dart compiler to make it happen. Should probably take us a couple days this week to get it working 🤞 and we'll then know if that was really the "last" bug blocking Shorebird from consistently "linking" at 99%+ (and thus running 99% of patch on the CPU rather than the interpreter). I'm hopeful. Thank you all for your continued patience.

eseidel commented 5 months ago

We fixed https://github.com/shorebirdtech/shorebird/issues/1986 today, which we will release tomorrow (along with a small fix that should improve link percentages on iOS a little). Then we'll resume work on the dispatch table in the afternoon and hopefully have a "sort the dispatch table" fix out as soon as weds afternoon. 🤞

eseidel commented 5 months ago

Update: Felix discovered that https://github.com/shorebirdtech/shorebird/issues/2237 is actually a regression from some of the speedups we did (specifically making the Global Dispatch Table use power of 2 row offsets) last week, so he landed a revert of that today (which will probably go out tomorrow). He also wrote up a patch last week to disable function inlining in Dart, which is a massive linking win at the cost of some peak performance on microbenchmarks for release builds. That will also be going out later this week.

eseidel commented 4 months ago

Update: Felix has a working dispatch table sorting change. We're working on the mechanics of actually shipping it (it requires changing the dart front-end compiler which was not a part of Dart we'd previously changed before, so we need to fix up a bit of our deployment infrastructure first to support the additional artifact). I expect we'll ship his change Weds or Thurs of this week and then look into the black/white screen hangs we've seen a few reports of the last few weeks. I expect that his coming dispatch table sorting change will be a very large improvement in link percentages. We believe there is more we can get from that change, but we're getting out a simple version first and then will tune it in subsequent releases.

eseidel commented 4 months ago

We believe this is now fixed as of last Friday's release. You've not seen us make noise about it yet, because we're investigating https://github.com/shorebirdtech/shorebird/issues/2219 which appears to have crept in sometime in the last couple months.

Regardless, we now believe that customers using Shorebird v1.1.15 with Flutter 3.22.2 should be getting consistently high (90% or better) link percentages.

Once we've had a few more days/weeks of data from the field we will remove the link percentage message all together, as it should no longer be an issue.

eseidel commented 4 months ago

Felix thinks we have a little bit more to do here (he still has a couple cases which he thinks are unexpectedly low), so leaving this open for now.

eseidel commented 4 months ago

Felix wants to make one more fix before we close this. But this issue should be essentially "done" after the 1.1.16 release just now (requires a fresh shorebird release with Flutter 3.22.2 or later), we just have one more fix which should make it much less likely for "large" patches to have lower-than-expected link percentages.

eseidel commented 4 months ago

We believe this is fixed with 1.1.18 + Flutter 3.22.2. Please let us know if you see any other cases where you're seeing a low link percentage. (We've seen one or two reports of low link percentages since 1.1.18, but we expect we'll track those as separate bugs if verified. So far our metrics from the field report and average of 97.4% link for patches.)

Thank you all again for your patience!

eseidel commented 4 months ago

We got a couple reports on Friday from larger apps trying iOS and @felangel believes there is at least one more small fix to make here.

eseidel commented 3 months ago

Update: Felix has been working with some large apps to diagnose why they've seen occasional low-link percentages. He's figured out that the problem seems to be our ObjectPool sort isn't being fed enough data (or is discarding too much of the data it is being fed). He has a reduction of one such bad case, where adding a single Widget subclass is hitting an 80% link (where as other similar changes are 99%), which he's investigating now. He's optimistic we'll ship a fix for at least one such case early next week.

eseidel commented 3 months ago

Felix has found the issue which we believe is causing the unexpected link drops. Our linker had a case where it was sometimes mis-analyzing the changed dart program. It turns out this error was subtle (in a low-level of the system), but non-fatal (wouldn't ever cause a crash, just worse code sharing) so we didn't catch it sooner. Anyway, Felix has isolated it and is working on a fix now. I expect we'll ship a fixed version on Monday.

eseidel commented 3 months ago

We have a local fix, we're doing some final testing and then hope to ship today or tomorrow.

eseidel commented 3 months ago

Finally! We believe this is now resolved when releasing for iOS with Flutter 3.22.3 and above from Shorebird 1.1.22 and above.

In the last week Felix found some major confusion in our "linker" code where it was comparing parts of the "release" and "patch" that it wasn't intending to and thus causing cascading linking failures for even simple changes. How this bug lasted for so long was that it was non-fatal (nothing ever crashed) and hard to predict when it would happen (some changes would be fine, others would trigger this), but he improved our debugging tools and tracked it down and this should greatly improve link percentages in the wild. Once we see that users are all getting 90% or better linking like we expect, we'll remove the "link percentage" method in its entirety (users should never have to worry about "link percentage" patches should "just work")!

amrebada commented 3 months ago

I am still facing the same issue that causes slow in the IOS app after patch installing

Screenshot 2024-08-08 at 5 58 31 PM

command results

shorebird --version

Shorebird 1.1.25 • git@github.com:shorebirdtech/shorebird.git
Flutter 3.22.3 • revision 19f36dba936d7c1406a470366a3a936f1242eeb5
Engine • revision cbc4b41c3a26db5e2b72a10425ba9669c2ddfae4

flutter doctor -v

[✓] Flutter (Channel stable, 3.22.3, on macOS 14.6 23G80 darwin-arm64, locale en-AE)
    • Flutter version 3.22.3 on channel stable at /Users/<username>/flutter/flutter
    • Upstream repository https://github.com/flutter/flutter.git
    • Framework revision b0850beeb2 (3 weeks ago), 2024-07-16 21:43:41 -0700
    • Engine revision 235db911ba
    • Dart version 3.4.4
    • DevTools version 2.34.3
...
[✓] Xcode - develop for iOS and macOS (Xcode 15.3)
    • Xcode at /Applications/Xcode.app/Contents/Developer
    • Build 15E204a
    • CocoaPods version 1.15.2
...
[✓] Network resources
    • All expected network resources are available.
eseidel commented 3 months ago

That you very much for reaching out here as well as connecting over discord @amrebada. We'll track any fixes needed for your app in https://github.com/shorebirdtech/shorebird/issues/2406. Thanks!

eseidel commented 3 months ago

We've spent the last couple weeks assisting existing Android customers in adopting Shorebird in their iOS builds as well. We've found a couple issues, including a couple link-percentage issues.

The current one we're chasing suggests that changes that involve adding/removing usages of class static fields might end up causing unexpectedly low link percentages. Felix is addressing as we speak and we'll likely have a fix out early next week.