mozilla-mobile / perf-frontend-issues

A repository to hold issues related to front-end mobile application performance.

Research how other apps measure startup performance #97

Closed mcomella closed 4 years ago

mcomella commented 4 years ago

As a follow-up to our future-of-cold-startup discussions...

Instead of reinventing the wheel, let's find out how other apps measure cold startup (and, briefly, how they measure non-startup performance).

We should double-check the licensing, but Marc mentioned that Chrome uses something called catapult: https://chromium.googlesource.com/catapult/

mcomella commented 4 years ago

I completed my research. In general, it was not easy to find 1) which metrics apps record to measure startup and 2) which harnesses they use to run the application; because that information wasn't straightforward to find, I'm assuming the useful information that is publicly available is limited. Furthermore, I realized that our requirements are more intensive than most other developers' because we not only want to catch regressions, we also want to measure against other benchmarks (e.g. Fennec).

Here are a few solutions I found and who uses them:

I think our best bet is moving forward with browsertime. However, I also found great resources that demonstrate how to set up these tests (e.g. removing noise and outliers), especially resources targeted at Android. Those resources are:

In particular, we may want to look at:

Facebook managed to get their perf tests to minimal noise at 50 trials, which is better than we're doing (though we are measuring startup, which is a very complicated use case). Their tests run hourly and they have automatic bisection on regressions.
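To make the noise-reduction idea concrete, here is a minimal sketch of aggregating repeated trials with simple outlier rejection before comparing against a baseline. The function names, the Tukey-fence outlier rule, and the 5% regression threshold are all illustrative assumptions, not something taken from Facebook's or Google's tooling:

```python
# Hypothetical sketch: aggregate repeated startup trials and reject outliers
# before comparing against a baseline. Thresholds are illustrative.
import statistics

def reject_outliers(samples_ms):
    """Drop trials outside 1.5 * IQR of the middle 50% (Tukey's fences)."""
    ordered = sorted(samples_ms)
    q1 = ordered[len(ordered) // 4]
    q3 = ordered[(3 * len(ordered)) // 4]
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [s for s in samples_ms if lo <= s <= hi]

def summarize(samples_ms):
    kept = reject_outliers(samples_ms)
    return {
        "median_ms": statistics.median(kept),
        "stdev_ms": statistics.stdev(kept) if len(kept) > 1 else 0.0,
        "trials_kept": len(kept),
        "trials_total": len(samples_ms),
    }

def is_regression(current, baseline, threshold=0.05):
    # Flag a regression if the median moved more than 5% vs. the baseline;
    # an hourly job could run this and kick off bisection on a hit.
    return current["median_ms"] > baseline["median_ms"] * (1 + threshold)
```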

mcomella commented 4 years ago

I additionally took a look at the Chromium source to understand how they measure performance. I found some interesting docs:

I also found the source code of their mobile startup benchmark and the back-end implementation for Android. I didn't learn much (there are many layers of abstraction), but it appears that, like Facebook, they also wait out device throttling before measuring (source). I could dig in further to see what else they do, but I don't think it's worth the time: I think the high-level MobileLab post from Facebook and the Google IO Jetpack Benchmark talk from Google probably cover what we would learn (at least to the 80-20 rule).
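As a rough illustration of what "waiting out throttling" between trials could look like in a harness, here is a sketch that polls the device's battery temperature over adb and blocks until it drops below a ceiling. The 35 °C threshold and the poll interval are assumptions; this is not how Chromium or Facebook actually implement it:

```python
# Hypothetical sketch: wait for the device to cool down before a trial so
# thermal throttling doesn't skew results. `dumpsys battery` reports the
# temperature in tenths of a degree Celsius.
import re
import subprocess
import time

def battery_temp_c():
    out = subprocess.run(
        ["adb", "shell", "dumpsys", "battery"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(re.search(r"temperature:\s*(\d+)", out).group(1)) / 10.0

def wait_until_cool(max_temp_c=35.0, poll_seconds=30):
    while battery_temp_c() > max_temp_c:
        time.sleep(poll_seconds)

wait_until_cool()  # call between trials before starting the next measurement
```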

Conclusions

We should continue with our current approach to startup measurement (a browsertime test harness with some extraction from FNPRMS), but we should leverage the lessons from Facebook's MobileLab blog post and the Jetpack Benchmark IO talk to identify steps we should take to reduce noise in our pipeline.
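For reference, the core of a single cold-start trial in such a harness can be very small. A minimal sketch using `adb shell am start -W`; the Fenix package and activity names are assumptions for illustration, and this is not browsertime's or FNPRMS's actual implementation:

```python
# Hypothetical sketch of one cold-start trial: force-stop the app so the
# next launch is cold, launch it with `am start -W`, and parse TotalTime.
import re
import subprocess

PACKAGE = "org.mozilla.fenix"          # assumed target package
ACTIVITY = f"{PACKAGE}/.HomeActivity"  # assumed launch activity

def cold_start_ms():
    subprocess.run(["adb", "shell", "am", "force-stop", PACKAGE], check=True)
    out = subprocess.run(
        ["adb", "shell", "am", "start", "-W", "-n", ACTIVITY],
        capture_output=True, text=True, check=True,
    ).stdout
    # `am start -W` prints e.g. "TotalTime: 487" (milliseconds).
    return int(re.search(r"TotalTime:\s*(\d+)", out).group(1))

samples = [cold_start_ms() for _ in range(50)]  # 50 trials, per Facebook's bar
```

A real harness would wrap trials like this in the cooldown wait and the outlier-rejecting aggregation sketched in the earlier comments.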