mozilla-mobile / perf-frontend-issues

A repository to hold issues related to front-end mobile application performance.

Validate mach perftest VIEW against FNPRMS #141

Closed mcomella closed 3 years ago

mcomella commented 4 years ago

acreskeyMoz has validated that mach perftest performance is roughly comparable to FNPRMS performance as we currently run it. We should additionally ensure they're comparable from a front-end perspective.

In order to do this, we'll need to:

Then the results should be comparable. If not, investigate why not (with acreskey)!


edit: For a summary of the issues we investigated in this bug, see the mid-way Summary.

mcomella commented 4 years ago

edit: I updated the numbers and struck through an invalid theory. I had made a manual adjustment to a FNPRMS measurement that I had forgotten about and needed to revert.

mach perftest in FNPRMS parser:

FNPRMS:

My current theory is that mach perftest is delayed because we wait for marionette to attach before continuing the test. FNPRMS with set-debug-app similarly attaches marionette but, since it can't connect to the remote server, I suspect it waits longer and times out.

N.B. the current Nightly logs in a different way, so FNPRMS parsing is broken: "3" needs to be changed to "2" on these lines: https://github.com/mozilla-mobile/FNPRMS/blob/3389f021f7e0cc91b9205a9972cebc507e32398f/times.py#L175-L181 Furthermore, there are additional changes to FNPRMS logging (e.g. app IDs) that I've made locally but that I don't think have landed upstream yet.
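As a purely hypothetical illustration of that kind of breakage (FNPRMS' real parser is in the times.py lines linked above): if a Nightly change adds or removes a whitespace-separated field in the logcat line being parsed, a hard-coded token index has to shift by one.

```python
# Hypothetical example only -- not FNPRMS' real log format or parser code.
old_line = "ExampleTag: fenix extra_field 1290"  # value at token index 3
new_line = "ExampleTag: fenix 1290"              # value at token index 2

def parse_ms(line: str, index: int) -> int:
    # A hard-coded token index like this is why a log format change turns
    # a "3" into a "2" in the parser.
    return int(line.split()[index])

assert parse_ms(old_line, 3) == parse_ms(new_line, 2) == 1290
```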


I additionally tried a few things and saw no significant impact:

mcomella commented 4 years ago

edit: Struck through b/c I used the wrong numbers; see above.

There is also this log from perftest:

1598906053729 mozdevice DEBUG execute_host_command: >> "shell:am start -W -n org.mozilla.fenix/org.mozilla.fenix.IntentReceiverActivity -t text/html -a android.intent.action.VIEW -d https://example.com --es args -marionette\ -profile\ /mnt/sdcard/org.mozilla.fenix-geckodriver-profile"

1598906054307 mozdevice DEBUG execute_host_command: << "Starting: Intent { act=android.intent.action.VIEW dat=https://example.com/... typ=text/html cmp=org.mozilla.fenix/.IntentReceiverActivity (has extras) }\nStatus: ok\nLaunchState: COLD\nActivity: org.mozilla.fenix/.HomeActivity\nTotalTime: 493\nWaitTime: 494\nComplete\n"

The discrepancy between FNPRMS and perftest is around 310ms – I wonder if this log's WaitTime of 494ms (units?) is related.
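For reference, a minimal sketch (Python driving adb, omitting the marionette/profile extras from the logged command) that reproduces this launch and extracts the two timings; `am start -W` reports TotalTime/WaitTime in milliseconds:

```python
import re
import subprocess

# Reproduce the VIEW launch from the mozdevice log above and parse the
# ActivityManager timings (reported in ms) out of the `am start -W` output.
cmd = [
    "adb", "shell", "am", "start", "-W",
    "-n", "org.mozilla.fenix/org.mozilla.fenix.IntentReceiverActivity",
    "-t", "text/html",
    "-a", "android.intent.action.VIEW",
    "-d", "https://example.com",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
print(dict(re.findall(r"(TotalTime|WaitTime):\s*(\d+)", out)))
# e.g. {'TotalTime': '493', 'WaitTime': '494'}
```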

mcomella commented 4 years ago

I did a comparison over 15 runs on my Pixel 2 with Nightly 200901 using mach perftest w/ the added delays and conditioned profiles through FNPRMS' log parser vs. FNPRMS with conditioned profiles:

The difference is within the noise, especially since this test loads a page and FNPRMS restores its own session instead of what's in the profile, so I feel FNPRMS and mach perftest are roughly equivalent for VIEW with the added changes. However, we still have additional investigation to do before replacing FNPRMS:

mcomella commented 4 years ago

I got numbers for the GS5 (limited to 4 runs b/c mach perftest overflows the logcat output):

These would probably be close enough with enough runs (I expect this device to be noisier than the P2, too).

acreskeyMoz commented 4 years ago
  * What delay should we actually add to mach perftest? Balance noise reduction + runtime

To determine the optimal number, I would push a few options to try, i.e. ./mach try fuzzy --full and then select the VIEW tests. We can then compare the results and see the impact.

  * How many iterations does mach perftest run in CI? How many should it run? Balance noise reduction + runtime

Right now it's 14 per test. https://searchfox.org/mozilla-central/rev/84922363f4014eae684aabc4f1d06380066494c5/taskcluster/ci/perftest/android.yml#61 We chose this some time ago as the results seemed to stabilize (sorry, I don't have the data handy). Because the additional iterations don't take much time compared to test setup, I think we should increase this if we can demonstrate that it produces more stable results. Already, the results are sufficiently stable to sheriff.

* More complex modifications to mach perftest

  * Why are conditioned profiles slower in mach perftest than no conditioned profiles?

That is a very good question, and I think it's likely to be something on the platform side. I'd like to take this on as a task.

  * What should we do about performance tuning?

This bug tracks what we saw with our current performance tuning: https://bugzilla.mozilla.org/show_bug.cgi?id=1649511 Greg did a nice job analyzing the noise. We can make changes to the perf tuning specifications and push them to try. But the fact that the current G5 tuning helps pageload might make this problematic to optimize for both cases.

  * The logcat is missing data on my GS5: it should take logcat more regularly if we want it all in the artifacts (currently, perftest results capture correctly but I can't send it to FNPRMS without the full logs)

I wonder if this is related to the --android-clear-logcat perftest option? https://searchfox.org/mozilla-central/rev/84922363f4014eae684aabc4f1d06380066494c5/python/mozperftest/mozperftest/tests/test_android.py#235

* set-debug-app & friends

  * What's the performance impact of `set-debug-app`, assuming we don't add any code that changes behavior (which currently happens)?
  * What's the performance impact of the code that `set-debug-app` inspires? Can we do better? There's a 756ms difference between FNPRMS without the flag & with the flag + conditioned profiles

Yes, I'm very concerned about this one in particular. It might be trickier to investigate without a rooted device. Let me know if I can help.

* MAIN: goes directly to onboarding instead of the homescreen ([mozilla-mobile/fenix#13470](https://github.com/mozilla-mobile/fenix/issues/13470) ?)

Let me run this locally and I'll see what I can find.

mcomella commented 4 years ago

edit: reduced action items in https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-689174357

I regrouped the action items from https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-685204286 into more actionable focus areas and bolded the ones I think we should address in this issue, before replacing FNPRMS VIEW with mach perftest:

Accuracy of results

some can be done later, as long as we can still catch regressions; note: results not comparable to other apps & may see perf characteristics different from real devices

Fix MAIN

want to get VIEW working first...

Noise reduction

can be done later as long as current noise is tolerable; what's the current delta between runs?

Testing artifacts

not that important right now

edit: acreskeyMoz also mentioned:

mcomella commented 4 years ago
  * The logcat is missing data on my GS5: it should take logcat more regularly if we want it all in the artifacts (currently, perftest results capture correctly but I can't send it to FNPRMS without the full logs)

I wonder if this is related to the --android-clear-logcat perftest option?

I'm assuming it's because the logcat logs are pulled at the end and my device has a small maximum logcat buffer – I had the same problem with FNPRMS on this device and had to rewrite the code to pull the logs between runs.
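A sketch of that workaround (plain adb via Python, nothing FNPRMS-specific): dump and clear the device's logcat ring buffer between iterations so earlier runs can't be evicted before the final pull.

```python
import subprocess

def pull_and_clear_logcat(out_path: str) -> None:
    """Append the current logcat buffer to out_path, then clear it."""
    # `adb logcat -d` dumps the buffer and exits; `-c` clears it so the next
    # iteration starts with an empty (small) ring buffer.
    log = subprocess.run(["adb", "logcat", "-d"], capture_output=True,
                         text=True, check=True).stdout
    with open(out_path, "a") as f:
        f.write(log)
    subprocess.run(["adb", "logcat", "-c"], check=True)
```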

acreskeyMoz commented 4 years ago
* [acreskey] MAIN goes to onboarding ([mozilla-mobile/fenix#13470](https://github.com/mozilla-mobile/fenix/issues/13470) ?)

Locally, on Pixel 3 and Moto G5, I'm seeing ./mach perftest skip onboarding and correctly measure MAIN results, e.g. [1437.0, 1381.0, 1345.0]

If I disable the performancetest intent arg, then I see it launch to onboarding. https://searchfox.org/mozilla-central/rev/2b250967a66886398e5e798371484fd018d88a22/testing/performance/hooks_android_main.py#16-17

So I think this looks like it's https://github.com/mozilla-mobile/fenix/issues/13470 Other developers (mattwoodrow, I believe) also had issues with the feature in local testing. We can verify by having someone who can reproduce the problem remove the conditions around the test.
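A hedged way to check this by hand: launch Fenix with the boolean "performancetest" intent extra (the arg the hook above injects; the extra name comes from this discussion) and watch whether onboarding is skipped. Targeting HomeActivity is my assumption here, based on the am-start log earlier in this thread.

```python
import subprocess

# Launch Fenix with the "performancetest" boolean extra; the target activity
# is an assumption -- adjust to match the MAIN test's actual launch target.
subprocess.run([
    "adb", "shell", "am", "start", "-W",
    "-n", "org.mozilla.fenix/.HomeActivity",
    "--ez", "performancetest", "true",
], check=True)
```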

mcomella commented 4 years ago

add small delays to get perftest close enough to FNPRMS

Referencing the P2, I don't think adding a delay is necessary with conditioned profiles, which seem to add a large delay between runs anyway that gives the device enough time to settle. This is fragile if we decide not to use conditioned profiles, however. Here are the numbers:

There's a lot of variance in these results (we are loading live pages). With a reduced delay between tests, I'd expect the results to slow down (as the device is heat throttled), but that's not what we're seeing, so I feel the consistency of the results I've seen locally and that we saw the other day also validates these numbers.


On the GS5, I see something similar – I'd expect the results to get longer without the delay, but they're roughly the same, vaguely within the noise:

An interesting pattern I saw is that each run on this device will increase in time:

PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 14487782.1, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [14368011, 14393805, 14419708, 14449037, 14476338, 14501471, 14528300, 14554244, 14580482, 14606425], "lowerIsBetter": true, "value": 14487782.1, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "firefox"}}

Second run:

PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 15276387.25, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [15237875, 15263327, 15289057, 15315290], "lowerIsBetter": true, "value": 15276387.25, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "firefox"}}
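A quick diff over the first run's replicates (data from the PERFHERDER_DATA above) makes the pattern visible: each measurement is roughly 26 seconds higher than the previous one.

```python
replicates = [14368011, 14393805, 14419708, 14449037, 14476338,
              14501471, 14528300, 14554244, 14580482, 14606425]
# Successive differences: every iteration climbs by ~25-29s.
print([b - a for a, b in zip(replicates, replicates[1:])])
# [25794, 25903, 29329, 27301, 25133, 26829, 25944, 26238, 25943]
```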

I added this to the action items above.

mcomella commented 3 years ago

perf impact of set-debug-app without code that leverages it?

I built a custom GV that returns false from isApplicationCurrentDebugApp, the method that gates reading the GV configuration YAML which modifies how GV runs (including the code that enables marionette). The FNPRMS VIEW results show that set-debug-app has no impact on application performance:

This implies the performance impact comes entirely from the custom code we run when set-debug-app is enabled for fenix.

edit: There may be a small impact from set-debug-app, but it seems negligible, and it's hard to measure in this test because it's a live page load (i.e. noisy). I re-ran the numbers after running clear-debug-app and got 1.41; I re-added debug-app and got 1.42.
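For reference, the two configurations A/B-tested here can be toggled with standard ActivityManager commands (sketch below; --persistent keeps the flag set until it is explicitly cleared):

```python
import subprocess

PACKAGE = "org.mozilla.fenix"

def enable_debug_app() -> None:
    # Mark fenix as the debug app; --persistent keeps the setting across
    # launches until clear-debug-app is run.
    subprocess.run(["adb", "shell", "am", "set-debug-app", "--persistent",
                    PACKAGE], check=True)

def disable_debug_app() -> None:
    subprocess.run(["adb", "shell", "am", "clear-debug-app"], check=True)
```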

mcomella commented 3 years ago

Our goal is to replace FNPRMS, which currently functions as a regression detection system. It is not currently used for comparing performance against other applications (including Fennec and Chrome), but it was in the past and will be again later. In the current use case, relative values are all that matter; in the latter, inactive use case, absolute values matter.

With these goals in mind, let's look at the remaining action items https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-685849604 critically:

* [ ]  **[acreskey] conditioned profiles slower than no conditioned profiles**

I think we can do this later: conditioned profiles are used to reduce noise and unexpected changes between runs. For catching regressions, I do not think the default of having them enabled will make a significant difference vs. not using them.

* [ ]  **perf impact of our code leveraging set-debug-app? (756ms diff on P2)**

I think we can do this later: I do not suspect this code will add variation that makes catching regressions harder; it just seems like a constant negative offset to the absolute numbers. We may find false regressions if the performance of this code changes, though.

* [ ]  **investigate if we transmit adb data throughout test? if so, there is a perf impact**

Ideally, we'd look into this: this may introduce variation between runs, e.g. depending on the log statements that are called or just because adb can be expensive.

That being said, due to the logging issue I experienced on my GS5, I suspect we're not doing this and we may be okay to put off this investigation.

* [ ]  **Each run in a test run on the GS5 gets longer** (fixed by perf tuning?)

Ideally, we'd look into this: if each run gets longer on the G5 in CI, we're not getting unbiased results between runs. That being said, if every push has the same behavior, it's possible this is negligible.

* [ ]  **Validate current noise is tolerable**

Briefly investigate: I've been told the noise is acceptable to sheriff which might be good enough for us.

So reduced action items:

mcomella commented 3 years ago

Each run in a test run on the GS5 gets longer (fixed by perf tuning?)

I looked at the performance tests for a recent Treeherder revision (G5 treeherder, G5 perf data 1, G5 perf data 2, P2 treeherder, P2 perf data 1, P2 perf data 2).

A look at the specific perf data shows the results are not increasing and are not impacted by this problem.

mcomella commented 3 years ago

Validate current noise is tolerable

Using the results from the comment above, perftest's results seem just as noisy as FNPRMS', though we do fewer runs in mach perftest.

FNPRMS VIEW diff is 198ms in 10 runs (numbers from debug-app on, with no local modifications to the build, though it reproduces without debug-app too), while it's ~100ms max on the P2 between days (graph; I looked at mid-August numbers due to a recent regression).

I don't think it's worth investigating further but perhaps we want to increase the iteration count to match FNPRMS (we're at 14 on perftest & 25 on FNPRMS).


Another interesting tidbit: on the P2, I see 1538ms locally but this test is 1319ms in CI: I believe something must be configured differently. Possible causes:

acreskeyMoz commented 3 years ago

Another interesting tidbit: on the P2, I see 1538ms locally but this test is 1319ms in CI: I believe something must be configured differently. Possible causes:

* perf-tuning is enabled in CI

* conditioned profiles are disabled in CI

perf-tuning is enabled on P2 in CI (but not G5 since it introduced noise) https://searchfox.org/mozilla-central/rev/b2716c233e9b4398fc5923cbe150e7f83c7c6c5b/taskcluster/ci/perftest/android.yml#90

Conditioned profiles are enabled for both devices: https://searchfox.org/mozilla-central/rev/b2716c233e9b4398fc5923cbe150e7f83c7c6c5b/taskcluster/ci/perftest/android.yml#96-98

mcomella commented 3 years ago

I don't think it's worth investigating further but perhaps we want to increase the iteration count to match FNPRMS (we're at 14 on perftest & 25 on FNPRMS).

Spoke to sparky about noise today: we're concerned about increasing iterations because the tests already take 30-40min to run. However, the theory is that if we keep the iteration count low and run per-commit, we can see where regressions are introduced by looking at multiple runs, rather than getting each commit exactly right. We also have the ability to retrigger to get additional data, making accuracy on every run less important.

I think there's nothing to do here to reduce noise until we actually start looking for regressions.

mcomella commented 3 years ago

investigate if we transmit adb data throughout test? if so, there may be a perf impact

Sparky mentioned we run adb logcat at the end; I can't think of other adb commands we'd be running continually (the ones I can think of all just dump info), so I think we can stop investigating this for this MVP effort.

mcomella commented 3 years ago

Large variation in runtime between local and CI

perf-tuning is enabled on P2 in CI (but not G5 since it introduced noise) https://searchfox.org/mozilla-central/rev/b2716c233e9b4398fc5923cbe150e7f83c7c6c5b/taskcluster/ci/perftest/android.yml#90

I enabled perf-tuning locally and got times of 1624ms, compared to 1319ms on CI. Looking into it...

mcomella commented 3 years ago

I got 1521ms from running the latest nightly-simulation build; perhaps I will try to compare my args and builds against those run in CI to make sure I'm running in an identical situation. Then I suppose I can try comparing the logs.

acreskeyMoz commented 3 years ago

I got 1521ms from running the latest nightly-simulation build; perhaps I will try to compare my args and builds against those run in CI to make sure I'm running in an identical situation. Then I suppose I can try comparing the logs.

Is your local device rooted? If it's not, the perf-tuning will be skipped by the test harness.

Another thing I haven't looked at is the variance from one device to another (e.g. one Pixel 2 to another).

In CI, though, it's 14 iterations on one device from a pool, and we're not seeing a huge amount of noise from device to device.

acreskeyMoz commented 3 years ago
* [ ]  **[acreskey] conditioned profiles slower than no conditioned profiles**

With some work I made a geckodriver that automatically enables Fenix startup profiling.

The root cause of this discrepancy looks to be the scanning of the addons database for changes (which we only see with conditioned profiles).

I've logged this one and I'm following up with the addons folks: https://bugzilla.mozilla.org/show_bug.cgi?id=1664025

mcomella commented 3 years ago

Is your local device rooted?

No. I tried disabling perf tuning on CI instead but I get roughly the same result: 1324ms.

I tried to match the mach perftest arguments from CI but I still get a large value: 1553.8ms. The only one I haven't been able to match yet is --browsertime-geckodriver ${MOZ_FETCHES_DIR}/geckodriver because I don't know how, and I intentionally ignored --android-install-apk fenix_nightlysim_multicommit_arm64_v8a because I already have an APK installed on my device.

mcomella commented 3 years ago

I tried downloading the same build as the per-commit run in the try push above but still got similar results.

I tried using the --browsertime-geckodriver arg by getting a recent version of geckodriver linked by sparky but got the same results. Here's my full arg list:

#!/usr/bin/env zsh

# Invokes the mozperftest runner directly; the commented-out `./mach perftest`
# line is the equivalent entry point.
#./mach perftest \
python3 python/mozperftest/mozperftest/runner.py \
    --flavor mobile-browser \
    --android \
    --android-app-name org.mozilla.fenix \
    --perfherder-metrics processLaunchToNavStart \
    --android-activity org.mozilla.fenix.IntentReceiverActivity \
    --android-clear-logcat \
    --android-capture-logcat logcat \
    --android-perf-tuning \
    --hooks testing/performance/hooks_android_view.py \
    --perfherder \
    --perfherder-app fenix \
    --browsertime-iterations 10 \
    --browsertime-geckodriver /Users/mcomella/Downloads/geckodriver \
    --profile-conditioned \
    --profile-conditioned-scenario settled \
    --profile-conditioned-platform p2_aarch64-fenix.nightly \
    --output artifacts \
    testing/performance/perftest_android_view.js

The only leads I have left are:

mcomella commented 3 years ago

In addition to the leads above, sparky suggested:


I took two profiles to try to understand the root cause of why my perftest runs take longer than CI's perftest:

My thinking is that if I can identify where mach perftest is taking a long time compared to normal start-ups, it might give me hints as to why my runtime is longer than CI's.

mcomella commented 3 years ago

What I got from the profiles so far:

Noise-related?:


acreskey noted:

FYI, whimboo just landed a patch that defers the loading of a whole bunch of JSM imports in marionette. https://bugzilla.mozilla.org/show_bug.cgi?id=1660881#c9 So it will be worth checking to see if we see this in the CI VIEW tests.

It could be that some part of the 155ms I mention above has gone away in the next nightlies.

mcomella commented 3 years ago

I wonder if the addons DB scanning is also related here (200-300ms delay on acreskeyMoz's P2): https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-689794541 I'm not sure if my normal start up was run with a conditioned profile or not.

mcomella commented 3 years ago

I wonder if the addons DB scanning is also related here (200-300ms delay on acreskeyMoz's P2)

I see a 390ms diff in checkForChanges between the two profiles (curiously, acreskey mentions a 200-300ms delay). Combined with marionette's 155ms, that's 545ms. I saw a 470ms diff from a normal start-up to the perftest start-up w/ conditioned profiles: it's possible the cause is these two items. The bulk of marionette occurs after navigation start, though, so never mind that last statement.

The problem I'm trying to understand is why my local runs take longer than CI. I wonder if it could be caused by either of these.

FYI, whimboo just landed a patch that defers the loading of a whole bunch of JSM imports in marionette. https://bugzilla.mozilla.org/show_bug.cgi?id=1660881#c9 So it will be worth checking to see if we see this in the CI VIEW tests.

Now that marionette may be fixed, maybe it's worth re-running and seeing if I still get such a large discrepancy. If it's fixed, I can probably say whether or not marionette caused it and whether it's more likely to be the checkForChanges code (which the discrepancy between the duration on my device and on acreskey's points to).

mcomella commented 3 years ago

(curiously, acreskey mentions a 200-300ms delay).

I think this actually came from my numbers of perftest conditioned vs non-conditioned, not acreskey's P2.

mcomella commented 3 years ago

Because it's Monday and I don't know where I am anymore...

Summary so far, targeting current problems:

Our goal is to replace FNPRMS for:

  1. (now) regression detection
  2. (later) to measure absolute performance against a baseline (Fennec, Chrome, etc.)

For replacing a regression detection system, we care about:

We've learned that the noise appears to be the same between FNPRMS and mach perftest for individual iterations (local measurements) https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-689203275, but an aggregated run has more noise due to the reduced iteration count; we're choosing to do nothing about that at present https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-689773683

Current problem: accurate representation of perf changes

We're still trying to ensure perf regressions or improvements are represented accurately. Here are some key measurements (local on P2 w/ 10 iterations on Nightly 200914 06:05, taken today 9/14, mc b21d31971a86, unless otherwise specified):

| Test | FNPRMS logcat measure | perftest measure |
| --- | --- | --- |
| FNPRMS raw | 1.2906s | |
| FNPRMS cond prof hack | 1.969s (14 days ago; today = crash) | |
| mach perftest cond prof | 2.0689s | 1626.9ms |
| mach perftest non-cond prof | 1.6363s | 1315.3ms |
| [CI - fenix commit 05857ba55] mach perftest cond prof | 1.6466s (logcat) | 1322.1ms (from this Treeherder job) |

From these numbers, we know:

And some concerns/questions:

mcomella commented 3 years ago

Broad approaches

So far, we've been trying to understand what mach perftest is doing differently from FNPRMS to theoretically verify it would accurately represent changes to the code. Instead of, or in addition to, this, we could:

Theories of cause of issues

Consider for later

mcomella commented 3 years ago

acreskeyMoz took numbers on the G5:

Running perftest view locally on my G5 (unrooted, not a personal device): ~3340ms (overall score for a run is ~3300ms to ~3400ms)

CI G5 (rooted): ~3150ms (reproduced over multiple runs) https://treeherder.mozilla.org/perf.html#/graphs?series=try,2611385,1,13&selected=2611385,1219705978

This generally lines up with what we're seeing on the P2.

mcomella commented 3 years ago

acreskey verified that network latency applied to the host machine and the device doesn't seem to impact test time:

This I can do right now quite easily: use my MacBook as the hotspot for the device and throttle my Mac via Network Link Conditioner.

acreskey: So far I don't see much if any impact from pretty severe throttling (i.e. painfully slow to navigate the web):

G5 to MacBook: 3321.85

G5 to MacBook @ 3G throttling (100ms delay up + 100ms down + minimal bandwidth ) 3381.857

I'll crank up the latency and see.

Was there some blocking network call in Firefox startup? Now running at 500ms round-trip latency and I'm seeing similar numbers. So that's good for reproducibility of environments, anyway.

Also, we're seeing different run time for different folks locally on the G5:

acreskey G5 (unrooted, not a personal device): ~3340
CI G5 (rooted): ~3150
mleclair G5 (unrooted, not a personal device): 3999, 3618, 3529.5 – "so that's a fresh install of the app, as per the command, but the numbers are shifting..."

mcomella commented 3 years ago

I added some long-term concerns I've dabbled with here to the future of cold startup meeting agenda.


I previously validated noise on the P2. At MarcLeclair's suggestion, I just checked the noise on GS5/G5:

Seems like FNPRMS and perftest have roughly the same amount of noise, given the difference in test endpoint (navStart vs. pageStop) and device. Seems fine to me.

mcomella commented 3 years ago

I made a try run without conditioned profiles: it is 1120ms, which is faster than the conditioned-profile runs we've been seeing (1322ms https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-692246058), and the difference is consistent with what I see locally.

However, that means CI is still faster for an unknown reason. I suppose our best lead is rooted vs. unrooted or personal phone vs. non-personal phone.

mcomella commented 3 years ago

Next steps:

mcomella commented 3 years ago

We also decided to disable condprof given that it wasn't running what users were experiencing:

acreskey: mcomella: I found an inherent problem with conditioned profiles and the multicommit test: the addon check will run on startup if the Services.appinfo.version doesn't match between profile and binary (this changes with gecko versions). The multicommit fenix test uses one conditioned profile for all of the commits, but if the gecko version changes midway through, the Services.appinfo.version will not match the conditioned profile, thus incurring a slower startup. There may be other problems with using the conditioned profiles, but going back to your options from yesterday:

• Run without condprof
• Run with condprof, knowing we're testing a code path that isn't common
• Fix the bug in automated condprof conditioning

So far I don't think the last is solvable without adding more complexity around this.

acreskey: Although I like conditioned profiles in general, because of these issues I'm personally thinking that it might be best to simply not use them in this use case.

mcomella: I agree that it makes sense to disable them for now – they seem useful, but I'd rather have a simpler test whose limitations we understand (and which better matches user experience) than one we don't really understand. Let's add more layers when we're sure they're improving the outcome 🙂 (which is why FNPRMS was so minimal - we had no time to add layers 😁)
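A hedged sketch of the version check at the root of this: a Gecko profile records the app version it last ran with in compatibility.ini, and a mismatch with the running binary re-triggers startup work like the addon check described above. (Field names per standard Gecko profiles; parsing simplified.)

```python
import configparser

def profile_last_version(profile_dir: str) -> str:
    """Return the app version this profile last ran with (assumed layout)."""
    # compatibility.ini's [Compatibility] section stores LastVersion; if it
    # doesn't match Services.appinfo.version, the addon scan reruns.
    ini = configparser.ConfigParser()
    ini.read(f"{profile_dir}/compatibility.ini")
    return ini["Compatibility"]["LastVersion"]
```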

mcomella commented 3 years ago

I took new startup profiles on Nightly 200916 18:07 because I wasn't sure if the old profiles https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-691331138 used condprof or not:

Before taking the profiles, I:


We can use these profiles to:

mcomella commented 3 years ago

We got runs on a non-personal, rooted P2. Conditioned profiles:

PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1498.9, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1495, 1503, 1502, 1627, 1487, 1435, 1474, 1500, 1478, 1488], "lowerIsBetter": true, "value": 1498.9, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}
PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1489.6, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1470, 1487, 1462, 1495, 1476, 1508, 1504, 1490, 1496, 1508], "lowerIsBetter": true, "value": 1489.6, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}
PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1480.1, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1488, 1466, 1478, 1502, 1499, 1448, 1488, 1464, 1484, 1484], "lowerIsBetter": true, "value": 1480.1, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}
PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1490.9, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1515, 1513, 1505, 1483, 1502, 1458, 1403, 1523, 1522, 1485], "lowerIsBetter": true, "value": 1490.9, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}
PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1491.6, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1488, 1491, 1484, 1495, 1494, 1509, 1467, 1498, 1511, 1479], "lowerIsBetter": true, "value": 1491.6, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}

Non-cond prof:

PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1071.8, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1038, 1062, 1112, 1060, 1062, 1085, 1060, 1071, 1069, 1099], "lowerIsBetter": true, "value": 1071.8, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}
PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1094.4, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1082, 1106, 1083, 1113, 1110, 1107, 1089, 1098, 1085, 1071], "lowerIsBetter": true, "value": 1094.4, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}
PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1075, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1069, 1074, 1082, 1069, 1085, 1067, 1069, 1070, 1107, 1058], "lowerIsBetter": true, "value": 1075, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}
PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1100.4, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1123, 1086, 1106, 1103, 1183, 1074, 1082, 1093, 1080, 1074], "lowerIsBetter": true, "value": 1100.4, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}
PERFHERDER_DATA: {"suites": [{"name": "VIEW", "type": "pageload", "value": 1087.7, "unit": "ms", "extraOptions": [], "lowerIsBetter": true, "alertThreshold": 2.0, "shouldAlert": false, "subtests": [{"name": "browserScripts.pageinfo.processLaunchToNavStart", "replicates": [1080, 1101, 1108, 1125, 1072, 1074, 1084, 1100, 1059, 1074], "lowerIsBetter": true, "value": 1087.7, "unit": "ms", "shouldAlert": false}]}], "framework": {"name": "browsertime"}, "application": {"name": "fenix"}}
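Averaging the five suite values in each configuration above shows the gap plainly: roughly 1490ms with conditioned profiles vs. 1086ms without, a ~400ms difference consistent with the condprof overhead discussed earlier.

```python
cond = [1498.9, 1489.6, 1480.1, 1490.9, 1491.6]
non_cond = [1071.8, 1094.4, 1075.0, 1100.4, 1087.7]
print(sum(cond) / len(cond))          # ~1490.2 ms (conditioned profiles)
print(sum(non_cond) / len(non_cond))  # ~1085.9 ms (non-conditioned)
```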

This basically matches CI (non-cond prof run). That means there is a discrepancy in my local setup.


Resummarize

To follow-up on the last summary https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-692246058, the problem we're trying to solve is ensuring mach perftest will accurately represent performance changes. We've seen a few red flags:

  1. Conditioned profiles are slower than non-conditioned profiles and seem to run code outside of the user start path: we disabled conditioned profiles for now
  2. My local runs are slower than CI. However, someone else's runs are the same: what is different about my set-up?

Action items

mcomella commented 3 years ago

acreskeyMoz agreed that we may want to just start using the system. We will:

mcomella commented 3 years ago

File follow-ups to document more differences and find root cause of rooted/unrooted discrepancy

mcomella commented 3 years ago

Figure out how to start using perftest for regression detection (new issue?)

https://github.com/mozilla-mobile/perf-frontend-issues/issues/162

Sounds like this investigation is done! 🎉