mozilla-mobile / perf-frontend-issues

A repository to hold issues related to front-end mobile application performance.

Start using mach perftest VIEW for regression detection #162

Closed mcomella closed 3 years ago

mcomella commented 3 years ago

We've hit diminishing returns in our validation investigation (https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-694429310), and we believe mach perftest VIEW is ready to be used.

Along with this effort, we should consider including this data in the existing FNPRMS dashboards.

mcomella commented 3 years ago

I think the per-commit data is too noisy to identify a specific commit that caused a regression. The following graphs use Pixel 2 data; I generated them using sparky's script with these changes (waiting to merge into master).

Here's FNPRMS over the past 30 days, 25 replicates per data point: image

There is a distinct trend upwards.

Here's the perftest's per commit data over roughly the same period, 14 replicates per data point: image

You can't really see any trends, and the data seems to vary regularly between 1300ms and 1360ms (not good enough to catch the small regressions we've been seeing, ~10-40ms). To be fair, it's unclear whether the perftest methodology changed during that period. (Note: the drop at the end might be from us disabling conditioned profiles.)

Here's a recent week: image


Here's a graph where, for each day, we take the median of that day's per-commit runs: image

The graph seems less noisy but is missing the gradual trend we see on FNPRMS. This is concerning: when perftest runs are viewed in isolation, they seem pretty similar to FNPRMS, but when viewed in aggregate, the trends we see in FNPRMS are missing. What could be the cause?
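(For reference, the aggregation is roughly the sketch below; the CSV and the `date`/`median_ms` column names are placeholders for whatever sparky's script actually exports.)

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the per-commit results; column names are placeholders.
runs = pd.read_csv("perftest_per_commit.csv", parse_dates=["date"])

# For each calendar day, take the median of that day's per-commit results.
daily = runs.groupby(runs["date"].dt.date)["median_ms"].median()

daily.plot(title="perftest: daily median of per-commit runs (ms)")
plt.show()
```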

Possible next steps:

@acreskeyMoz Do you have any ideas about what we should do next? To be explicit, I'm trying to get perftest to a place where we can use it to detect regressions per commit. Theoretically, this script should let us do that, but there seems to be too much noise.

mcomella commented 3 years ago

To be explicit: viewed from this perspective, perftest is not yet a good substitute for FNPRMS as a nightly measurement system, never mind per-commit regression detection.

gmierz commented 3 years ago

One thing to note here is that these comparisons are between a small pool of devices (FNPRMS) versus a larger pool of devices (perftest). You might want to isolate a time series for each device in perftest and see if either of them, or some combination of them (if FNPRMS uses more than one), matches up to FNPRMS. It's possible that the regressions detected in FNPRMS are biased because your sample/test-subject size is too small.
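Something like the following could do that isolation (a rough sketch; the `device`, `date`, and `median_ms` column names are just placeholders for whatever the per-commit script exports):

```python
import pandas as pd

# Hypothetical export of the per-commit results; column names are placeholders.
runs = pd.read_csv("perftest_per_commit.csv", parse_dates=["date"])

# One time series per physical device, so device-specific drift isn't averaged away.
for device, rows in runs.groupby("device"):
    daily = rows.groupby(rows["date"].dt.date)["median_ms"].median()
    print(device)
    print(daily.tail())
```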

acreskeyMoz commented 3 years ago

This is my suggestion:

We take the raw results from these runs:

• perftest in CI with 25 iterations
• perftest in CI with 14 iterations
• FNPRMS run with 25 iterations
• (maybe?) FNPRMS run with 14 iterations

And for each we calculate:
• Std Dev
• Margin of Error

And compare. Maybe n=30 is the sweet spot for these tests?
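For the raw timings of a single run (a list of per-iteration milliseconds), a minimal sketch of those two statistics:

```python
import statistics
from scipy import stats

def summarize(timings_ms):
    """Std dev and 95% margin of error (t-distribution) for one run's iterations."""
    n = len(timings_ms)
    mean = statistics.mean(timings_ms)
    sd = statistics.stdev(timings_ms)           # sample standard deviation
    sem = sd / n ** 0.5                         # standard error of the mean
    moe = stats.t.ppf(0.975, df=n - 1) * sem    # +/- ms at a 95% confidence level
    return mean, sd, moe

# e.g. call summarize() on each perftest/FNPRMS run, then average the MoE across runs
```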

Perhaps we'll also want to compare variance. @gmierz was using a Levene test when looking at the perf tuning impact on noise.

I'm happy to do this unless someone else wants to. I do have a couple of days work already scheduled ahead of me though.

gmierz commented 3 years ago

Levene's test should definitely be used here to compare variance between data sources with potentially differing means. It will tell you whether there is a statistically significant difference in the variances, but only whether they are equal or not. If there is a difference, you can then use the %-change in std. dev./errors to determine which one is larger.
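A sketch of that comparison with SciPy, where `a_ms` and `b_ms` stand in for the raw timings of the two runs being compared:

```python
import statistics
from scipy import stats

def compare_variance(a_ms, b_ms):
    # Levene's test: a small p-value means the variances differ, but not which is larger.
    _, p_value = stats.levene(a_ms, b_ms, center="median")
    # The %-change in std dev gives the direction and size of the difference.
    sd_a, sd_b = statistics.stdev(a_ms), statistics.stdev(b_ms)
    pct_change = (sd_b - sd_a) / sd_a * 100
    return p_value, pct_change
```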

I still think that the difference in devices should be looked into, and we might be better off using pystones for these timings rather than raw milliseconds.

mcomella commented 3 years ago

I'm happy to do this unless someone else wants to. I do have a couple of days work already scheduled ahead of me though.

@acreskeyMoz If you have the cycles for it, I think it'd be ideal! :) I don't have any statistics background nor do I have much familiarity with how our CI infra is set up so I think it'd be time consuming for me and I'd need to bother you or sparky frequently anyway.

To be explicit: ideally, we could get this ASAP but I think it's fine to wait a few more days especially because I'll be implementing other regression prevention mechanisms while I'm waiting for your results.

I'll assume that you're going to do it but let me know if that's not the case. Thanks!

acreskeyMoz commented 3 years ago

I'm starting on this now -- I expect to have the initial summary for perftest by tomorrow.

I'm also including a variation where the multi-commit Fenix is replaced with the Nightly build.

@mcomella Can you provide me with individual results from FNPRMS runs on G5 and Pixel 2? Ideally from a few runs.

mcomella commented 3 years ago

@mcomella Can you provide me with individual results from FNPRMS runs on G5 and Pixel 2? Ideally from a few runs.

Clarified the requirements async:

I'm getting P2 numbers and MarcLeclair is getting G5.

acreskeyMoz commented 3 years ago

When I have all the data I'll summarize this, but FYI I'm collecting raw numbers and statistics on them here: perftest and FNPRMS variance

mcomella commented 3 years ago

FNPRMS numbers for the P2 (zip)

acreskeyMoz commented 3 years ago

I've imported all the results (thank you Michael and Marc and CI). So far this is what we're looking at:

| Test | Iterations | +/- ms (95% Margin of Error, average from 10 runs) |
| --- | --- | --- |
| perftest-g5 | 14 | 35.86 |
| perftest-g5 | 25 | 21.59 |
| perftest-p2 | 14 | 39.82 |
| perftest-p2 | 25 | 22.75 |
| fnprms-p2 | 25 | 19.11 |
| fnprms-g5 | 25 | 27.90 |

gmierz commented 3 years ago

@acreskeyMoz Is the difference in the mean/median between FNPRMS and perftest statistically significant or not? If it's significant, we should be reporting % differences (error/mean) rather than raw ms values.

acreskeyMoz commented 3 years ago

@gmierz In https://github.com/mozilla-mobile/perf-frontend-issues/issues/162#issuecomment-697935474 I'm just showing the calculated margin of error for each of those tests. We can't directly compare FNPRMS and perftest means/medians because they test a different spot on the navigation timeline.

gmierz commented 3 years ago

Right, but the size of the margin of error can depend on the mean/median: a larger mean can produce larger deviations than a smaller mean would.

For instance, suppose one case has a mean of 100 with a margin of error of 50, whereas another has a mean of 200 with a margin of error of 50. Looking only at the margin of error, without considering the mean, suggests the noise is the same when it isn't: the normalized error is 50% in one case (small mean) and 25% in the other (large mean).
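In other words, the comparison should be on error/mean; trivially (just restating the example above in code):

```python
def normalized_error_pct(mean, margin_of_error):
    return margin_of_error / mean * 100

print(normalized_error_pct(100, 50))  # 50.0 -> relatively noisier
print(normalized_error_pct(200, 50))  # 25.0
```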

acreskeyMoz commented 3 years ago

Right, but the size of the margin of error can depend on the mean/median: a larger mean can produce larger deviations than a smaller mean would.

For instance, suppose one case has a mean of 100 with a margin of error of 50, whereas another has a mean of 200 with a margin of error of 50. Looking only at the margin of error, without considering the mean, suggests the noise is the same when it isn't: the normalized error is 50% in one case (small mean) and 25% in the other (large mean).

Ah yes, point taken. We should not directly compare the FNPRMS and perftest margins of error. (Their means are ~10% different on G5, ~20% different on P2.)

But we can see that increasing perftest iterations to 25 (at least on G5, so far) gives us a margin of error that I believe is closer to what @mcomella is looking for.

gmierz commented 3 years ago

Yea I agree, perftest-g5 vs fnprms-g5 is looking promising! :)

acreskeyMoz commented 3 years ago

Added perftest-p2 results to https://github.com/mozilla-mobile/perf-frontend-issues/issues/162#issuecomment-697935474

acreskeyMoz commented 3 years ago

Some simple plots of the first two runs for each device and test (25 iterations for all tests). From here:

g5 p2

mcomella commented 3 years ago

@acreskeyMoz Another thing we can potentially do to reduce noise in perftest is to add delays. In my initial analysis when adding delays https://github.com/mozilla-mobile/perf-frontend-issues/issues/141#issuecomment-686043015, I was using conditioned profiles, so there was a built-in delay while the profiles were copied over – with conditioned profiles disabled, that delay no longer exists and its absence could be contributing noise.

acreskeyMoz commented 3 years ago

@acreskeyMoz Another thing we can potentially do to reduce noise in perftest is to add delays. In my initial analysis when adding delays #141 (comment), I was using conditioned profiles, so there was a built-in delay while the profiles were copied over – with conditioned profiles disabled, that delay no longer exists and its absence could be contributing noise.

Ah yes, that's a good and simple change.

We can discuss today, but so far these look like good improvements:

  1. Add a short delay between tests (browsertime configuration)
  2. Increase perftest iteration count from 14 to 25

acreskeyMoz commented 3 years ago

We decided against prematurely adding delays without data to show they yield improvements:

Increase iterations to 25: https://bugzilla.mozilla.org/show_bug.cgi?id=1667238

mcomella commented 3 years ago

The iteration change landed three days ago. The next action item for this bug is to look at the results (via the script and perhaps Treeherder) to see 1) if we can see trends similar to FNPRMS or 2) if the trends in FNPRMS don't accurately match what we might expect from users.

mcomella commented 3 years ago

The script doesn't return results after 9/15 – I'll need to check with sparky about what's going on.

mcomella commented 3 years ago

Sparky mentioned the Active Data system – an Elasticsearch database used by some scripts – is not ingesting data correctly. I filed: https://bugzilla.mozilla.org/show_bug.cgi?id=1668327

To get fenix per-commit regression detection working, I think my options are to:

  1. Get this fixed somehow
  2. Wait for the perf sheriffing work to complete

However, I'm not sure the interface in 2 will allow us to validate that this is a good replacement for FNPRMS (i.e. the steps in https://github.com/mozilla-mobile/perf-frontend-issues/issues/162#issuecomment-700266135).

gmierz commented 3 years ago

@mcomella for the script that relied on AD, I'm going to try to port that visualization to grafana through some work tarek has been doing.

acreskeyMoz commented 3 years ago

The iteration change landed three days ago. The next action item for this bug is to look at the results (via the script and perhaps Treeherder) to see 1) if we can see trends similar to FNPRMS or 2) if the trends in FNPRMS don't accurately match what we might expect from users.

If we want to compare the FNPRMS trends more closely to perftest, one option is to push jobs to try where the perftest view test measures the time to loadEventEnd:

https://searchfox.org/mozilla-central/source/testing/performance/perftest_android_view.js#14-16

  // Presumably navigationStartTime and processStartTime are absolute timestamps, while
  // timings.loadEventEnd is relative to navigation start, so this sum measures
  // process launch to loadEventEnd despite the variable's name.
  const processLaunchToNavStart =
    browserScripts.pageinfo.navigationStartTime + browserScripts.timings.loadEventEnd -
    browserScripts.browser.processStartTime;
mcomella commented 3 years ago

Status update: Tarek is working on inserting perftest data into grafana. Once that's complete, we should be able to download the data directly from grafana.

If that takes too long, Sparky has a workaround for the active data issue; it doesn't sound like active data is coming back.

mcomella commented 3 years ago

Talking with pslawless, we're waiting for more cycles (perhaps from the perf test team) to complete this effort.

mcomella commented 3 years ago

This bug isn't super actionable anymore and we're waiting on cycles from the perftest team to move forward with this. Let's reopen or open a new one when we have those cycles.