Add performance tests

Things we're interested in:

How much time does this component add to the first paint of an app?
How much time does this component add to the time-to-interactive of an app?
What is the full payload size (including dependencies) of this component?
How can we track our performance to spot regressions?

Let's timebox two days of research to figure out what kind of tool we should build for this. Something polyperf- or Lighthouse-based, or something else?

Did some testing with Lighthouse.

It allows you to define a custom set of rules to be checked and also supports adding custom "audits" and data "gatherers" – so in theory, we can do pretty much anything with the data we can read from Chrome.

The output can be formatted as JSON, which can then be asserted against. What we are missing is a service/tool to store results from previous runs so that we could assert on the diffs.

As an MVP, the easiest thing would probably be to use a custom ruleset that runs the performance audit rules, with a hard-coded threshold for each component.
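A minimal sketch of the hard-coded threshold idea. It assumes the Lighthouse JSON report has already been reduced to a plain object mapping audit name to a 0-100 score; that reduced shape, the function name `findRegressions` and the threshold values are all hypothetical, and the real report layout depends on the Lighthouse version.

```javascript
// Hypothetical per-component thresholds: minimum acceptable score per audit.
const thresholds = {
  'first-meaningful-paint': 85,
  'speed-index-metric': 85,
  'time-to-interactive': 90,
};

// Returns one entry for every audit whose score is missing or below its minimum.
function findRegressions(scores, limits) {
  return Object.keys(limits)
    .filter((audit) => !(scores[audit] >= limits[audit]))
    .map((audit) => ({ audit, score: scores[audit], minimum: limits[audit] }));
}

// Scores from the vaadin-context-menu run reported in this thread (Lighthouse 1.2.2).
const scores = {
  'first-meaningful-paint': 89,
  'speed-index-metric': 91,
  'estimated-input-latency': 100,
  'time-to-interactive': 93,
};

const failures = findRegressions(scores, thresholds);
if (failures.length > 0) {
  console.error('Performance regressions:', failures);
  process.exitCode = 1; // make the CI step fail
}
```

Setting `process.exitCode` instead of relying on the Lighthouse exit status lets the CI job fail on a threshold miss.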

Used the following setup to test:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title></title>
    <link rel="import" href="../vaadin-context-menu.html">
  </head>
  <body>
    <vaadin-context-menu></vaadin-context-menu>
  </body>
</html>

{
  "passes": [{
    "recordNetwork": true,
    "recordTrace": true,
    "gatherers": []
  }],

  "audits": [
    "first-meaningful-paint",
    "speed-index-metric",
    "estimated-input-latency",
    "time-to-interactive",
    "user-timings",
    "critical-request-chains"
  ],

  "aggregations": [{
    "name": "Progressive Web App",
    "description": "These audits validate the aspects of a Progressive Web App.",
    "scored": true,
    "categorizable": true,
    "items": [{
      "name": "Page load performance is fast",
      "description": "Users notice if sites and apps don't perform well. These top-level metrics capture the most important perceived performance concerns.",
      "audits": {
        "first-meaningful-paint": {
          "expectedValue": 100,
          "weight": 1
        },
        "speed-index-metric": {
          "expectedValue": 100,
          "weight": 1
        },
        "estimated-input-latency": {
          "expectedValue": 100,
          "weight": 1
        },
        "time-to-interactive": {
          "expectedValue": 100,
          "weight": 1
        },
        "scrolling-60fps": {
          "expectedValue": true,
          "weight": 0,
          "comingSoon": true,
          "description": "Content scrolls at 60fps",
          "category": "UX"
        },
        "touch-150ms": {
          "expectedValue": true,
          "weight": 0,
          "comingSoon": true,
          "description": "Touch input gets a response in < 150ms",
          "category": "UX"
        },
        "fmp-no-jank": {
          "expectedValue": true,
          "weight": 0,
          "comingSoon": true,
          "description": "App is interactive without jank after the first meaningful paint",
          "category": "UX"
        }
      }
    }]
  }, {
    "name": "Performance Metrics",
    "description": "These encapsulate your app's performance.",
    "scored": false,
    "categorizable": false,
    "items": [{
      "audits": {
        "critical-request-chains": {
          "expectedValue": 0,
          "weight": 1
        },
        "user-timings": {
          "expectedValue": 0,
          "weight": 1
        }
      }
    }]
  }]
}

Results

(Lighthouse throttles the network by default, which is why the load times are so long. We might want to keep the throttling if we want to use the score value; otherwise the scores are almost always close to 100.)

(btw, making the import of vaadin-context-menu-overlay.html happen asynchronously after attach boosts the "first meaningful paint" score to 96)

Lighthouse (1.2.2) results: http://localhost:8082/components/vaadin-context-menu/test/lighthouse.html

▫ Progressive Web App

Page load performance is fast:
 ── 89 First meaningful paint (2151.7ms)
 ── 91 Perceptual Speed Index (2162)
    - First Visual Change: 2162ms
    - Last Visual Change: 2162ms
 ── 100 Estimated Input Latency (16ms)
 ── 93 Time To Interactive (alpha) (2152.4ms)

▫ Performance Metrics

 ── ✘ Critical Request Chains (16)
    - Longest request chain (shorter is better): 2
    - Longest chain duration (shorter is better): 2049.15ms
    - Longest chain transfer size (smaller is better): 138.08KB
    - Initial navigation
      ┗━┳ test/lighthouse.html (localhost)
        ┣━━ vaadin-context-menu/vaadin-context-menu.html (localhost) - 368.76ms, 11.04KB
        ┣━━ vaadin-context-menu/vaadin-contextmenu-event.html (localhost) - 525.66ms, 3.47KB
        ┣━━ vaadin-context-menu/vaadin-context-menu-overlay.html (localhost) - 539.47ms, 3.63KB
        ┣━━ paper-styles/shadow.html (localhost) - 747.02ms, 3.35KB
        ┣━━ vaadin-context-menu/vaadin-device-detector.html (localhost) - 753.96ms, 2.27KB
        ┣━━ iron-media-query/iron-media-query.html (localhost) - 1012.36ms, 3.63KB
        ┣━━ iron-overlay-behavior/iron-overlay-behavior.html (localhost) - 1097.30ms, 21.03KB
        ┣━━ iron-resizable-behavior/iron-resizable-behavior.html (localhost) - 1355.00ms, 7.06KB
        ┣━━ polymer/polymer-micro.html (localhost) - 1397.58ms, 18.82KB
        ┣━━ iron-fit-behavior/iron-fit-behavior.html (localhost) - 1448.30ms, 20.05KB
        ┣━━ iron-overlay-behavior/iron-overlay-manager.html (localhost) - 1512.56ms, 11.06KB
        ┣━━ iron-overlay-behavior/iron-focusables-helper.html (localhost) - 1684.05ms, 9.40KB
        ┣━━ iron-overlay-behavior/iron-overlay-backdrop.html (localhost) - 1734.04ms, 5.28KB
        ┣━━ polymer/polymer-mini.html (localhost) - 1769.31ms, 53.91KB
        ┣━━ iron-a11y-keys-behavior/iron-a11y-keys-behavior.html (localhost) - 1827.72ms, 16.42KB
        ┗━━ polymer/polymer.html (localhost) - 2049.15ms, 138.77KB

 ── ✓ User Timing marks and measures (0)

State of the Art

Most perf frameworks/libraries aim to measure server-side responses
Other libraries focus on benchmarking individual algorithms or functions
Travis does not cover performance test cases like Jenkins does.
- So there is no out-of-the-box way to store historical data, compare it with the current build and configure thresholds
- They don't plan to cover this so far.

Web Components

For web components we have two solutions as @Saulis said

PolyPerf

It is a Polymer Labs project. Not much activity lately (I sent a fix PR btw)
But it's easy to integrate with web-component-tester
Its code is very simple, so we can easily improve it if needed.
Only works in Chrome
The way it works is that it loads the same test multiple times in an iframe and gives stats (mean, variance, deviation)
- You can configure in your test when the measure starts and when it's done
- Hence you can test any action in your component: load, scrolling, events
- You can also leave third-party dependencies out of the measurement, for instance by loading certain dependencies early, before starting your component import and test.
It lacks historical data
- Though that could be handled with low effort using some storage wc (cloudant, firebase)
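The statistics mentioned above can be sketched as follows. This is not PolyPerf's own code: the function name `summarize` is hypothetical, and dividing by N (population variance) is an assumption. The deviation is the square root of the variance, which is consistent with the values the prototype logs further down (the square root of 1165.25 is about 34.1358).

```javascript
// Summarize a list of timing samples (e.g. milliseconds per iframe load).
function summarize(samples) {
  if (samples.length === 0) {
    throw new Error('at least one sample is required');
  }
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  // Population variance: average squared distance from the mean (assumption).
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
  return { mean, variance, deviation: Math.sqrt(variance) };
}

// Example with made-up timings from three runs.
console.log(summarize([200, 220, 240]));
```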

Prototype code: https://github.com/vaadin/vaadin-context-menu/commits/proto/perf-tests

LightHouse

It's a CLI utility to measure the performance of websites, strongly focused on PWAs
- Though we can report only the performance data with the --perf flag
- And we can generate a .json report file
~~It needs a plugin loaded in Chrome.~~
- ~~Not sure if it can be done in SauceLabs~~
Only works in Chrome
We need a way to parse the output because the exit status does not help
We also need a way to store historical data
The main caveat is that it only reports performance for loading the full page until it is ready
- That includes the element under test but also all other dependencies, so you have to measure everything as a whole.
- You cannot decide what to measure, or when the measurement starts and ends
It has some magic to figure out when the user can interact with the page.
Another feature is that it can throttle network performance, etc.

Prototyping

Working with both solutions, there is a prototype able to run LightHouse and Polyperf tests in Travis.

https://github.com/vaadin/vaadin-context-menu/commits/proto/perf-tests

Polyperf tests are run using wct. There is only one test, which measures the time between the page loading the component under test and the event that opens the overlay. It loads the test page 10 times in an iframe and computes the mean, variance and deviation of all measurements. We compare the results with thresholds.
LightHouse tests are run using a shell script. Multiple tests can be configured to run in different processes. In this case there is a simple test that loads the component, and also a test that loads the demo page. When the tests are run, the script reads the output JSON file and takes the overall performance value. We compare this value with a threshold to make the test pass or fail.
Travis is configured to run these tests directly without using Sauce Labs.

Issues

Final measurement values are very dependent on the platform and machine where the test is run.
We are not storing historical values to compare against right now
We don't have any control test to serve as the baseline for comparing measurements

Ideas

These are some actions that we can take, not necessarily all at once

Store historical data: We can use a web service or a cache directory in Travis. Data should be linked to the platform where it was run, so we need a separate history for each system where the tests are run
Use a dedicated server for performance tests, in order to have stable and reliable historical data.
- I propose to have a Jenkins server in our infra, only for Elements. It should run one process at a time to avoid interference. It has an out-of-the-box performance plugin able to deal with historical data, thresholds, charts etc.
Use a control page to set the base threshold. We can have for instance a <hello-word> component that just loads polymer and draws a message when ready.
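A rough sketch of per-platform history storage, assuming a plain JSON file (for example inside a CI cache directory) is good enough as the store. The file layout, the function names `platformKey` and `recordRun`, and the use of OS name plus CPU architecture as the platform key are all hypothetical choices, not something the prototype does.

```javascript
const fs = require('fs');
const os = require('os');

// Key identifying the kind of machine a measurement came from (assumption:
// OS name plus CPU architecture is specific enough).
function platformKey() {
  return `${os.platform()}-${os.arch()}`;
}

// Appends one measurement to a JSON history file, grouped by platform and test.
// Returns the full list of runs recorded so far for that platform and test.
function recordRun(historyFile, testName, value, key = platformKey()) {
  const history = fs.existsSync(historyFile)
    ? JSON.parse(fs.readFileSync(historyFile, 'utf8'))
    : {};
  const perPlatform = history[key] || (history[key] = {});
  const runs = perPlatform[testName] || (perPlatform[testName] = []);
  runs.push({ date: new Date().toISOString(), value });
  fs.writeFileSync(historyFile, JSON.stringify(history, null, 2));
  return runs;
}
```

A call such as `recordRun('perf-history.json', 'vaadin-context-menu', 404.2)` would then let a later build compare its value against the earlier entries recorded for the same platform.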

Prototype enhancements

Due to the lack of a reliable way to repeat tests under the same conditions, control tests have been added to the prototype.

In both cases, PolyPerf and LightHouse, a new control test that loads a paper-button component in the page has been added. Once the control test has run, we use its time to compute the threshold for testing the component.

LightHouse results

The script takes 70s to run the performance analysis for all the pages: the control page light/control.html, the simple component test light/test.html, and the demo index demo/index.html, all of them in shady and shadow modes.

The test fails if any of our test pages performs more than 10% worse than the control: threshold = control_total * 1.1

>>> Running test: http://localhost:3000/vaadin-context-menu/test/light/control.html
 >> lighthouse total=0.485 threshold=0.5335 test=test/light/control.html

>>> Running test: http://localhost:3000/vaadin-context-menu/test/light/test.html
 >> lighthouse total=0.46 threshold=0.5335 test=test/light/control.html status=0
>>> Running test: http://localhost:3000/vaadin-context-menu/test/light/test.html?dom=shadow
 >> lighthouse total=0.47000000000000003 threshold=0.5335 test=test/light/control.html status=0

>>> Running test: http://localhost:3000/vaadin-context-menu/demo/index.html
 >> lighthouse total=0.35000000000000003 threshold=0.5335 test=test/light/control.html status=0
>>> Running test: http://localhost:3000/vaadin-context-menu/demo/index.html?dom=shadow
 >> lighthouse total=0.4325 threshold=0.5335 test=test/light/control.html status=0

PolyPerf results

WCT runs two tests. The first one is the control page test (perf-paper-button.html), which computes the perf value used as the base for the threshold: limit = control.mean * 1.8. The second one is the component test (perf-vaadin-context-menu.html), which should not perform more than 80% worse than the control.

For some reason we need a greater multiplier than the one we used with LightHouse; it seems to be related to the number of dependencies needed.

Measured values in the prototype are:

CONTROL RUN, values: Object {mean: 218.5, variance: 1165.25, deviation: 34.13575837739657}
TEST RUN, limit: 371.45 values: Object {mean: 404.2, variance: 865.5599999999998, deviation: 29.420401084961433}
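The control-based check can be sketched as below, assuming a run passes when its mean time does not exceed the limit. The function name `evaluateRun` is hypothetical. The multiplier is left as a parameter because the logged limit of 371.45 equals 218.5 × 1.7 rather than 218.5 × 1.8; with either multiplier the logged test mean of 404.2 is above the limit.

```javascript
// Derives the limit from the control run and checks the component run against it.
// Assumption: lower mean time is better, so a run passes when mean <= limit.
function evaluateRun(controlRun, testRun, multiplier) {
  const limit = controlRun.mean * multiplier;
  return { limit, passed: testRun.mean <= limit };
}

// Values logged by the prototype.
const controlRun = { mean: 218.5, variance: 1165.25, deviation: 34.13575837739657 };
const testRun = { mean: 404.2, variance: 865.5599999999998, deviation: 29.420401084961433 };

console.log(evaluateRun(controlRun, testRun, 1.8));
```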

Cache Issues

PolyPerf works by reloading the same page N times (10 in our case) in an iframe. The network console shows that the components are fully loaded only the first time; on subsequent loads the responses have a not-modified status. Apparently there is no way to reset the cache programmatically to force all dependencies to be downloaded again. One option could be to load the test just once instead of N times, but that is not reliable enough. A better option might be to configure the WCT server so that it does not send cache info or not-modified responses, or to set some Selenium options to disable the browser cache, but I have not found a way yet.

LightHouse apparently does not use the cache because it loads the page just once, and our test script restarts the browser for each page.
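For the server-side option, a generic Connect/Express-style middleware could look like the sketch below. The function name `noCache` is hypothetical, and whether the WCT server exposes a hook for mounting such a middleware in front of its static file handler has not been verified here.

```javascript
// Middleware with the usual (req, res, next) signature.
function noCache(req, res, next) {
  // Drop conditional-request headers so a handler further down the chain
  // has nothing to validate against and cannot answer 304 Not Modified.
  delete req.headers['if-none-match'];
  delete req.headers['if-modified-since'];
  // Ask the browser not to store the response at all.
  res.setHeader('Cache-Control', 'no-store');
  next();
}
```

With this in place every iframe reload would fetch all dependencies again, at the cost of slower test runs.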

Conclusions

Adding LightHouse to any component seems very easy, provided that we are fine with the aggregated performance total it reports.
Developing LightHouse plugins to test specific features could be very tough.
PolyPerf still needs some more research to find options to disable the cache and get more reliable results
Though, PolyPerf seems more suitable for testing specific features and integrating with wct.

vaadin / vaadin-context-menu

Add performance tests #61