shinglyu commented 8 years ago

I had some discussion with @larsbergstrom about this, and here is my proposal. Feedbacks welcome!

Goal

Measuring Servo's page load performance on top websites (e.g. Alexa 500), and comparing them to other browser's performance.

Existing Solutions

Firefox Desktop
- Talos - tp5 suite
- PerfHerder (dashboard)
Firefox OS
- b2gperf (deprecated)
- Raptor (dashboard)
  Proposal
  
  Measurements
Page load time
- Using window.performance API
(Optional) Record profiling data during test
Test Environment
(By priority) Linux, OSX, Windows, Android, ARM embedded, other
Test Harness
Python test runner (unittest or py.test)
- wpt-runner?
Test pages are served from a local server to minimize network latency noise
Use WebDriver to execute performance.timing and collect the results
Test Cases
tp5
- 51 manully selected and cleaned site from Alexa top 500 April 8th, 2011
- Outdated, but a good starting point
- Served from local server
Web Page Replay
- Record and replay Alexa top 500 sites
- Need to select representitive pages (i.e. not login or landing page)
- Need a lot of manual labor
- We should also include target sites from our June release plan:
- github.com
- duckduckgo
- hackernews
- reddit
  Visualization & Notificaion
Raptor
- What we used in B2G
- A drag-and-drop dashboard solution with influxDB backend
(Alternative) PerfHerder?
(Alternative) Another AreWe...Yet from scratch?
Automated regression identification and bug reporting
- Use statistical rule to identify perf regressions
- Example: Raptor Regression with slideing T-tests
- Submit the regression bug to GitHub issues
  Plan
Phase 1:
- Serve tp5 pages on a local server
- Run Servo (Linux 64) against tp5 and collect performance.timing data
- Push the data to an influxDB instance
- Plot them in Raptor (hosted on heroku)
- Run against nightly builds on a local Jenkins server
Phase 2:
- Record selected Alexa 500 sites with Web Page Replay
- Automate regression finding and bug reporting
- Compare with Gecko
- Integrate with existing Servo test infrastructure
Phase 3
- Bring up more platforms

autrilla commented 8 years ago

Is this something you plan on doing yourself, or are you looking for someone to work on this? If so, I'm interested :)

larsbergstrom commented 8 years ago

I'd be very interested in seeing page load time as the most important piece of information but also capturing the raw output of the profiling data. @rzambre and I are working on some patches to make that easier to grab for automated systems, which we hope to land very soon.

The only other obvious piece of information that would be really helpful is the output of memory profiling, though I don't know how practical that is to collect for initial page load scenarios. @nnethercote can you comment on whether that would be a useful measure to track or if we need more steady-state browsing data?

CC: @jgraham @jdm @metajack @Ms2ger @edunham whom I expect to have other feedback

Thanks for writing this up - I'm very excited for this work!

jgraham commented 8 years ago

So, generally the idea of using something like tp5 and measuring the load times seems like a sensible first step. I would encourage you to build as little infrastructure as possible though; we already have solutions for monitoring performance data and I suspect that anything you invent now will be about the same amount of effort to get working as reporting to perfherder, but will be more of a maintenance burden in the future. @wlach is the expert here and will be able to provide hints.

For harnesses, wptrunner already provides a mechanism to launch servo, but that's pretty much all you'll be using in this case. It will be possible to adapt to your usecase, but if your plan is literally just to launch servo for each url and read the timing data from stdout (which seems like the easiest implementation for now), then it's probably overkill. I think just using purely custom python code is quite defensible. I would avoid unittest.

I agree that in the future using recorded loads instead of static copies of sites will be a much better simulation of the real world.

wlach commented 8 years ago

Yeah, I'd really encourage you to consider using Perfherder, which not only solves the problem of storing and visualizing performance data but also acting on it. I've spent the last few quarters working on a performance sheriffing view, which we've been using for tracking regressions in Talos and other things:

https://treeherder.mozilla.org/perf.html#/alerts?status=-1&framework=1

Perfherder automatically detects regressions and provides a simple method for filing bugs on them based on a template. We'd probably need to make some minor adaptations to support Servo, but nothing major. I'm in the middle of a similar effort to make Perfherder a good solution for sheriffing AreWeFastYet data, which I think should cover most of your use case: http://wlach.github.io/blog/2016/03/are-we-fast-yet-and-perfherder/

Submitting data to perfherder is not hard, all that is involved is creating a standard treeherder job and adding a "performance artifact" to it (there's plenty of sample code for this). We've used treeherder successfully with github projects before (bugzilla, gaia), so I don't see why Servo would be a problem.

edunham commented 8 years ago

I'm +1 on using perfherder, since you'll almost certainly get better support and performance with a tool that people focus on full time than a one-off competing with many other projects for my, Jack's, and Lars's time. @wlach, does perfherder expose a public API of the data it collects, as well as the built-in metrics visualization?

nnethercote commented 8 years ago

Memory usage on page load would be reasonably useful. Tracking that would be a lot better than tracking nothing.

shinglyu commented 8 years ago

Let me summarize the above discussion:

Use tp5 as a start point
Use a dead simple python script as test harness
Use Perfherder for data collection and reporting
Additional metrics we can collect:
- memory usage: something like AreWeSlimYet?
- profiling

The technique used in AreWeSlimYet seems daunting to me, I'd appreciate if anyone can point me to any tool or document I can study.

@autrilla : Any help would be most welcome :) I'll open a new repo for this project and try to merge it back when it's mature.

shinglyu commented 8 years ago

I started some experiment in this repo: https://github.com/shinglyu/servo-perf

wlach commented 8 years ago

@edunham: Perfherder has a bunch of endpoints for getting series data (the UI uses these):

https://treeherder.mozilla.org/docs/#!/project/Performance_Datum_list https://treeherder.mozilla.org/docs/#!/project/Performance_Signature_list

And also one to get a list of "alerts" (detected changes) programatically:

https://treeherder.mozilla.org/docs/#!/performance/Performance_Alert_Summary_list

Feel free to ask either me or jmaher on irc.mozilla.org #treeherder or #perfherder if you have more questions

shinglyu commented 8 years ago

@wlach Thank you for your information, but I still don't understand how Perfherder works.

So is Perherder just a frontend, and it queries data from Perfherder?
The performance data is actually a bunch of build artifacts from a series of treeherder jobs, we just get them all using the treeherder API? Not aggregated to some database?
I don't have any experience in creating treeherder jobs, is there any documentation or previous patch I can follow?
Do I need to request special permission to run my custom treeherder job on the production server?

Thank you!

shinglyu commented 8 years ago

Oh I found this: https://treeherder.readthedocs.org/submitting_data.html Can I use this?

wlach commented 8 years ago

@shinglyu I think you figured this out for yourself, but yes, that's the guide to use for submitting data. It doesn't cover performance data specifically (at least not yet), but there is some good prior art in autophone that you can hopefully use as a reference:

https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py

Since this is the first time we'll be submitting Servo data to treeherder, we'll also need to send revision information. There's some guidance on doing that in the submitting data document that you linked to. Eventually you might want to consider using TaskCluster for scheduling jobs and submitting data, which I believe might take care of some of those details for you.

To answer your earlier questions, Treeherder/Perfherder does actually aggregate performance data in an easy-to-digest form, which is how we provide all the frontend views at https://treeherder.mozilla.org/perf.html

My recommendation would be as follows:

Bring up your own test server and create a test program to submit data to it (both revision, and job/performance data).
Make your performance testing job submit data to your test server
We give you credentials and you start submitting data to stage (https://treeherder.allizom.org) for a few weeks, just to make sure everything's working
Once we're confident that your script is submitting good, reliable data, start submitting to treeherder production.

shinglyu commented 8 years ago

@wlach : Thanks a lot! I'll start step 1 and 2 and contact you when I'm ready for step 3. :)

shinglyu commented 8 years ago

Update: the test runner is almost ready https://github.com/shinglyu/servo-perf Some of the tp5 test case (those with complex js and many ad pics) will run forever even if I set -o output.png. And trying to close it with window.close() wrapped in setTimeout() doesn't work either. I'm trying to figure out the root cause and may force kill servo if it runs for too long.

shinglyu commented 8 years ago

@wlach : The treeherder-client on PyPi is not the latest version in tree. Also the sample code in the documentation is out of sync with the unit tests. Which version should I use to match the server version on stage and production?

wlach commented 8 years ago

@shinglyu Good catch! I updated the version on pypi to reflect what's in the tree (treeherder-client-2.1.0). Please use new the new version. The docs should be up-to-date at this point. If they're not, please file a PR to fix them or let me know what's wrong so I can do so.

larsbergstrom commented 8 years ago

@shinglyu Sometimes you may have better results with -x -o output.png. If Servo still does not exit with both of those flags, please open issues and we will look into them - that probably indicates a deadlock or other bug in Servo!

shinglyu commented 8 years ago

@wlach: Thank you @larsbergstrom: Thanks, I'll use -x -o. I'm tying to identify those tests in this bug: https://github.com/shinglyu/servo-perf/issues/1. I might temporarily disable those first, and files bugs for them.

shinglyu commented 8 years ago

@wlach I was able to submit a ResultSetCollection and JobCollection through the python API

_082

But I can't figure out how to format a performance artifact, I found the following code: https://github.com/mozilla/autophone/blob/16669a6a13c78dc376ed60b9c6b005d69bda572b/tests/perftest.py#L31 But when I tried to find it on treeherder, I found something like this: https://autophone.s3.amazonaws.com/pub/mobile/tinderbox-builds/mozilla-inbound-android-api-15/1460693988/autophone-talos-tp4m-remote.ini-1-nexus-6p-2-106e7edc-214c-4427-a5e4-4f0405e7d30d-autophone.log

I thought the log should be a JOSNified PerfherderArtificat? Or was it consumed in the backend and I can't see it from the UI?

shinglyu commented 8 years ago

Answering myself: Just found this test: https://github.com/mozilla/treeherder/blob/4a357b297fde5d5ba3f93c27a53aea53292f53a9/tests/e2e/test_perf_ingestion.py

shinglyu commented 8 years ago

Edit: I commited the wrong file... Ha, I wrote a test script and successfully submitted to my local treeherder instance. I'll try to hook it up with my test runner. _084

shinglyu commented 8 years ago

@larsbergstrom I'm not sure how to get the revision information when I submit data to treeherder. I am thinking about dumping the git log -n 1 output to a file, and load it when I run the test. But I'm not sure if that's flexible enough if we want to move the test to our CI infrastructure in the future. How can I get things like commit hash, author, and timestamp when I run the perf test on CI?

shinglyu commented 8 years ago

@larsbergstrom @wlach Also, I'm not sure how to present the data point. Talos' TP5 use this kind of summarization

summarization: subtest: ignore first 5 data points, then take the median of the remaining 20; source: test.py suite: geometric mean of the 51 subtest results. (ref)

That is one (median) time for each website, and one mean time for the whole suite. But for performance.timing we have multiple measurements, should we split them by measurement or by website? For example:

By measurement

Suite 1: responseEnd
- subtest: www.google.com
- subtest: www.amazon.com
- ...
Suite 2: domComplete
- subtest: www.google.com
- subtest: www.amazon.com
- ...
...

By website

Suite 1: www.google.com
- subtest: responseEnd
- subtest: domComplete
- ...
Suite 2: www.amazon.com
- subtest: responseEnd
- subtest: domComplete
- ...
...

wlach commented 8 years ago

@shinglyu For getting commit information, I wonder if it might not be easiest to use a library like GitPython (http://gitpython.readthedocs.org/)

For the second question, I think definitely seperating by measurement makes the most sense. However, I would question the utility of measuring anything but the time for the document being fully loaded and painted (which is what tp5o measures). There's a cost of complexity of recording additional information, I'd personally probably just start with the same metric as tp5o, then add additional measurements if they're proven to be needed.

shinglyu commented 8 years ago

@wlach I wants to separate my build step and test step, so I'll package the servo binary into a zip and copy it to my test runner's directory. So the test runner doesn't need to access the Servo code base. I think I'll use git log's formatting option to export the commit message as a JSON string, and dump it into the zip file.

Your suggestion makes a lot of sense. I think I'll only submit the domComplete timing for visualization, while keeping the other measurements in the log files. If we found that we need them, we can submit them later.

wlach commented 8 years ago

@shinglyu: BTW, soon treeherder will have the capability of ingesting github revision data (on a push level, no less) which I think will work much better than you submitting revision data by hand. So I'd just get something hacky working there for now (your solution sounds fine) and hopefully we can switch to something better later in this quarter.

https://bugzilla.mozilla.org/show_bug.cgi?id=1264074

shinglyu commented 8 years ago

@wlach Good to know!

I have automated the whole build > test > submit to local perfherder flow. I'll let it run for a few days to see if everything is stable enough for submitting to staging.

jgraham commented 8 years ago

@shinglyu Awesome!

shinglyu commented 8 years ago

@larsbergstrom The --headless build option was removed. How do I run servo headlessly now? I tried xvfb-run but I got

Xlib:  extension "XFree86-VidModeExtension" missing on display ":99".
Xlib:  extension "GLX" missing on display ":99".
Xlib:  extension "GLX" missing on display ":99".
Servo exited with return value -11

I was trying to run the test in Jenkins (no X), so it can be triggered after a successful servo build. Now I can only use a shell script to test it periodically.

jdm commented 8 years ago

I believe -z is the argument to pass when running Servo.

larsbergstrom commented 8 years ago

Yes, that's correct.

shinglyu commented 8 years ago

@jdm Thanks for the information!

@wlach I tried to push a dozens of test data onto my local treeherder, everything looks OK so far. How can I apply for access to the staging machine? Do I need to provide some sample data I submitted for review?

@larsbergstrom Now I run everything on my desktop computer, using a combination of Jenkins + bash script. What kind of setup do you suggest for 1) before june tech preview 2) in the long run? Are we planning to run it in buildbot/travis/taskcluster? I might need to modify the external interface for flexibility to connect with those systems.

I'll clean up the code a bit and PR a minimal version into the servo tree. I'm afraid that if I implemented too many test runner features, the review will be hard. And I'll probably create a ./mach test-perf command for it.

shinglyu commented 8 years ago

@larsbergstrom I also want to clarify our goal, are we looking for regression over time, or we are trying to compare with other browsers? Which one is more important/urgent right now?

metajack commented 8 years ago

I haven't discussed it with Lars yet, but I would think we want comparison with other browsers first. We know we are probably behind right now since we've never tested or tried to optimize this. It is important not to make the situation worse, but the immediate goals are to get page load performance at least into the same ballpark.

wlach commented 8 years ago

@shinglyu Submitting to treeherder stage should be no problem, just follow the procedure here to add credentials and ping me again when you've done so:

http://treeherder.readthedocs.io/common_tasks.html#generating-and-using-credentials-on-treeherder-stage-or-production

shinglyu commented 8 years ago

@metajack: Thanks for the information @wlach: Thank you, here is the bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1268381

shinglyu commented 8 years ago

Some initial data can be seen on the staging server now: https://treeherder.allizom.org/#/jobs?repo=servo&selectedJob=1

shinglyu commented 8 years ago

I documented how I submitted data to Perfherder: http://shinglyu.github.io/web/2016/05/07/visualizing_performance_data_on_perfherder.html Feedback is welcome!

cc: @wlach

wlach commented 8 years ago

Hey @shinglyu, great post! I would like to get some of that integrated within the treeherder documentation.

One thing, one should not be submitting servo data with the "talos" framework, as that's intended solely for the Gecko platform. I'd like to add a new performance framework for "servo", see: https://bugzilla.mozilla.org/show_bug.cgi?id=1271472

metajack commented 8 years ago

@shinglyu I see jobs are getting sent to staging on a regular basis, including the performance artifacts. Is there a way to compare these results with firefox yet?

@wlach If we have a new framework (seems like servo-perf is what was chosen) can we then compare performance results against things in Talos? We definitely want to be able to see how we're doing against Firefox's tp5 results.

wlach commented 8 years ago

@shinglyu I'm seeing some issues with this:

It appears as if you're still specifying the talos framework. As of yesterday, the servo-perf framework is on stage, so please assign that to your series.
It looks like the performance series signature keeps on changing, which makes it impossible to track performance and generate alerts. I can't see any logs to determine what you're actually submitting, so I don't know why this is. Could you create a log of what you're submitting to the job (PERFHERDER_DATA), and upload it to s3 or somewhere similar, and then link it to treeherder? You can see an example of adding a treeherder log to a job here:

https://github.com/mozilla/autophone/blob/master/autophonetreeherder.py#L437

(I presume Servo has an S3 account to use -- if you don't, let me know)

@metajack I think it's going to be really hard to compare against Firefox unless you're running the exact same test, which I don't believe you are at this point. Maybe the easiest route is to somehow run servo-perf against firefox, perhaps on a nightly basis?

metajack commented 8 years ago

How does our tp5 test differ from the one that Firefox runs?

wlach commented 8 years ago

@metajack I'm not familiar with exactly what servo-perf is testing, if it's using the same pageset as talos tp5 that's a great start at measuring the same thing. But even if the pageset is the same, you would have to make sure that the harness is recording information in the same way.

The numbers from talos vs. servo-perf seem pretty far off from one another:

https://treeherder.allizom.org/perf.html#/graphs?series=%5Bmozilla-inbound,6a48ac54b45a24ccd037d18e2d58b0472c4ccd6a,1,1%5D&series=%5Bservo,b28838a4b625b0f341e87aeb3e10aeb1633afeed,1,8%5D&series=%5Bservo,f6067f4bc04fef24aa4eec8ff55794727bfe5f7f,1,8%5D&zoom=1462841532703.583,1463063448000,0,2206.5215179885645

larsbergstrom commented 8 years ago

@wlach I'd expect Servo's numbers to be pretty far off - we have done nearly zero "complete page load" performance work yet, and there's a ton of known low-hanging fruit. So, that chart may be pretty close to reality :-)

jgraham commented 8 years ago

I may be missing something, but trying to compare performance numbers from different implementations of the "same" testsuite running on different hardware seems like it isn't going to produce good results? The infrastructure that produces results for Servo should also should submit its own results for Firefox, running with the same harness on the same hardware in order to get meaningful numbers.

metajack commented 8 years ago

@jgraham Thanks for pointing that out.

@shinglyu What are the rough specs for what the tp5 results you have submitted so far run on? Are you planning to add Firefox tests on the same hardware?

shinglyu commented 8 years ago

@metajack We can't compare out servo-perf test with existing tp5-Firefox-talos test. The reason is that we have our custom test runner ( Open PR: https://github.com/servo/servo/pull/11107). It runs a subset of tp5 tests, because some pages makes servo run forever (see #11087 ). Also it measures domComplete time from performance.timing, which is different from how talos measures. We are planning to run Firefox in our test runner our way, as @jgraham said, see https://github.com/shinglyu/servo-perf/issues/4

@wlach I changed the framework, but I broke the test runner, so it failed to submit data for 2 days. The data you are looking at is probably 2 days old. The latest one should be correct: https://treeherder.allizom.org/#/jobs?repo=servo&selectedJob=16

About the "performance series signature", is that the job_guid? I though that was for identifying the specific test run, so I randomly generate a UUID style string. The old data seems to be on the same graph, but the new onews shows up as one data point per graph: https://treeherder.allizom.org/perf.html#/graphs?series=[servo,4df09c87df5f6294eb04c94f19ce8a0aae144c0e,1,8]&series=[servo,951d1b202b324d85bc3229a334b88370f6c18363,1,1]&series=[servo,f6067f4bc04fef24aa4eec8ff55794727bfe5f7f,1,1]&selected=[servo,4df09c87df5f6294eb04c94f19ce8a0aae144c0e,2,3,1]

And yes, I haven't push the log to S3 and create a link in the artifact. It's on my backlog and I'll open a bug for that.

shinglyu commented 8 years ago

@wlach: Now the data points are in the same graph again. https://treeherder.allizom.org/perf.html#/graphs?series=[servo,f6067f4bc04fef24aa4eec8ff55794727bfe5f7f,1]&selected=[servo,f6067f4bc04fef24aa4eec8ff55794727bfe5f7f,9,17]

I assume the problem is because I'm transitioning from talos to servo-perf?

shinglyu commented 8 years ago

In case you are confused, the May 10 commit is still using talos. I changed to servo-perf and break the code, so there are some missing data between May 10 and May 12. The first May 12 build is also broken, so the new clean data should start from the bca625bd8e60 commit.

servo / servo

Realistic Page Load Time Test #10452

Goal

Existing Solutions

Proposal

Measurements

Test Environment

Test Harness

Test Cases

Visualization & Notificaion

Plan