MapLibre GL JS - Interactive vector tile maps in the browser
https://maplibre.org/maplibre-gl-js/docs/

RFC - High-level performance metrics #4858

Open vcschapp opened 4 days ago

vcschapp commented 4 days ago

This is a request for comments to start a discussion about some possible features. Let me know if it's not the right place, as I'm happy to move this issue or create a new one elsewhere.

Summary

This is a proposal to instrument MapLibre GL JS with a small set (no more than 2–3) of high-level performance metrics that can be used to drive sustained improvements in initial map load performance and interactivity. A decision framework for selecting metrics to track is given and three metrics are suggested based on the framework: underutilized network time, unrendered resource time, and cancelled network time.

The decision framework and the suggested metrics are independent proposals. For example, the decision framework may be sound while better alternatives to the proposed metrics exist that more closely meet the framework's goals; or the decision framework itself may require refinement.

Motivation

The things that get measured are the things that improve.

We want to improve the end-user experienced performance of interactive maps on MapLibre, including initial map load time and the time to completely respond to end user interactions like pans and zooms. To improve it, we need to measure it. To be more specific, we need to measure performance in a way that can feed mechanisms that drive performance improvements and prevent regressions.

Decision Framework

The following tenets are proposed for selecting metrics to instrument into MapLibre:

  1. Easy to understand. A person with a basic high-level understanding of how map rendering client software works should be able to understand what each specific metric is measuring, why the quantity measured is important, and how each specific metric is meaningfully different from the others.
  2. Limited number. Measuring too many quantities diffuses the signal, reduces the probability that the metrics are easy to understand, and makes it more likely that nobody is paying attention. The right number of metrics is just a handful, around 2 or 3.
  3. Directional, not diagnostic. The purpose of the metrics is to drive behavior that keeps end-user experienced performance moving in the right direction. It is not to detect specific bottlenecks or identify specific changes that need to be made. The envisioned model, once the metrics are in place, is to select a metric to optimize, select a goal value, and create an issue, contract, or work order to drive the chosen metric down to the goal value without regressing the other metrics. The means to achieve the goal will be left to the person or persons doing the work.
  4. Isolate changeable client behavior. The end-user experience of maps is influenced by both client-side and server-side factors. Since MapLibre does not control the server side, and the server side can be independently optimized by those who do control it, any useful metric must try to exclude server-side factors such as download TTLB to the extent possible.
  5. Aggregate to a meaningful signal. To understand whether performance is getting better, staying the same, or getting worse, we need to be able to aggregate metrics meaningfully. This means that, in a test environment that holds confounding factors relatively constant (e.g. hardware, OS, browser version, and system resource utilization), a large collection of metric observations should give an accurate indication of client-side performance when fed through mean, median, and perhaps other aggregate functions (a small sketch follows this list).
  6. Comparable across releases. Comparing aggregate values between two releases must give meaningful information about their relative client-side performance. This means the metric has to be general enough to continue existing from release to release, and that it has to measure something directly tied to end-user experienced performance.
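
The aggregation step in tenet 5 is simple enough to sketch. The following is plain TypeScript with no MapLibre dependency; the helper names are ours, not any existing API, and the nearest-rank percentile is just one reasonable estimator:

```ts
// Aggregate a run of raw metric observations (milliseconds) into the
// summary statistics tenet 5 mentions. Illustrative helpers only.
function mean(xs: number[]): number {
    return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Nearest-rank percentile, p in [0, 100]. Assumes xs is non-empty.
function percentile(xs: number[], p: number): number {
    const sorted = [...xs].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[idx];
}

// e.g. summarizing one test run's underutilized-network-time observations:
// mean(observations), percentile(observations, 50), percentile(observations, 90)
```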

Metrics

The following metrics are proposed to be instrumented into MapLibre GL JS:

  1. Underutilized network time. Wall time between when MapLibre has access to enough information to know it needs to make network requests for resources and when it does in fact make those requests.¹ Examples:
     - Once MapLibre has access to the style sheet for a map, it has access to enough information to request the style's sprite sheets and the tiles visible in the current viewport. The underutilized network time metric ticks from the time the last byte of the style sheet is received until the time the last tile is requested.
     - If MapLibre has access to a tile, it has enough information to request any glyph pages needed by the tile that haven't been fetched yet. The underutilized network time metric ticks retroactively from the TTLB of the tile until all needed glyph pages have been requested.
     - On a pan or zoom, MapLibre instantly has all the information needed to fetch any new tiles. The underutilized network time metric ticks retroactively from the pan or zoom action until the last tile is requested.
  2. Unrendered resource time. Wall time between when MapLibre has downloaded resources such as tiles, sprite sheets, or glyph pages, and when those resources are actually rendered to the end user.² Time starts ticking as soon as MapLibre has at least one resource that needs to be rendered but has not been, and stops ticking when there are no more resources that need to be rendered but have not been.³ Examples:
     - A pan interaction brings one new tile into view and MapLibre requests the tile from the network. The unrendered resource time metric ticks from the time the last byte of the tile is received from the network until the time the tile is parsed into a suitable data structure and sent to the graphics subsystem.
     - The previous example gets a bit more complicated if the pan brought multiple new resources into play. The key is that the clock is ticking as long as there is at least one fully downloaded resource that has not yet been rendered, and stops when there are no more such resources (a sketch of this interval bookkeeping, which metric 1 shares, follows this list).
  3. Cancelled network time. Wall time dedicated to network requests which MapLibre cancels before they complete because a subsequent end user interaction obsoletes the request. This metric does not include time spent fully downloading resources that are never used, because that is already counted in unrendered resource time. Example: the map is initially showing zoom 14. The end user does a mouse wheel or pinch zoom out through zoom 13 before landing on zoom 12. Early in the process, MapLibre issued a number of tile fetches for zoom 13 but cancels them when the end user lands on zoom 12. The clock starts ticking when the first zoom 13 request is issued and stops ticking when the last one is cancelled.
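
To make the "clock ticks while at least one item is outstanding" bookkeeping behind metrics 1 and 2 concrete, here is a minimal TypeScript sketch. Everything in it is illustrative; none of it is existing MapLibre GL JS code:

```ts
// Accumulates wall time during which at least one tracked item is
// outstanding, e.g. a resource that is fully downloaded but not yet
// rendered (metric 2), or a request MapLibre knows it could make but
// hasn't yet made (metric 1). Names are hypothetical.
class IntervalAccumulator {
    private outstanding = new Set<string>();
    private intervalStart: number | null = null; // performance.now() timestamp
    private totalMs = 0;

    // An item became outstanding; the clock starts on the first one.
    begin(id: string): void {
        if (this.outstanding.size === 0) {
            this.intervalStart = performance.now();
        }
        this.outstanding.add(id);
    }

    // An item was resolved (rendered, requested, or deemed obsolete);
    // the clock stops only when nothing is left outstanding.
    end(id: string): void {
        if (!this.outstanding.delete(id)) return;
        if (this.outstanding.size === 0 && this.intervalStart !== null) {
            this.totalMs += performance.now() - this.intervalStart;
            this.intervalStart = null;
        }
    }

    // Total accumulated time, including any interval still open.
    value(): number {
        const open = this.intervalStart === null ? 0 : performance.now() - this.intervalStart;
        return this.totalMs + open;
    }
}
```

The retroactive cases in metric 1 (ticking from a tile's TTLB) would need begin() timestamps supplied from resource timing rather than the call time; this sketch ignores that wrinkle.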

Implementation Notes

The envisioned implementation is an opt-in observer pattern where an interested observer can install a metrics hook into the MapLibre GL JS client to receive periodic metric observations as they become available. The observer can choose what to do with the metrics at that point.
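
To make the hook's shape concrete, one possible sketch is below. The `MetricsSample` type, the `onMetrics` constructor option, and every other name here are hypothetical; none of this exists in MapLibre GL JS today:

```ts
// Hypothetical shape of an opt-in metrics hook.
interface MetricsSample {
    name: 'underutilizedNetworkTime' | 'unrenderedResourceTime' | 'cancelledNetworkTime';
    valueMs: number;     // wall time accumulated during the reporting window
    windowStart: number; // performance.now() at the start of the window
    windowEnd: number;   // performance.now() at the end of the window
}

type MetricsObserver = (samples: MetricsSample[]) => void;

// One conceivable opt-in surface is a constructor option (not real API):
//
// const map = new maplibregl.Map({
//     container: 'map',
//     style: 'https://demotiles.maplibre.org/style.json',
//     onMetrics: (samples) => samples.forEach((s) => console.log(s.name, s.valueMs)),
// });
```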

One example usage that is explicitly envisioned is a performance test or performance canary installed in a CI/CD pipeline. The performance canary would run a large program of repeatable map interactions on MapLibre instances with a metrics observation hook installed and publish the metrics to an appropriate metrics repository. The pipeline would have monitors set up to block the pipeline if the metrics cross specific alarm thresholds, and the metrics would be aggregated and fed into dashboards that are regularly monitored by humans who have an ownership stake in maps performance.
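
As a sketch of the canary's gating step, reusing the hypothetical `MetricsSample` shape and the `percentile` helper from the sketches above; the threshold values are placeholders, not recommendations:

```ts
// Fail the CI step if the median of any metric crosses its alarm
// threshold. `runInteractionProgram` stands in for a harness that
// drives a scripted set of pans and zooms and collects samples.
const ALARM_THRESHOLDS_MS: Record<MetricsSample['name'], number> = {
    underutilizedNetworkTime: 250, // placeholder values
    unrenderedResourceTime: 400,
    cancelledNetworkTime: 150,
};

async function canary(runInteractionProgram: () => Promise<MetricsSample[]>): Promise<void> {
    const samples = await runInteractionProgram();
    for (const name of Object.keys(ALARM_THRESHOLDS_MS) as MetricsSample['name'][]) {
        const values = samples.filter((s) => s.name === name).map((s) => s.valueMs);
        if (values.length === 0) continue;
        const p50 = percentile(values, 50);
        if (p50 > ALARM_THRESHOLDS_MS[name]) {
            throw new Error(`canary: median ${name} ${p50}ms exceeds ${ALARM_THRESHOLDS_MS[name]}ms`);
        }
    }
}
```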


Footnotes:

¹ It doesn't matter whether a request actually goes over the network or is served from a local cache such as the browser's cache. What is important is that MapLibre has access to the information needed to make the request but hasn't actually made the request yet.

² Where "rendered to the end user" means submitted to whatever system is responsible for graphical rendering. So, e.g., it could mean "sent to the GPU".

³ A resource can either be rendered or can stop needing to be rendered because MapLibre decides it's obsolete, like a tile for a zoom level the user has since left.

HarelM commented 3 days ago

I think this proposal is great, thanks for taking the time to write it down. I think it would also be interesting if you would present it in the monthly meeting.

A few notes worth looking into and taking into consideration:

  1. There's benchmark code ATM that allows uploading a "proprietary" build for every version to GitHub Pages to allow benchmarking. I believe this code is not a good solution to the problem, as it doesn't allow using an officially built "binary" and testing it. I believe the right solution is to create (or add) specific events so that an external executor can use them to measure things. This means that you won't be able to measure past versions before this is introduced, and that new events will only be available from a specific version onwards. I'm fine with that.
  2. Benchmarks are not run as part of the CI, which means their code can break and performance can degrade without anyone noticing. If this suggestion is incorporated into the CI (which I think it should be, otherwise we will end up in the same situation we are in now), there needs to be a clear way to know whether something caused a failure. Moreover, it should not be flaky, as flakiness would cause mistrust, which completely misses the point.
  3. I'm not sure cancelled time is a good metric, as I think it's OK to cancel stuff to improve the user experience, so this might be tricky.
  4. There are currently flags that change behavior (for example, tile cancellation) which can improve one metric and worsen another. How would this be measured? By looking at the default options?
  5. I'm not sure I know how to define this, but a metric I'd be interested in measuring is the time from the initial interaction to when the map finishes rendering - i.e. the user moves the map, and we measure when things are finished. It might already be covered by the defined metrics, IDK (see the sketch after this list for one approximation using existing events).
  6. Another idea of something to measure is how much time it takes to show, say, 80% of the map after a map movement - it is expected to see a non-fully-rendered map while moving, but it might be interesting to measure not when rendering is 100% complete, but when it's "good enough" from a user perspective. I wrote 80% but it might be a different number.
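
For what it's worth, point 5 can at least be approximated with events MapLibre GL JS already emits: `movestart` fires when a camera movement begins and `idle` fires once the map has settled and finished rendering. A minimal sketch follows; point 6's "80% rendered" variant would need new instrumentation, since no existing event reports partial completeness:

```ts
import type {Map as MaplibreMap} from 'maplibre-gl';

// Report the wall time from the first camera movement to the 'idle'
// event, which fires once no transitions remain and all visible tiles
// are loaded and rendered. Chained movements before the map goes idle
// are measured as one interaction, from the earliest 'movestart'.
function measureInteractionToIdle(map: MaplibreMap, report: (ms: number) => void): void {
    let start: number | null = null;
    map.on('movestart', () => {
        if (start === null) {
            start = performance.now();
        }
    });
    map.on('idle', () => {
        if (start !== null) {
            report(performance.now() - start);
            start = null;
        }
    });
}
```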

I definitely agree that 2-3 metrics are what we should aim for, but we might consider making sure we raise enough events with meaningful data so that someone who wants to measure something other than the defined 2-3 can do that on their own and make sure it is kept in a good state.

HarelM commented 3 days ago

CC: @ibesora

ibesora commented 1 day ago

This is great. I agree with @HarelM that you should present it in the monthly meeting if possible. Some notes on my end:

jonahadkins commented 49 minutes ago

> The things that get measured are the things that improve.

put it on a sticker! +1000

Heavily agree on raising "lots" of events even if we decide to only check two or three metrics. Different use cases will call for measuring different things.

agree with @ibesora here

overall, super supportive of getting this in! lmk how we can help, if at all