gravitystorm opened this issue 9 years ago
A test that can be run on EC2, takes less than an hour, and costs less than a dollar to run. This could be used as a smoke test for pull requests.
My full-world tests actually take <1h to run (on fast hardware). They do cost more than $1 to run on EC2.
mid-zoom biased
For total rendering time, it's high-zoom biased, I believe.
A test that can be run on EC2, takes less than an hour, and costs less than a dollar to run. This could be used as a smoke test for pull requests.
EC2 is subject to about an order of magnitude performance variation between instances. For a particular benchmark configuration on Amazon, repeated many times on different instances configured identically, the 10th-percentile performance is 587 TPS and the 90th-percentile is 2537 TPS.
Even a 2–3x drop in rendering throughput would be a pretty catastrophic style change, but it still couldn't be distinguished from that instance-to-instance variation on EC2.
Do we already have some tools/scripts to test the rendering speed, even roughly? I think Kosmtik automation could be a nice option. We could, for example, craft some exporting URLs with a big enough bbox and execute them for many zoom levels automatically, but maybe there is a more elegant way of doing it.
Do we already have some tools/scripts to test the rendering speed, even roughly?
render_list with a list of tiles from production is the standard way.
I think Kosmtik automation could be a nice option.
There are enough differences that it's not a great option, except for catching horrible performance failures.
We could, for example, craft some exporting URLs with a big enough bbox and execute them for many zoom levels automatically, but maybe there is a more elegant way of doing it.
It's essential the workload is realistic.
The method I've used lately has been to randomly sample running queries and ignore the time spent in Mapnik.
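(Not from the thread: a minimal sketch of one way to do that sampling, assuming a local PostGIS database named gis; adjust the connection options and interval to your setup.)

while sleep 10; do
  # grab whatever queries are running right now, longest-running first
  psql -d gis -Atc "SELECT now() - query_start AS runtime, left(query, 120)
                    FROM pg_stat_activity
                    WHERE state = 'active' AND pid <> pg_backend_pid()
                    ORDER BY runtime DESC;"
done

Collecting those samples over a rendering run gives a rough picture of which layer queries dominate, independent of the time Mapnik itself spends drawing.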
Do you have any scripts that other people could use and compare results or is it just manual testing?
I haven't needed any scripts, it's all been one-line command line stuff.
Could you share it anyway? Even if it's short, we don't have any standard tools to measure performance and compare results at the moment.
cat tile_list | render_list -n <N> -l 256 -f
for normal stuff, and
render_list -n <N> -l 256 --all -f -z 0 -Z 12
for testing monthly rerendering.
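For readability, here are the same one-liners with long option names, as I read the flags (not from the thread); tile_list is assumed to be a file of "x y z" tile coordinates, one per line, e.g. sampled from production logs (check the render_list man page for the exact input format):

# render a sampled list of tiles, forcing re-render of tiles that already exist
cat tile_list | render_list --num-threads=<N> --max-load=256 --force
# full forced re-render of z0-z12, as used for the monthly rerendering test
render_list --num-threads=<N> --max-load=256 --all --force --min-zoom=0 --max-zoom=12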
I'm not sure, but I think this message relates to osm-carto PostgreSQL performance testing (comparison of current setup with partitioned tables):
https://lists.openstreetmap.org/pipermail/dev/2018-March/030168.html
There's a tool called render_speedtest from the renderd package (it can be built directly from https://github.com/openstreetmap/mod_tile). It tries to make a thorough test; here's an example snippet from a run on my virtual machine with 1 thread (the default):
Zoom(9) Now rendering 4 tiles
Rendered 4 tiles in 1.34 seconds (2.98 tiles/s)
Zoom(10) Now rendering 12 tiles
Rendered 12 tiles in 5.78 seconds (2.08 tiles/s)
Zoom(11) Now rendering 36 tiles
Rendered 36 tiles in 11.45 seconds (3.14 tiles/s)
Zoom(12) Now rendering 120 tiles
Rendered 120 tiles in 37.66 seconds (3.19 tiles/s)
Zoom(13) Now rendering 456 tiles
Rendered 456 tiles in 163.32 seconds (2.79 tiles/s)
Zoom(14) Now rendering 1702 tiles
Rendered 1702 tiles in 667.46 seconds (2.55 tiles/s)
There's a tool called render_speedtest from the renderd package
Don't render down to z14. Any rendering test that tries to render the world past z12 will give distorted results.
What causes this distortion?
The fact that it doesn't represent a realistic workload. All performance testing needs to test something that matters, and the time to render the world on z13+ doesn't matter because no one does it. In particular, the average complexity of metatiles will be different than a tile server's workload, as will be the balance between different zooms. There are 4x as many z14 tiles as z13 tiles, but not 4x as many z14 tiles rendered as z13 tiles rendered.
https://planet.openstreetmap.org/tile_logs/renderd/renderd.yevaud.20150503.log.xz is an old log of what is rendered, taking into account the tile CDN and the renderd tile store
EC2 is subject to about an order of magnitude performance variation between instances. For a particular benchmark configuration on Amazon, repeated many times on different instances configured identically, the 10th-percentile performance is 587 TPS and the 90th-percentile is 2537 TPS.
This is very surprising. It's not something we've encountered at my job (web application performance testing). I would be interested in understanding this issue better, @pnorman do you remember what instance type this was?
Those aren't my test results; they were from a comprehensive test comparing lots of different cloud options, back when gp2 storage had just come out.
I'm sure it's gotten better, but there's still going to be variation, and before benchmarking I'd want to test the machine with fio for disk and something else for CPU.
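For example (not from the thread, and the parameters are only illustrative), a quick random-read baseline on the database volume might look like:

# /mnt/db is a stand-in for wherever the PostgreSQL data actually lives
fio --name=randread --filename=/mnt/db/fio.test --size=4G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting

Repeating that on a few instances before benchmarking shows how much of any apparent rendering difference could simply be I/O variance.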
What if we used a smarter testing pattern, like "old - new - old - new - old - new..." on the same machine (instead of old and new each being run once, on different machines and at different times)? That would help compare tests more directly and avoid non-systematic errors.
Do we have a machine for such testing? Maybe Travis could be used as a first line of defense?
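A very rough sketch of that interleaving on a single machine (hypothetical, assuming $OLD and $NEW hold the two commit hashes, tile_list is the tile file from above, and the rendering stack is already set up; reloading renderd with the regenerated style is setup-specific):

for run in 1 2 3; do
  for ref in "$OLD" "$NEW"; do
    git checkout "$ref"
    carto project.mml > mapnik.xml
    # reload renderd with the regenerated style here (setup-specific)
    /usr/bin/time -f "$ref run $run: %e s" \
      sh -c 'cat tile_list | render_list -n 4 -l 256 -f'
  done
done

Alternating the runs means slow background noise (caches warming, other load) is spread across both commits rather than biasing one of them.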
As a partial solution, could we run EXPLAIN on all queries and check the cost and query plan reported by Postgres? That would be cheap to run and would catch a part of the performance problems with queries.
How much data would you load for this?
I am thinking of using a small European country, such as Portugal. I made a script that runs EXPLAIN on all queries in the MML. Even running it on Luxembourg would catch the addresses sequential scan (#3937):
Commit 05dc392c, just before the fix:
  Total cost: 35698
  Most expensive: addresses (34821.11)
Commit e66889e2b2512b028551e19e12a9103eae176d09, with the fix:
  Total cost: 884
  Most expensive: turning-circle-casing (318.99)
However, even though it's quite successful in this case, these kinds of bugs seem quite rare to me, and I doubt the script brings great benefits for monitoring performance in general.
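The script itself isn't included in the thread; a minimal shell sketch of the idea (hypothetical: it assumes each layer's query has been extracted from project.mml into queries/<layer>.sql as a runnable SELECT, with Mapnik's !bbox! and !scale_denominator! tokens already substituted, and that the extract is loaded into a database named gis) could look like:

for f in queries/*.sql; do
  # take the top line of the plan and pull out the total estimated cost
  cost=$(psql -d gis -Atc "EXPLAIN $(cat "$f")" | head -n 1 |
         sed -E 's/.*\.\.([0-9]+\.[0-9]+).*/\1/')
  echo "$cost $(basename "$f" .sql)"
done | sort -rn | head

Summing the per-layer costs, or flagging any layer whose cost jumps between two commits, is what would have caught a regression like #3937 cheaply.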
Cartography is more important than performance, but performance is much harder to test! We've been heavily reliant on @pnorman to run performance tests on his setup but I'd like to open that up so that more people can become involved and that we can automate it as much as possible.
There's a whole load of hardware considerations that make cross-server test results impossible to compare, but we should concentrate on being able to compare two different commits and give relative answers. It would be useful to have:
The tests should be roughly indicative of real world rendering patterns e.g. city-biased and mid-zoom biased.