quickwit-oss / search-benchmark-game

Search engine benchmark (Tantivy, Lucene, PISA, ...)
https://tantivy-search.github.io/bench/

Continuous benchmark #10

Open petr-tik opened 5 years ago

petr-tik commented 5 years ago

Continuous benchmarking

Add a CI-like job to run the benchmark automatically.

It will help developers, potential users and tantivy-curious people track performance numbers continuously. Automating it also means less stress and hassle for tantivy's maintainers and developers.

Granularity

We can choose to either run a benchmark on every commit or on every release.

On every commit

Integrate the benchmarking suite into CI on the main tantivy repo. In TravisCI's after_success build stage, run the benchmark and append the results to results.json in the search-benchmark repo.
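For the per-commit variant, the after_success hook could call something like the sketch below. This is only a sketch: the `make bench-json` target, the clone URL and the shape of results.json (a flat JSON list of runs) are assumptions, not how the harness actually works.

```python
#!/usr/bin/env python3
"""Hypothetical after_success hook: run the benchmark for the current
commit and append the numbers to results.json in the benchmark repo."""
import json
import os
import subprocess

BENCH_REPO = "git@github.com:tantivy-search/search-benchmark-game.git"  # assumed remote


def run_benchmark() -> dict:
    # Assumes a Makefile target that prints one JSON document on stdout.
    out = subprocess.run(["make", "bench-json"], check=True,
                         capture_output=True, text=True)
    return json.loads(out.stdout)


def append_results(result: dict) -> None:
    subprocess.run(["git", "clone", "--depth", "1", BENCH_REPO, "bench"], check=True)
    path = os.path.join("bench", "results.json")
    with open(path) as f:
        results = json.load(f)  # assumed to be a flat list of runs
    result["commit"] = os.environ.get("TRAVIS_COMMIT", "unknown")
    results.append(result)
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    subprocess.run(["git", "-C", "bench", "commit", "-am", "Add benchmark results"], check=True)
    subprocess.run(["git", "-C", "bench", "push", "origin", "master"], check=True)


if __name__ == "__main__":
    append_results(run_benchmark())
```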

Pros:

  - Commit-specific perf numbers - easier to triage perf regressions, and they build a more detailed picture of the hot path for the future.
  - Automated - no need to fiddle with and re-run benchmarks locally.

Costs/cons:

  - Too much noise - some commits are WIP or harm perf for the sake of a refactor. Is it really necessary to keep that data?
  - Makes every CI job run longer.
  - Benchmarking should be done on a dedicated machine to guarantee similar conditions, while CI jobs run inside uncontrolled layers of abstraction (Docker inside a VM, inside another VM). To control the environment and keep it automated, we would need to dedicate a VPS instance, which is an expense, a potential security vulnerability and an administration burden.

On every release

Same as above, but use git tags to tell whether a commit corresponds to a new release.
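The gate for this variant is only a few lines: benchmark only when the current commit carries a release tag. TravisCI exposes this via the TRAVIS_TAG environment variable; outside CI we can ask git directly. A minimal sketch:

```python
import os
import subprocess


def is_release_commit() -> bool:
    """True when HEAD corresponds to a tagged release."""
    if os.environ.get("TRAVIS_TAG"):
        return True
    # `git describe --exact-match` exits non-zero when no tag points at HEAD.
    probe = subprocess.run(
        ["git", "describe", "--exact-match", "--tags", "HEAD"],
        capture_output=True,
    )
    return probe.returncode == 0


if __name__ == "__main__":
    print("run benchmark" if is_release_commit() else "skip benchmark")
```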

Pros:

  - Fewer runs - cheaper on hardware and doesn't slow builds down.
  - Releases are usually semantically important points in history, where we are most interested in perf.

Cons/costs:

  - Still needs dedicated hardware to run consistently.
  - Needs push access to the tantivy-benchmark repo.

Presentation

Showing data from every commit might be unnecessarily overwhelming. The current benchmark front-end is clean (imho) and makes it easy to compare results across queries and versions.

On the front-end, we can show 0.6, 0.7, 0.8, 0.9 and the latest commit or release.

Power users or admins can be given the option to expand the table to every commit.
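To keep the default view small, the published data could simply be trimmed before rendering. A rough sketch, assuming each entry in results.json carries a tag field - the field name is a guess at the schema, not the actual one:

```python
import json

SHOWN_TAGS = {"0.6", "0.7", "0.8", "0.9"}


def default_view(results: list, latest: str) -> list:
    """Keep only the headline releases plus the most recent run."""
    return [r for r in results if r.get("tag") in SHOWN_TAGS or r.get("tag") == latest]


with open("results.json") as f:
    trimmed = default_view(json.load(f), latest="master")
```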

Implementation

A VPS that watches the main tantivy repo, builds and runs the benchmark, and commits new results at an agreed frequency.
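A minimal sketch of the watcher loop such a VPS could run - the run_benchmark.sh entry point, the local clone path and the hourly polling frequency are placeholders, not decisions:

```python
import subprocess
import time

POLL_SECONDS = 3600      # how often to check for new commits (placeholder)
REPO_DIR = "tantivy"     # local clone of the main tantivy repo (assumed path)

last_seen = None
while True:
    subprocess.run(["git", "-C", REPO_DIR, "fetch", "origin", "master"], check=True)
    head = subprocess.run(["git", "-C", REPO_DIR, "rev-parse", "origin/master"],
                          check=True, capture_output=True, text=True).stdout.strip()
    if head != last_seen:
        subprocess.run(["git", "-C", REPO_DIR, "checkout", head], check=True)
        # run_benchmark.sh would build tantivy at this commit, run the queries
        # and commit the new numbers to the benchmark repo.
        subprocess.run(["./run_benchmark.sh", head], check=True)
        last_seen = head
    time.sleep(POLL_SECONDS)
```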

Thoughts?

fulmicoton commented 5 years ago

That would be awesome of course. Do you want to take that over?

I have had some bad performance regressions due to the compiler in the past - not an obvious jemalloc-related thing, but a change in inlining that had a catastrophic impact. It would be great to spot those rapidly.

petr-tik commented 5 years ago

Happy to help.

Can we flesh out the design before I start?

A couple of questions:

  1. Do you want to track a) every commit or b) every release/tag?
  2. If per commit, do you want perf regressions to block PR merges? I don't know if it's possible - will need to research. It will also require that no commits are made directly to master, which isn't the case at the moment. A rough sketch of what the check itself could look like follows this list.
  3. Do you currently run the benchmarks on your own machine? If not, what's your current config for running benchmarks?
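On question 2, if we did gate PRs on perf, the check itself could be a simple threshold comparison against the last recorded run - a sketch only, the 10% tolerance and the per-query timing layout are made up:

```python
def regressions(baseline: dict, current: dict, tolerance: float = 0.10) -> list:
    """Return the queries whose timing got more than `tolerance` slower.

    Both arguments are assumed to map query name -> microseconds per run;
    the real results.json layout may differ.
    """
    return [query for query, us in current.items()
            if query in baseline and us > baseline[query] * (1.0 + tolerance)]
```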

I am concerned about benchmarking on TravisCI, because it's an uncontrolled environment running on layers of abstraction, which is bound to make results unreliable. The author of criterion.rs says this affects results.

The best option is to dedicate a server, give TravisCI ssh access to it and run the benchmarks there. The server can be either virtual or physical. VPS providers like DigitalOcean, Linode and OVH still add abstraction, but are hopefully more consistent; this needs to be tested.

If you or anyone else has a home server they don't mind dedicating to this, that would be great. We could control the environment and benchmark in consistent conditions. It still requires ssh access from TravisCI, which in turn requires trust between the people working on tantivy.
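If we go the dedicated-server route, the Travis side could stay tiny and just hand the commit hash to the benchmark box over ssh. The host name and remote script here are placeholders:

```python
import os
import subprocess

commit = os.environ.get("TRAVIS_COMMIT", "HEAD")
# The benchmark box does the heavy lifting; Travis only triggers the run.
subprocess.run(["ssh", "bench@bench.example.org", "./run_benchmark.sh", commit], check=True)
```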

petr-tik commented 5 years ago

I found packet.net, a bare-metal server provider with an API and advertised support for open source. Whether it's worth using depends on our requirements, pricing and the availability of other options. We might not need them if we decide to benchmark less often than every CI run, or packet might simply be too expensive.

If/when we clarify the points above, I will be happy to start the relevant work and any conversations with providers.

Can you please take a look at the questions above?