google / starlark-go

Starlark in Go: the Starlark configuration language, implemented in Go
BSD 3-Clause "New" or "Revised" License

Optimizing hashtable.grow() for faster hashtables #459

Closed SamWheating closed 1 year ago

SamWheating commented 1 year ago

Update: I decided that this is not a worthwhile optimization, given the modest performance improvements. I'm updating the TODO and making no other changes; leaving my original notes below for reference.


This addresses a TODO I found in the hashtable implementation, because it seemed like a clear performance win and an interesting investigation.

tl;dr: instead of using the hashtable.insert method to copy elements into the new bucket array, we can write a specialized hashtable.growInsert method which bypasses the deduplication and other checks that only apply when inserting into a pre-existing hashtable.

Optimizations include:

  • skipping the Equal check for duplicate keys, since keys being rehashed are already known to be unique
  • reusing each entry's already-computed hash instead of recomputing it

Let me know what you think or if you have any other suggested optimizations.

Benchmark results

Overall these results are less significant than I expected, but I suspect the performance gap could be widened by a purpose-built hashtable insertion benchmark (for example, keys with more compute-intensive hash functions, or benchmarks involving higher write volumes).
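One way to build such a growth-heavy benchmark is with the standard library's `testing.Benchmark` helper: insert many distinct keys into a fresh container each iteration so every run pays the full cost of repeated growth. This sketch uses Go's built-in map as a stand-in; a real version would target the starlark hashtable instead:

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// benchGrowHeavy measures inserting n distinct keys into a fresh map
// per iteration, so each op includes every intermediate resize.
func benchGrowHeavy(n int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			m := make(map[string]int) // deliberately no size hint: forces growth
			for k := 0; k < n; k++ {
				m[strconv.Itoa(k)] = k
			}
		}
	})
}

func main() {
	r := benchGrowHeavy(1 << 12)
	fmt.Println(r.N > 0 && r.NsPerOp() > 0)
}
```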

Before:

go clean -testcache && go test -v --run=BenchmarkHashtable -bench=BenchmarkHashtable ./starlark/... -benchtime=10000x
goos: darwin
goarch: arm64
pkg: go.starlark.net/starlark
BenchmarkHashtable
BenchmarkHashtable-10              10000            652108 ns/op
PASS

After:

go clean -testcache && go test -v --run=BenchmarkHashtable -bench=BenchmarkHashtable ./starlark/... -benchtime=10000x
goos: darwin
goarch: arm64
pkg: go.starlark.net/starlark
BenchmarkHashtable
BenchmarkHashtable-10              10000            638644 ns/op
PASS
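For reference, the two ns/op figures above work out to roughly a 2% improvement:

```go
package main

import "fmt"

func main() {
	// ns/op from the before and after runs above
	before, after := 652108.0, 638644.0
	fmt.Printf("%.1f%% faster\n", 100*(before-after)/before)
}
```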

Open Questions

  • Is the additional complexity of this change worth the ~2% performance improvement?

adonovan commented 1 year ago

Thanks for taking a crack at this.

  • Is the additional complexity of this change worth the ~2% performance improvement?

I'm not convinced that it is. It's a 2% improvement on a benchmark that exercises only the optimized operation, which itself accounts for (I guess) at most a few percent of total running time in a realistic workload. (It's unfortunate that we don't have realistic large-scale benchmarks in this repo.)

Can't insertGrow be expressed as a boolean parameter to the existing insert method that causes it to skip the Equal check (assuming it to be false)? That may be a reasonable compromise. The existing hash could also be passed in to avoid recomputing it; this might be best expressed by factoring out the common tails of insert and insertGrow (essentially everything from "retry:" onwards).
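A rough sketch of what that shape could look like, with invented names rather than the actual starlark-go code: the factored common tail takes a precomputed hash plus a flag, so the grow path shares the insertion logic while skipping the Equal scan.

```go
package main

import "fmt"

type node struct {
	key  string
	hash uint32
	next *node
}

type ht struct{ buckets []*node }

// insertWithHash is the hypothetical common tail: callers pass the
// hash they already have, and checkDup=false (the grow path) skips
// the duplicate scan entirely.
func (h *ht) insertWithHash(key string, hash uint32, checkDup bool) {
	i := hash % uint32(len(h.buckets))
	if checkDup {
		// General path: callers may re-insert an existing key.
		for n := h.buckets[i]; n != nil; n = n.next {
			if n.hash == hash && n.key == key {
				return
			}
		}
	}
	// Grow path falls straight through: keys are known to be unique.
	h.buckets[i] = &node{key: key, hash: hash, next: h.buckets[i]}
}

func main() {
	h := &ht{buckets: make([]*node, 8)}
	h.insertWithHash("a", 7, true)
	h.insertWithHash("a", 7, true)  // duplicate, skipped
	h.insertWithHash("b", 7, false) // grow path, no scan
	n := 0
	for e := h.buckets[7]; e != nil; e = e.next {
		n++
	}
	fmt.Println(n)
}
```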

SamWheating commented 1 year ago

Can't insertGrow be expressed as a boolean parameter to the existing insert method that causes it to skip the Equal check (assuming it to be false). That may be a reasonable compromise.

Yeah this is a good suggestion, I'll see if I can refactor this to reduce the code duplication.

SamWheating commented 1 year ago

Hey @adonovan,

I spent a bit of time this morning refactoring this, and any changes I made pretty much cancelled out the marginal performance gains (I guess due to the overhead of the if grow {} checks, additional control flow, etc.). Trying to bundle both code paths into a single method also introduces a similar level of complexity to the codebase as the changes proposed in this PR.

With that in mind, I think we should either continue with this approach, understanding that insert() and growInsert() are significantly different methods, or just close this ticket, since the opportunity for optimization isn't all that significant - either approach is totally fine by me.

Thoughts?

adonovan commented 1 year ago

I spent a bit of time this morning refactoring this, and any changes I made pretty much cancelled out the marginal performance gains

Thanks for investigating this, Sam. I think the gains are probably too slim to justify the additional logic, but rather than let this PR go to waste, let's record what you've found by replacing the old TODO comment with this:

// Double the number of buckets and rehash.
//
// Even though this makes reentrant calls to ht.insert,
// calls Equal unnecessarily (since there can't be duplicate keys),
// and recomputes the hash unnecessarily, the gains from
// avoiding these steps were found to be too small to justify
// the extra logic: -2% on hashtable benchmark.

You could remove the "pool" comment too, since that idea is rather fiddly to implement. (sync.Pool is good only for uniform objects, but these bucket arrays vary in size.)
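The sync.Pool point can be illustrated in a few lines. A pool recycles objects of one shape, which suits fixed-size bucket structs, but grow() needs arrays whose length doubles each time, so pooling them would require a pool per size class. The names and sizes below are made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

const bucketLen = 8

// bucketArray is a fixed-size object: ideal for sync.Pool.
type bucketArray [bucketLen]uintptr

var pool = sync.Pool{
	New: func() any { return new(bucketArray) },
}

func main() {
	b := pool.Get().(*bucketArray) // reuse a recycled array, or allocate
	b[0] = 42
	fmt.Println(len(b))
	pool.Put(b) // return it; contents must be treated as garbage afterwards
	// A variable-length []entry can't share this pool directly; you'd
	// need one pool per size class, which is the "fiddly" part.
}
```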

SamWheating commented 1 year ago

let's record what you've found by replacing the old TODO comment with this.

Done ✅

Are there any other parts of this codebase which you think might be good candidates for performance improvements? This was an interesting exercise and I've got some time set aside this week for similar work.

adonovan commented 1 year ago

Are there any other parts of this codebase which you think might be good candidates for performance improvements? This was an interesting exercise and I've got some time set aside this week for similar work.

Not really. When I developed the package I went over it quite carefully, trying to squeeze out the most efficiency within the current design, and then I profiled and optimized a large, performance-sensitive application (similar to Bazel) on realistic workloads (which unfortunately can't easily be turned into benchmarks; in any case, that particular project was cancelled). So I think most of the easy wins that actually matter have been claimed already. I'm reluctant to suggest that we optimize anything for its own sake, in the absence of a realistic benchmark that some user actually cares about, because it's a waste of effort that could be spent elsewhere, and it makes the code more complicated than it needs to be.

But I appreciate your offer, and this PR. Replacing a TODO with an informed statement of fact is actual progress.

SamWheating commented 1 year ago

Thanks for the background - I wasn't sure how much profiling work had already been done, and I totally agree that there's no use in optimizing for the sake of optimizing.