google / starlark-go

Starlark in Go: the Starlark configuration language, implemented in Go
BSD 3-Clause "New" or "Revised" License
2.31k stars 209 forks source link

Better issubset performance #514

Closed marco6 closed 11 months ago

marco6 commented 11 months ago

The current implementation for set.issubset is simple, but can be easily exploited to create very big (unnecessary) memory allocations:

s = set([]) # empty set!
s.issubset(range(100000)) # allocates a lot and takes a lot of time

This PR improves the current implementation by:

This PR will not avoid worst case scenarios, where execution time (but not memory!) is still depending on the size of the iterator:

s = set([-1])
s.issubset(range(100000)) # takes a lot of time

I added few benchmark for some representative cases. Here are my results: before.txt after.txt, and benchstat output:

$ benchstat before.txt after.txt
goos: linux
goarch: amd64
pkg: go.starlark.net/starlark
cpu: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
                                                │  before.txt   │              after.txt               │
                                                │    sec/op     │    sec/op     vs base                │
Starlark/bench_issubset_duplicate_large_small-4     282.3µ ± 1%   124.2µ ±  1%  -55.99% (p=0.000 n=20)
Starlark/bench_issubset_duplicate_same-4            86.56µ ± 1%   66.75µ ±  1%  -22.89% (p=0.000 n=20)
Starlark/bench_issubset_duplicate_small_large-4    924.18µ ± 2%   74.65µ ±  1%  -91.92% (p=0.000 n=20)
Starlark/bench_issubset_unique_large_small-4        261.5µ ± 1%   122.0µ ±  1%  -53.37% (p=0.000 n=20)
Starlark/bench_issubset_unique_same-4              225.20µ ± 1%   82.47µ ± 11%  -63.38% (p=0.000 n=20)
Starlark/bench_issubset_unique_small_large-4      2093.28µ ± 1%   76.68µ ±  4%  -96.34% (p=0.000 n=20)
geomean                                             375.1µ        88.41µ        -76.43%

                                                │   before.txt   │              after.txt               │
                                                │      B/op      │     B/op      vs base                │
Starlark/bench_issubset_duplicate_large_small-4    248.22Ki ± 0%   80.41Ki ± 0%  -67.60% (p=0.000 n=20)
Starlark/bench_issubset_duplicate_same-4            31.66Ki ± 0%   10.12Ki ± 0%  -68.02% (p=0.000 n=20)
Starlark/bench_issubset_duplicate_small_large-4    480.02Ki ± 0%   10.14Ki ± 0%  -97.89% (p=0.000 n=20)
Starlark/bench_issubset_unique_large_small-4       248.17Ki ± 0%   80.41Ki ± 0%  -67.60% (p=0.000 n=20)
Starlark/bench_issubset_unique_same-4              247.79Ki ± 0%   10.23Ki ± 0%  -95.87% (p=0.000 n=20)
Starlark/bench_issubset_unique_small_large-4      1848.19Ki ± 0%   10.23Ki ± 0%  -99.45% (p=0.000 n=20)
geomean                                             274.6Ki        20.27Ki       -92.62%

                                                │ before.txt  │             after.txt              │
                                                │  allocs/op  │ allocs/op   vs base                │
Starlark/bench_issubset_duplicate_large_small-4    21.00 ± 0%   12.00 ± 0%  -42.86% (p=0.000 n=20)
Starlark/bench_issubset_duplicate_same-4          13.000 ± 0%   7.000 ± 0%  -46.15% (p=0.000 n=20)
Starlark/bench_issubset_duplicate_small_large-4   17.000 ± 0%   7.000 ± 0%  -58.82% (p=0.000 n=20)
Starlark/bench_issubset_unique_large_small-4       21.00 ± 0%   12.00 ± 0%  -42.86% (p=0.000 n=20)
Starlark/bench_issubset_unique_same-4              21.00 ± 0%   12.00 ± 0%  -42.86% (p=0.000 n=20)
Starlark/bench_issubset_unique_small_large-4       24.00 ± 0%   12.00 ± 0%  -50.00% (p=0.000 n=20)
geomean                                            19.14        10.03       -47.61%