joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License
831 stars 62 forks source link

Add `TopK` #246

Closed mcabbott closed 2 years ago

mcabbott commented 2 years ago

This wants to add a way to keep the largest k elements.

Many ways to do this were recently timed here: https://discourse.julialang.org/t/maintaining-a-fixed-size-top-n-values-list/78868/15 This matches the simplest implementation there. It's possible that the slight gains of the others are worth it, although on my computer they don't seem to be visible, on that example.

However, the mutable struct with o.n += 1 at every step seems to make a huge difference. Is this expected, and or is there a way around it?

julia> v1 = init1(N); v1 = min_values1(N, x, v1); v11 = init1(N);

julia> @btime min_values1($N, $x, $v11);
  10.270 ms (0 allocations: 0 bytes)

julia> v4 = init4(N); v4 = min_values4(N, x, v4); @assert last(v1) == last(v4); v44 = init4(N);

julia> @btime min_values4($N, $x, $v44);
  10.370 ms (0 allocations: 0 bytes)

julia> v6 = init6(N); v6 = min_values6(N, x, v6);

julia> @btime min_values6($N, $x, $v66);
  10.248 ms (0 allocations: 0 bytes)

# This PR:

julia> v7 = fit!(TopK(N; rev=true), x).value;

julia> @assert sort(v7) == sort(v1)

julia> @btime fit!($(TopK(N; rev=true)), $x);
  33.560 ms (0 allocations: 0 bytes)
  15.436 ms (0 allocations: 0 bytes)  # without o.n += 1 step
  10.273 ms (0 allocations: 0 bytes)  # with immutable struct, and without o.n += 1