joshday / OnlineStats.jl

⚡ Single-pass algorithms for statistics
https://joshday.github.io/OnlineStats.jl/latest/
MIT License
831 stars 62 forks source link

Streaming algorithm for the median problem #239

Closed rikhuijzer closed 2 years ago

rikhuijzer commented 2 years ago

On Slack in the statistics channel Yakir Gagnon asked why this package doesn't implement a Median. At last, I thought, finally I can apply something that I've learned in the advanced algorithms class of my studies. Credits for the information in the rest of this issue go to the course notes written by Mark de Berg. Except for any mistakes: those can be attributed to me. I hope this information is useful. My apologies if this is information is trivial for someone who spent much time with streaming algorithms; I don't know enough about streaming algorithms to judge the complexity of the approach.

Let

Theorem. Any deterministic algorithm for computing the median exactly must use Ω(m log (n/m)) bits in the worst case, when m < n/2.

Therefore, we need to settle for an ϵ-approximate median: an item a_i with

floor((1/2 - ϵ)(m + 1)) ≤ rank(a_i) ≤ ceil((1/2 + ϵ)(m + 1))

and we will also need a randomized algorithm. Given a function random(a, b) an integer r with a ≤ r ≤ b uniformly at random. (So, random(a, b) = rand(a:b) if I'm not mistaken.).

The idea is now to run the algorithm k times (this can be simultaneously in the streaming setting) and collect k answers j_i, ..., j_k. Next, report the median of the answers. The larger k, the higher the probability of a good answer. Pseudo algorithm (translated to Julia syntax):

function Median(σ::Stream, m::Int)
    R = [random(0, m - 1) for i in 1:k]
    J = []
    for (i, a_i) in enumerate(σ)
        if i in R
            push!(J, a_i)
        end
    end
    return median(J)
end

Note that the algorithm assumes that we know the length of the stream m.

Theorem. By using the median trick, we can report a (1/4)-approximate median with probability at least 1-2(e/4)^{k/4} and uses O(k log(n + m)) bits of storage.

image

It's too much work for me to figure out the details of this proof again and to copy this proof into a GitHub issue. In summary, compute E[X] and use the Chernoff bound for Poisson trials. Let me know if more info is required.