On Slack in the statistics channel Yakir Gagnon asked why this package doesn't implement a Median. At last, I thought, finally I can apply something that I've learned in the advanced algorithms class of my studies. Credits for the information in the rest of this issue go to the course notes written by Mark de Berg. Except for any mistakes: those can be attributed to me. I hope this information is useful. My apologies if this is information is trivial for someone who spent much time with streaming algorithms; I don't know enough about streaming algorithms to judge the complexity of the approach.
Let
σ be a stream <a_0, ..., a_{m-1}> of m distinct items,
the rank be defined as rank(a_i) = 1 + the number of items in σ that are smaller than a_i
and median(σ) = item of rank floor((m+1)/2) or ceil((m+1)/2).
Theorem. Any deterministic algorithm for computing the median exactly must use Ω(m log (n/m)) bits in the worst case, when m < n/2.
Therefore, we need to settle for an ϵ-approximate median: an item a_i with
and we will also need a randomized algorithm. Given a function random(a, b) an integer r with a ≤ r ≤ b uniformly at random. (So, random(a, b) = rand(a:b) if I'm not mistaken.).
The idea is now to run the algorithm k times (this can be simultaneously in the streaming setting) and collect k answers j_i, ..., j_k. Next, report the median of the answers. The larger k, the higher the probability of a good answer. Pseudo algorithm (translated to Julia syntax):
function Median(σ::Stream, m::Int)
R = [random(0, m - 1) for i in 1:k]
J = []
for (i, a_i) in enumerate(σ)
if i in R
push!(J, a_i)
end
end
return median(J)
end
Note that the algorithm assumes that we know the length of the stream m.
Theorem. By using the median trick, we can report a (1/4)-approximate median with probability at least 1-2(e/4)^{k/4} and uses O(k log(n + m)) bits of storage.
It's too much work for me to figure out the details of this proof again and to copy this proof into a GitHub issue. In summary, compute E[X] and use the Chernoff bound for Poisson trials. Let me know if more info is required.
On Slack in the statistics channel Yakir Gagnon asked why this package doesn't implement a
Median
. At last, I thought, finally I can apply something that I've learned in the advanced algorithms class of my studies. Credits for the information in the rest of this issue go to the course notes written by Mark de Berg. Except for any mistakes: those can be attributed to me. I hope this information is useful. My apologies if this is information is trivial for someone who spent much time with streaming algorithms; I don't know enough about streaming algorithms to judge the complexity of the approach.Let
<a_0, ..., a_{m-1}>
ofm
distinct items,rank(a_i) = 1 + the number of items in σ that are smaller than a_i
median(σ) = item of rank floor((m+1)/2) or ceil((m+1)/2)
.Theorem. Any deterministic algorithm for computing the median exactly must use
Ω(m log (n/m))
bits in the worst case, whenm < n/2
.Therefore, we need to settle for an ϵ-approximate median: an item
a_i
withfloor((1/2 - ϵ)(m + 1)) ≤ rank(a_i) ≤ ceil((1/2 + ϵ)(m + 1))
and we will also need a randomized algorithm. Given a function
random(a, b)
an integerr
witha ≤ r ≤ b
uniformly at random. (So,random(a, b) = rand(a:b)
if I'm not mistaken.).The idea is now to run the algorithm
k
times (this can be simultaneously in the streaming setting) and collectk
answersj_i, ..., j_k
. Next, report the median of the answers. The largerk
, the higher the probability of a good answer. Pseudo algorithm (translated to Julia syntax):Note that the algorithm assumes that we know the length of the stream
m
.Theorem. By using the median trick, we can report a (1/4)-approximate median with probability at least
1-2(e/4)^{k/4}
and usesO(k log(n + m))
bits of storage.It's too much work for me to figure out the details of this proof again and to copy this proof into a GitHub issue. In summary, compute
E[X]
and use the Chernoff bound for Poisson trials. Let me know if more info is required.