vexdb / knth

11 stars 0 forks source link

Closely Related Material #1

Open c-blake opened 4 months ago

c-blake commented 4 months ago

..is https://lemire.me/blog/2017/06/14/quickselect-versus-binary-heap-for-top-k-queries/ which I think was not aware of the poorly propagated application to the small-ish k / large stream case of: https://quickwit.io/blog/top-k-complexity. Been meaning to follow-up on the latter in the context of my topn program with quickselect & a strong partition strategy, but haven't gotten to it yet.

Also, adix/mvstat uses adix/lghisto which uses adix/bist to implement rarely seen (never anywhere else by me, anyway) Fenwick Tree-based Parzen-interpolated mid-quantiles. Fenwick trees allow update/query costs that are O(log(quantile accuracy))) while Parzen interpolation maximizes accuracy for given space costs. Together this pays off on rolling / moving data windows. For very high accuracy, in the dynamic/moving case, you probably do a bit better with a b-tree.

felipecruz commented 4 months ago

Thanks for all the material. I need some time to digest :)

Are you working on some specific application domain?

My focus is vector similarity search. The quickselect/median function was just a piece of code for other structures such as KD-trees and others.

BTW, I just committed quickselect with median-of-median, better than the random pivot. As there is no internal tracking of what is hapenning, to get all k elements you need to call it k times which is cleary worse than the $O(n \log(k))$. It was not my original intention to implement but maybe for fun :)

c-blake commented 4 months ago

Are you working on some specific application domain?

I like order statistics like quantiles in general due to their robustness, and small k top-k is often the most interesting tail of some data. Such distributional summaries are mostly what I think of when I think of these ideas, but of course "ordering" has a great many applications.

felipecruz commented 4 months ago

@c-blake nice! I'll keep adding related things here.. but the algorithm to return the top K (not just the K-th element) might be a nice addition in the future.