Stochastic sorting? - Githubissues

sh1boot commented 7 years ago

Analogous to Approximate counting, consider maintaining only a grand total and the order of the tokens.

Assume that the curve evolves according to the total, approaching some statistical idealised model (is this reasonable? I should look at my data). From that, rather than incrementing counts, we consider decreasing rank by one (possibly more with low totals?) with a probability appropriate to the curve we expect at that position.

Because all I want is a curve and an order, so maybe going straight to it would be the right answer?

Would these tables be easy to merge with approximate matches?

sh1boot commented 7 years ago

Maybe low totals just use BTF. Maybe that could be managed in a cheap data structure until the population reaches something worth transitioning to a vector for? Then throw away tables that were never promoted.

sh1boot commented 7 years ago

I think this extends to monitoring intermediate points (not percentiles, but whatever the equivalent is when you count unique values only once -- fixed ratios along the x axis), if the local gradients can be trusted.

For a point of interest -- eg., the count of the rarest item -- keep an accurate count of the item in that bin. On the occasion that we decide to promote it and move the next highest item down the model says this is happening because they've just passed being equal, so the estimated count for the new entry in that position is the same value and we proceed to maintain the counter according to that new entry.

Keeping several of these points might make it easier to model the curve and consequently to better predict promotion probabilities to properly align the watched points with the appropriate index.

This seems optimistic... how fortunate I am that I don't have time to ruin the magic by applying actual data.

sh1boot / obf

Stochastic sorting? #1