Interest in a weighted index based on balanced trees?

marcusklaas commented 4 years ago

Background

What is your motivation? The current WeightedIndex is pretty good. Lookups perform very well. And while it's possible to do updates, they are potentially slow as the (average) complexity is O(n). For matchmaking systems where updates are about as frequent as samples, this may be better served by an underlying datastructure which offers quick updates.

What type of application is this? (E.g. cryptography, game, numerical simulation) Real-time matchmaking

Feature request

If I'm not mistaken, balanced tree structures like AVL trees or red-black trees enable O(log(n)) operations (search/ sample, update, insert, delete). The constant coefficient will obviously be worse than an implementation using a vector, but for large n, it should pay off.

Is this something that could have a place in rand? I'd be willing to take a shot at it if so.

vks commented 4 years ago

I'm not sure how significant the performance gains world be, but in principle such an alternative implementation could live in rand_distr. There is a precedent: we already have a WeightedIndex implementation employing the alias method there.

dhardy commented 4 years ago

update_weights is not appropriate for your use-case?

Well, that doesn't handle insertions/deletions. One way of getting around that might be to allow dummy entries with weight zero and over-allocate initially, periodically re-allocating, and tracking a "free list". Getting good performance may require a little more work but it's not obvious which would perform best.

Also note that due to floating-point inaccuracies you might want to periodically re-calculate the cumulative weights anyway.

dhardy commented 4 years ago

Sorry, I'm forgetting that one still needs to adjust all cumulative weights after the modified entry, which makes that API especially bad for random updates. Yes, a tree based structure storing the sub-tree total in each node could be a decent alternative, if you don't want to just create a new WeightedIndex for each sample (O(n^2), assuming the number of samples is proportional to the number of entries).

xmakro commented 9 months ago

I've recently implemented the 'tree based structure storing the sub-tree total in each node' outlined by @dhardy, and would be interested in upstreaming it.

The implementation is very straight forward. The tree is stored as a flat vector, and the value of each node is defined as follows:

value(i) =  weight(i) + value(2*i+1) * value(2*i+2)

For sampling an index, we sample a target_weight uniformly between between 0 and total_weight of the entire tree. Then we start walking the tree from the root and repeatedly descend to the child until we reach the node index that corresponds to the target_weight.

The runtimes of the operations are:

Constructing the initial tree from a slice of weights: O(n)
Updating the weight of a node (we walk up the tree): O(log n)
Sampling an index (we walk down the tree): O(log n)

Would a PR for this be welcome? I'm happy to work through any discussions.

Open questions are:

What should happen when the total sum of weights in the tree is zero? WeightedIndex raises an error before the update. I actually found this inconvenient and would prefer if we either switch to uniform sampling or raise an error on sampling.
Should we allow adding new nodes? It would be easy to support.

dhardy commented 9 months ago

Sorry for not answering sooner.

Would a PR for this be welcome? I'm happy to work through any discussions.

Yes, I think so — but I'm not so sure where. We already have rand::distributions::WeightedIndex and rand_distr::weighted_alias::WeightedAliasIndex. Probably the only reason to have WeightedIndex in rand is for use by rand::seq::SliceRandom so likely to rand_distr. @vks?

What should happen when the total sum of weights in the tree is zero?

For usage with a fixed distribution of weights it surely makes sense to error as soon as possible (i.e. at construction time). For usage with mutating weights, perhaps there is utility in not raising an error until sampling time. I don't think we should just switch to linear sampling (there may be arguments around rounding errors, but if an item is added with weight=0 I would expect it to never be sampled).

Should we allow adding new nodes?

It sounds like the whole point of a tree-based weighted distribution is to support mutation, so yes. As a result I would expect the API may need to differ a little from other weighted distributions.

xmakro commented 9 months ago

No worries, hope you had a good new years.

Sounds good. I'll work on a PR targeting rand_distr. And I'll raise an error at sampling time then and allow adding nodes.

Perhaps we could add a link to the tree-based implementation in the documentation of rand::distributions::WeightedIndex::update_weights, so that people can find it if they have the use case.

I saw that rand_distr was last published 2 years ago. Would we be able to release a new version in the foreseeable future?

xmakro commented 9 months ago

Please take a look at https://github.com/rust-random/rand/pull/1372 which should be as discussed.

dhardy commented 9 months ago

Perhaps we could add a link to the tree-based implementation in the documentation of rand::distributions::WeightedIndex::update_weights, so that people can find it if they have the use case.

Please do.

xmakro commented 9 months ago

Updated the docs and also added overflow handling. PTAL when you get a chance.

rust-random / rand

Interest in a weighted index based on balanced trees? #1053

Background

Feature request