Open marcusklaas opened 4 years ago
I'm not sure how significant the performance gains world be, but in principle such an alternative implementation could live in rand_distr
. There is a precedent: we already have a WeightedIndex
implementation employing the alias method there.
update_weights
is not appropriate for your use-case?
Well, that doesn't handle insertions/deletions. One way of getting around that might be to allow dummy entries with weight zero and over-allocate initially, periodically re-allocating, and tracking a "free list". Getting good performance may require a little more work but it's not obvious which would perform best.
Also note that due to floating-point inaccuracies you might want to periodically re-calculate the cumulative weights anyway.
Sorry, I'm forgetting that one still needs to adjust all cumulative weights after the modified entry, which makes that API especially bad for random updates. Yes, a tree based structure storing the sub-tree total in each node could be a decent alternative, if you don't want to just create a new WeightedIndex
for each sample (O(n^2)
, assuming the number of samples is proportional to the number of entries).
I've recently implemented the 'tree based structure storing the sub-tree total in each node' outlined by @dhardy, and would be interested in upstreaming it.
The implementation is very straight forward. The tree is stored as a flat vector, and the value of each node is defined as follows:
value(i) = weight(i) + value(2*i+1) * value(2*i+2)
For sampling an index, we sample a target_weight
uniformly between between 0
and total_weight of the entire tree
. Then we start walking the tree from the root and repeatedly descend to the child until we reach the node index that corresponds to the target_weight
.
The runtimes of the operations are:
O(n)
O(log n)
O(log n)
Would a PR for this be welcome? I'm happy to work through any discussions.
Open questions are:
WeightedIndex
raises an error before the update. I actually found this inconvenient and would prefer if we either switch to uniform sampling or raise an error on sampling.Sorry for not answering sooner.
Would a PR for this be welcome? I'm happy to work through any discussions.
Yes, I think so — but I'm not so sure where. We already have rand::distributions::WeightedIndex
and rand_distr::weighted_alias::WeightedAliasIndex
. Probably the only reason to have WeightedIndex
in rand
is for use by rand::seq::SliceRandom
so likely to rand_distr
. @vks?
What should happen when the total sum of weights in the tree is zero?
For usage with a fixed distribution of weights it surely makes sense to error as soon as possible (i.e. at construction time). For usage with mutating weights, perhaps there is utility in not raising an error until sampling time. I don't think we should just switch to linear sampling (there may be arguments around rounding errors, but if an item is added with weight=0 I would expect it to never be sampled).
Should we allow adding new nodes?
It sounds like the whole point of a tree-based weighted distribution is to support mutation, so yes. As a result I would expect the API may need to differ a little from other weighted distributions.
No worries, hope you had a good new years.
Sounds good. I'll work on a PR targeting rand_distr
. And I'll raise an error at sampling time then and allow adding nodes.
Perhaps we could add a link to the tree-based implementation in the documentation of rand::distributions::WeightedIndex::update_weights
, so that people can find it if they have the use case.
I saw that rand_distr
was last published 2 years ago. Would we be able to release a new version in the foreseeable future?
Please take a look at https://github.com/rust-random/rand/pull/1372 which should be as discussed.
Perhaps we could add a link to the tree-based implementation in the documentation of rand::distributions::WeightedIndex::update_weights, so that people can find it if they have the use case.
Please do.
Updated the docs and also added overflow handling. PTAL when you get a chance.
Background
What is your motivation? The current
WeightedIndex
is pretty good. Lookups perform very well. And while it's possible to do updates, they are potentially slow as the (average) complexity is O(n). For matchmaking systems where updates are about as frequent as samples, this may be better served by an underlying datastructure which offers quick updates.What type of application is this? (E.g. cryptography, game, numerical simulation) Real-time matchmaking
Feature request
If I'm not mistaken, balanced tree structures like AVL trees or red-black trees enable O(log(n)) operations (search/ sample, update, insert, delete). The constant coefficient will obviously be worse than an implementation using a vector, but for large n, it should pay off.
Is this something that could have a place in rand? I'd be willing to take a shot at it if so.