krlmlr / wrswoR

A package with different implementations of weighted random sampling without replacement in R
https://krlmlr.github.io/wrswoR
17 stars 2 forks source link

Add sampling metod based on Gumbel-Max trick? #6

Open vgherard opened 3 years ago

vgherard commented 3 years ago

Hi Kirill,

I've implemented an algorithm, call it gsample(), for sampling w/o replacement using the "Gumbel-Max trick" (more or less along the ideas presented in Sections 2.3 and 2.4 of this paper).

I tried to benchmark gsample() against the functions in wrswoR and it appears to be competitive with some of them, both from the space and time point of view (it beats ccrank(), expjs() and rej() in several benchmark points, but is probably always outperformed by, e.g., expj()).

I was wondering whether you might be interested in adding this algorithm to wrswoR, in which case I could provide a pull request trying to adapt my present code (which is in Rcpp). Actually, if you want to have a look, here's the repo.

Cheers,

Valerio.

vgherard commented 3 years ago

Looking a little bit closer, Gumbel-Max is simply a reformulation of the _rank() algorithm in log-probability space.

I'm surprised to see that the R version is so fast (and, lovely, to be sure!). This is probably due to inefficient C++ code, since the R version has time complexity O(n log(n) + size), whereas my C++ version has complexity O(n + size log(n))).

krlmlr commented 3 years ago

Thanks. This is the first time I hear the name "Gumbel-Max", at the very least it's worth mentioning in the DESCRIPTION and across the documentation. I don't remember taking the logarithm twice, though.

I'm not spending much time here these days. I'm more than happy to give you access to the repository so that you can make the changes.

I have a draft paper that discusses performance in Section 5.

vgherard commented 3 years ago

Thank you, I will be definitely happy to add some references in the documentation!

The "Gumbel-Max" uses the same algorithm as the *_rank() functions (except perhaps that I'm using partial sort, as you explain in the paper, while at least the R function does total sort). It uses g[k] = log(prob[k]) - log ( - log(unif(0,1) ) as order variable, the -log(-log(U)) are standard Gumbel variables. Thus, on a second thought, I'm not sure it is worth adding another function, I need to look a little bit closer to the C++ implementations.

Oh, thanks for the draft, it is interesting.

Valerio