r-devel / r-project-sprint-2023

Material for the R project sprint
https://contributor.r-project.org/r-project-sprint-2023/
17 stars 3 forks source link

Add project on enhancing base sample for uneq. prob. sampling #24

Closed dickoa closed 1 year ago

dickoa commented 1 year ago

Adding the project to work on base::sample.int and make it more friendly for survey statisticians who want to use it for unequal probability sampling.

mmaechler commented 1 year ago

I don't have the time to do extensive research about this and R core member @tslumley (creator and maintainer of the important {survey} package is also participating at the R sprint, so he will probably be the contact person. One remark: The big question is indeed if it is not good enough to keep providing the functionality via CRAN package {sampling} or if some of the alternatives of the current uneq.prob.sampling algorithms should really be in base R's stats package, which I think was the wish of Prof. Tillé originally (when he told me about the problem personally several years ago, but we did not follow up properly).

tslumley commented 1 year ago

I think it's a good idea to add at least one method of (marginal) unequal probability sampling without replacement. I don't know whether adding the computation of pairwise probabilities is important, and that's more complicated.

dickoa commented 1 year ago

Thank you, @tslumley and @mmaechler, for your insights. I agree that adding this option to sample.int would be beneficial for many users, and particularly people working on survey sampling. As @tslumley mentioned, a method for unequal probability sampling based on first order inclusion probability should suffice for most needs.

MetaEntropy commented 1 year ago

In my opinion, the current buggy method should be documented AND should issue a warning message. Indeed, somebody may assume that R procedures are statistically correct by default without checking the documentation. Adding a correct procedure would be great.

skolenik commented 1 year ago

@tslumley few people besides Tille bother with pairwise probabilities. It's an important concept to understand for professional samplers, but is pretty much impractical.

@dickoa I stirred the pot with opening a dplyr issue. I had to read Tille's book to implement this stuff on my actual sampling projects for the U.S. government surveys, and I had implemented the unequal probability sampling from scratch in Stata (hint: the base Stata does not do it at all, either, and the existing user contributions miss the mark just like base::sample() does.) Would be happy to track the progress of this project and provide some QC.

The outline of the code is:

  1. ingest the input / target probabilities;
  2. compute the draw-by-draw probabilities via numeric optimization (Chen 2000) -- these probabilities usually end up being shrunk towards n/N compared to the target probabilities.
  3. implement a max entropy sampling procedure to draw the sample using those draw by draw probabilities.

Tille (2006) book lists five max entropy / conditional Poisson procedures that scale differently for the target size n. For instance, the actual conditional Poisson, as in a literal implementation of selecting the individual cases with their optimized draw-by-draw probabilities, ends up with a sample of random size, and you keep repeating that sampling until you have exactly n units selected. The number of times you have to try it is negative binomial with the success probability of Prob[Poisson(rate==n)==n] = n^n * exp(-n) / n! ~ 1/sqrt(n) using a normal approximation for large Poisson rate. Survey statisticians typically work with very low n (2 PSUs per stratum come to mind), so the number of samples drawn for that design is about 1. R users will likely want to abuse the code and draw samples of n=1e6 out of N=1e9, so the number of samples drawn and discarded is 1e3. Users would have to be prepared to wait some while for the code to run. (No free lunch -- if you want good statistical properties, something gotta give.) Other procedures may not produce as much waste, but they require recomputing the selection probabilities at each step for(i in 1:n), and some of the algorithms that require less compute require n times N memory objects instead.

I suspect that the q-q plots in https://contributor.r-project.org/r-project-sprint-2023/projects/enhance-sample/ are missing the fact that the sampled inclusion indicators are negatively correlated. How it is supposed to change the q-q plots, I can't quite tell without a proper derivation, but it would not be the N(0,1) line.

skolenik commented 1 year ago

In the mean time, documenting the shortcomings of base::sample() and base::sample.int(), promising that it will issue a loud warning in a later release (which can be done once per session or once in 24 hours or whatever the occasional warning mechanism is), is a good first step.