r-devel / r-dev-day

Repo to organize tasks for R Dev Days
7 stars 2 forks source link

Prepare unequal sampling probability code for inclusion in base R #38

Open hturner opened 3 months ago

hturner commented 3 months ago

This follows from a project at R Project Sprint 2023, the project page gives and overview of the issue and a link to an open access article for background reading: Enhancing sample.int for unequal probability sampling.

After R Project Sprint 2023 @dickoa wrapped up the prototype code into a package: https://github.com/dickoa/sondage.

@tslumley would like to pick this work up again, he says "I think the design is basically done, so it's a matter of integrating it into R".

This will likely involve preparing a patch to the relevant R, C and Rd (help) files and adding tests.

There are a few options for working collaborative on a patch:

  1. Make changes on a fork of the r-svn GitHub mirror and add collaborators to the fork and/or reviewers to a PR back to the r-svn mirror.
  2. Create the patch in the R Dev Container and use live share for collaboration.

As @tslumley will be remote at R Dev Day @ Hutch, and it is not confirmed if @dickoa can attend, the first approach is probably better in this case, allowing asynchronous collaboration. It is a bit trickier though - contributors will need to be able to build R on their own machine and there are some changes you need to make to the usual workflow:

dickoa commented 3 months ago

Unfortunately, I won't be able to be in Seattle had to cancel my trip last minute because of an emergency at work, but I'm really interested in continuing working on this with @tslumley and other persons interested in helping.

nzgwynn commented 3 months ago

Notes from @tslumley for Hutch Dev Day

The idea of adding a method= argument to sample_int() and sample() with the default being the current behaviour and the new argument being last should prevent most breakage of existing code

We might want to change the name of the method= argument to make it clearer that it only applies to unequal probability sampling without replacement. One possibility is prob_method=, which at least shows that it's only for unequal probability sampling.

One thing that does need fixing is the use of malloc and calloc -- these should use the R heap so that the memory is still freed if the computation is interrupted in R.

double *filtered_pik = malloc(count * sizeof(double));
int *original_idx = malloc(count * sizeof(int));
int *sb = calloc(count, sizeof(int));
double *p = malloc(count * sizeof(double));

https://cran.r-project.org/doc/manuals/R-exts.html#Memory-allocation describes using R_alloc. If that isn't possible, R_Calloc/ R_Free is preferable for automatically giving R-level errors if memory allocation fails

nzgwynn commented 3 months ago

Notes from @dickoah for Hutch Dev Day:

At the C level, this is probably where should add the new algorithm: https://github.com/wch/r-source/blob/trunk/src/main/random.c

The hardest part of this work is the design, and how to correct the issue without introducing breaking changes. The sondage package reflects some of our discussions and design decisions.

nzgwynn commented 3 months ago

@tslumley talk with his Lumliness...

R code --- Update sample_int() docs in .Rd ---

@dickoa sample_int() R code needs to update sample_int() in R. Lines 106, 111, needs to call the .Internal() code not sample_int().

Line 108 - should be calling sample_int() with, first arg as is, second arg length of first, replace = FALSE, prob = NULL, useHash = FALSE...

Once built microbench() to see how much we've slowed it down... may make some difference.

sequential method is what we currently have, so shouldn't break code

Change sample_int() or sample()... arguments added to sample() as well, will need match.arg()

Should go to: src/library/baseR and should not be exported.

utils - lines up to 18 nested in sample_pps() function

C code ---

added to: src/library/main/random.c

what do they need to have done to them so they are callable with .Call() what else is callable with .Call()?

nzgwynn commented 3 months ago

May have to leave sample_int() and put all new coding in sample() b/c everything in base function are visible and we don't want to write documentation for everything... like up_brewer() and Martin wants all functions to have documentation.

sample.pps() name of new function that can be called separately as another function.

nzgwynn commented 3 months ago

Must cite Yves Tilles and Alina sample package as used their code: https://cran.r-project.org/web/packages/sampling/index.html Also Brewer 1975:

Title A SIMPLE PROCEDURE FOR SAMPLING πpswor Author(s) Brewer, Kenneth E. W. ISSN: 0004-9581 DOI: 10.1111/j.1467-842X.1975.tb00954.x

nzgwynn commented 3 months ago

Write new example to demonstrate the purpose --- small one to show that Poisson doesn't give the requested result. Marginal does it correctly, but slowly.

tslumley commented 3 months ago

Ok. It appears that for the 'base' package we can't use .Call and have to use .Internal (and then register the C functions in src/main/names.c). It would be tider then to move more of sample_pps to C so that only one new C function needs to be added

hturner commented 2 months ago

gwynn and I reviewed this in the R Contributor Office Hour yesterday and it took a while for us to get our heads round it again. So I have written some notes and a TODO list on this HackMD.

hturner commented 1 month ago

@njtierney and I worked on this when he visited after RSECon24. We transferred the R code, C code and documentation from {sondage} to the unequal-prob-sampling branch of my fork of the r-svn repo, validated the code worked as expected, then did some refactoring of the R code.

The next step is to start refactoring the C code.

dickoa commented 1 month ago

This is great @hturner and @njtierney!