reconhub / covid19hub

Community-driven COVID-19 analytics in R
58 stars 2 forks source link

Package: fit discretised distributions to match expected distributional properties #4

Open thibautjombart opened 4 years ago

thibautjombart commented 4 years ago

Package: fit discretised distributions to match expected distributional properties

Description

Disclaimer: if this exists already, please feel free to point it out here.

What we need: fit a discretised distribution (distcrete object) so that it matches user-specified properties e.g. quantiles (typically the median, and 25% / 75% quantiles)

The package should include documentation, simple reproducible examples, unit tests, and use good practices and standards recommended by RECON or similar.

Specs (option 1)

This version would be slightly easier to use, but more constrained as we explicitely force fitting using quantiles. The method for calculating distances from targets is likely pre-programmed in the function, so only a few options will be available (likely: least square, absolute values, etc).

Specs (option 2)

This version would be more flexible targets and fitting are passed as user-specified functions.

Relevant related packages

Starting point

The following code generates a discretised weibull as a distcrete object, generates a sample of 1e5 values, and calculates the sum of squared distances from reference quantiles:


## define targets
target_quantiles <- c(.25, .5, .75)
target_values <- c(3, 5, 7)

## make distcrete object
x <- distcrete::distcrete("weibull", shape = 2, scale = 4, w = 0.5, interval = 1)

## check quantiles
x_quantiles <- quantile(x$r(1e5), probs = target_quantiles)
sum_square <- sum((x_quantiles - target_values)^2)

Further tweaks will be putting this insided an optimiser. The interface should allow accessing all arguments of distcrete::distcrete in addition to arguments used to specify targets and fitting

Impact

Many epidemiologically relevant delays are shared in the literature as means, medians, IQR, 95%CI etc. However, it can be very tricky to turn these into discretised distributions which we can then use for forecasting, estimating transmissibility etc. A simple utilitary function implementing this would be a great help for this.

Proposed timeline

Focal point

@thibautjombart

samclifford commented 4 years ago

I'm not sure that least squares is the best measure of goodness of fit. The documentation of fitdistrplus::fitdist() indicates that goodness of fit measures tend not to be appropriate for discrete distributions. Is it possible to write out a d, p and q function for the discredited gamma? If so, this could perhap fit into the fitdistrplus framework.

thibautjombart commented 4 years ago

Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :)

Good point re fitdistrplus. distcrete objects are lists with $d $r $q $p components.

johnurbanik commented 4 years ago

I'd second just adding some scaffolding around fitdistrplus... it's a pretty solidly designed package.

lguillier commented 4 years ago

The rriskDistributions might also be related to this issue . This package proposes several build get."distr".par functions that will return the parameters of the "distr" distribution where the pth percentiles match with the quantiles q. https://cran.r-project.org/web/packages/rriskDistributions/index.html

znmeb commented 4 years ago

Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :)

Do you have or could you get the raw data? There are a couple of approaches you could use if so. See https://cran.rstudio.com/web/packages/PDQutils/index.html and https://cran.r-project.org/web/packages/gld/.

thibautjombart commented 4 years ago

Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :)

Do you have or could you get the raw data? There are a couple of approaches you could use if so. See https://cran.rstudio.com/web/packages/PDQutils/index.html and https://cran.r-project.org/web/packages/gld/.

Nope, this project really is for cases where we don't have data, only some distributional summaries.

thibautjombart commented 4 years ago

@samclifford @johnurbanik I thought fitdistrplus was only implementing procedures for fitting distributions to data, which would be a different use-case. I may not be up to date. Is there a specific function to create distributions matching distributional summaries you can point to?

thibautjombart commented 4 years ago

The rriskDistributions might also be related to this issue . This package proposes several build get."distr".par functions that will return the parameters of the "distr" distribution where the pth percentiles match with the quantiles q. https://cran.r-project.org/web/packages/rriskDistributions/index.html

Yes, this is very close. I have just given it a try and it seems to work very nicely. Only caveat is the discretisation may actually change results a bit, so I am not sure if the optimal parametrisation to match the continuous distribution remains optimal to match against the discrete version. I'd guess it is not in the general case, but should be pretty close?

Example below for a GAmma with median 7 and IQR 3-11:

library(rriskDistributions)
library(distcrete)

## get parameters of a Gamma with median 7, IQR 3-11
pars <- get.gamma.par(p = c(.25,  .5, .75), q = c(3, 7, 11), plot = FALSE)
#> The fitting procedure 'L-BFGS-B' was successful!
#> $par
#> [1] 1.3647791 0.1615613
#> 
#> $value
#> [1] 0.0001973927
#> 
#> $counts
#> function gradient 
#>       42       42 
#> 
#> $convergence
#> [1] 0
#> 
#> $message
#> [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
pars
#>     shape      rate 
#> 1.3647791 0.1615613

## check that it atcually works
samp_cont <- rgamma(1e5, shape = pars["shape"], rate = pars["rate"])
summary(samp_cont)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#>  0.00099  3.16772  6.51986  8.46700 11.65060 76.12839

## discretise the distribution
x <- distcrete("gamma", shape = pars["shape"], rate = pars["rate"], interval = 1)
samp_disc <- x$r(1e5)
summary(samp_disc)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   0.000   3.000   7.000   8.468  12.000  92.000

Created on 2020-04-14 by the reprex package (v0.3.0)

thibautjombart commented 4 years ago

Still, given how far rriskDistributions gets us, I think this project will make a more meaningful contribution if we go for option 2, i.e. a more general-purpose package to create distcrete objects matching any summary function.

thibautjombart commented 4 years ago

This could be a starting point for option 2: https://github.com/thibautjombart/misc/blob/master/R/distcreate.R