vincenzocoia / distplyr

Draw powerful insights using distributions with this R package.
https://distplyr.netlify.app/

Suite of parametric distributions #16

Closed vincenzocoia closed 2 years ago

vincenzocoia commented 3 years ago

I've deliberately been delaying creating a collection of parametric distributions to be made available in distplyr, so that package evolution/design could be more nimble, and in case we came up with an idea for automating such a thing. This issue elaborates on the latter: somehow automating their creation. We've discussed two ideas so far:

Idea 1: brute force creation of files

The idea is to make one or more R scripts per distribution that's made available through R (like geom, gamma, beta, binom, etc.), containing all the functions needed for each distribution.

Problem: requires brute force and lots more lines of code to test; does not address the issue of allowing a user to add their own parametric distribution.

Ideas for execution:

I'm not digging this option.

Idea 2: make a function that gives access to a distribution

The idea is that, if a user has r, p, q, d functions available for a distribution (say "unif"), then they would be able to access it as a distplyr distribution by doing something like:

d <- dst_parametric("unif", min = 0, max = 1)

It (or related functions) would then do the work of appending the appropriate prefix whenever the user calls representations like eval_cdf(): perhaps eval_cdf.parametric() is dispatched, which grabs the "unif", appends p as a prefix, plugs in the parameters, and evaluates at the at argument.
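A minimal base-R sketch of the prefix-appending mechanism described above. The names dst_parametric_sketch() and eval_cdf_sketch(), and the way the name and parameters are stored in a list, are illustrative assumptions rather than the package's actual implementation:

```r
# Hypothetical constructor: stores only the distribution's name and parameters.
dst_parametric_sketch <- function(.name, ...) {
  structure(
    list(name = .name, params = list(...)),
    class = c("parametric", "dst")
  )
}

# Hypothetical evaluator: prepend "p" to the stored name, look up that
# function, and call it at `at` with the stored parameters plugged in.
eval_cdf_sketch <- function(distribution, at) {
  cdf <- match.fun(paste0("p", distribution$name))
  do.call(cdf, c(list(at), distribution$params))
}

d <- dst_parametric_sketch("unif", min = 0, max = 2)
eval_cdf_sketch(d, at = c(0.5, 1))  # same call as punif(c(0.5, 1), min = 0, max = 2)
```

The same pattern would presumably cover the other representations by swapping the prefix ("d", "q", "r").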

If we do have this functionality, we should use it as developers, too. It would cut down our code dramatically. For example, we could code up dst_unif like so:

dst_unif <- function(min, max) dst_parametric("unif", min = min, max = max)

There's one problem remaining: there are simple formulas for quantities like the mean, variance, range, etc. We also need to specify if the distribution is continuous or discrete somehow. I'm thinking we can have functions like set_*() to specify these. So, maybe something like this:

d <- dst_parametric("unif", min = 0, max = 1) %>%
    set_mean((min + max) / 2) %>%
    set_range(min, max)
# ...etc

dst_unif <- function(min, max) {
    dst_parametric("unif", min = min, max = max) %>%
        set_mean((min + max) / 2) %>%
        set_range(min, max)
}

One big problem with this is that these additional pieces of information would have to be stored with the distribution itself, and we're trying to keep the things stored by a distribution at a minimum. Perhaps storing these things is OK on the user-facing side, but it's not when it comes to the developer-facing side of things.

Where would this information be stored for developers including a distribution like "unif"? Currently, we're storing this info in a verbose way: as specific methods (like mean.unif()), which have the overhead of needing to wrap the expression (like (min + max) / 2) with a bunch of function verbiage. An alternative? Maybe a JSON file, or simply a text file, containing information about these quantities, which could then be read by the specific method: mean.parametric() could look for the appropriate JSON/text file (for, say, "unif") and grab the relevant piece. If we go this route, we wouldn't use the set_*() functions as developers. This "data file" might look something like:

mean: (min + max) / 2
range: min, max
...
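One way the text-file idea could work, sketched in base R. The file contents are inlined as a character vector here, and the helper names parse_quantities_sketch() and eval_quantity_sketch() are hypothetical. The sketch also assumes each value is a single R expression, so the range is written as c(min, max) rather than "min, max":

```r
# Hypothetical contents of a data file for "unif": one "name: expression" per line.
spec_lines <- c(
  "mean: (min + max) / 2",
  "variance: (max - min)^2 / 12",
  "range: c(min, max)"
)

# Split each line at the first colon and parse the right-hand side
# into an unevaluated R expression.
parse_quantities_sketch <- function(lines) {
  keys <- trimws(sub(":.*$", "", lines))
  values <- trimws(sub("^[^:]*:", "", lines))
  exprs <- lapply(values, function(v) parse(text = v)[[1]])
  names(exprs) <- keys
  exprs
}

# Evaluate one stored quantity with the distribution's parameters plugged in.
eval_quantity_sketch <- function(exprs, quantity, params) {
  eval(exprs[[quantity]], envir = params, enclos = baseenv())
}

unif_quantities <- parse_quantities_sketch(spec_lines)
eval_quantity_sketch(unif_quantities, "mean", list(min = 0, max = 1))
```

A method like mean.parametric() could then look up the entry by distribution name and evaluate it against the stored parameters, with no per-distribution function wrapper.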

This approach is sounding far more robust and trustworthy than Idea 1. It would be great if we could get something to work here.

vincenzocoia commented 3 years ago

One last thing that might be tricky with Idea 2: we'd also need a way to specify what the discrete values are in the distributions. More specifically, how to "navigate" them with functions like next_discrete() and prev_discrete().

One rudimentary idea: specify it in an argument, like .discretes. Options would be specifying each and every discrete value, like .discretes = 0:10 (say, for a binomial distribution with n = 10), or .discretes = "natural" for natural numbers. Then, depending on what's specified, we can have pre-defined discrete value navigators like next_discrete_finite() and next_discrete_natural() as we have now. This idea does not sound user-friendly, though, nor robust.
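To make the two .discretes options concrete, here is a rough sketch of what the pre-defined navigators might look like. The function names carry a _sketch suffix because their signatures are guesses, "next" is assumed to mean the smallest discrete value strictly greater than the starting point, and the natural numbers are assumed to start at 0:

```r
# Navigator for an explicit, finite set of discrete values
# (e.g. .discretes = 0:10 for a binomial distribution with n = 10).
next_discrete_finite_sketch <- function(from, discretes) {
  candidates <- discretes[discretes > from]
  if (length(candidates) == 0) NA_real_ else min(candidates)
}

# Navigator for .discretes = "natural" (assumed here to be 0, 1, 2, ...).
next_discrete_natural_sketch <- function(from) {
  max(floor(from) + 1, 0)
}

next_discrete_finite_sketch(3, discretes = 0:10)
next_discrete_finite_sketch(10, discretes = 0:10)  # nothing beyond the last value
next_discrete_natural_sketch(2.5)
```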

vincenzocoia commented 3 years ago

One new challenge with the new structure is that distplyr verbs sometimes have specific methods for specific types of parametric distributions.

Example: flip(dst_norm(0, 1)) dispatches flip.norm(), which just negates the mean, returning another normal distribution.
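A small S3 sketch of why this matters: dispatch to a method like flip.norm() only happens if the object's class vector still carries the specific distribution name. The generic is called flip_sketch() here to avoid implying anything about the real flip(), and the object layout (a list with name and params) is an assumption:

```r
flip_sketch <- function(distribution) UseMethod("flip_sketch")

# Fallback for distributions with no specific method; the real verb would
# presumably build a generic flipped distribution instead of stopping.
flip_sketch.default <- function(distribution) {
  stop("No generic fallback in this sketch.")
}

# Specific method: flipping a normal distribution just negates its mean.
flip_sketch.norm <- function(distribution) {
  distribution$params$mean <- -distribution$params$mean
  distribution
}

# The class vector includes "norm" ahead of "parametric", so the
# specific method is found first.
d <- structure(
  list(name = "norm", params = list(mean = 2, sd = 1)),
  class = c("norm", "parametric", "dst")
)
flipped <- flip_sketch(d)
flipped$params$mean
```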

vincenzocoia commented 3 years ago

@yelselmiao @zhuzp98 The next task is to populate the parametric distributions found within R.

Distributions that come with R

Distributions to add: I think it would make sense to include all distributions that "come with" R -- if you think this is excessive (for example, maybe some distributions are too "niche" to be useful), let's have that conversation. Probably the best way to find a list of these distributions is to type q in the R console and then press Tab: few functions start with q other than the quantile functions of R's distributions (qnorm, qunif, etc.).
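The tab-completion search can also be done programmatically. This sketch lists the functions in the stats package that start with q and keeps only the names that also have d, p, and r counterparts; it assumes stats is attached, as it is in a default R session:

```r
stats_names <- ls("package:stats")

# Everything starting with "q", with the prefix stripped off.
q_funs <- grep("^q", stats_names, value = TRUE)
candidates <- sub("^q", "", q_funs)

# Keep a candidate only if d<name>, p<name>, and r<name> all exist as well.
has_all <- vapply(
  candidates,
  function(nm) all(paste0(c("d", "p", "r"), nm) %in% stats_names),
  logical(1)
)
dist_names <- candidates[has_all]
dist_names
```

Names such as "tukey" drop out because only some of the four functions exist for them.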

The workflow would be:

  1. Make a new entry in the .quantities list, named according to its name in R (example: rweibull, dweibull, etc would translate to the name "weibull").
  2. Populate the distribution's quantities, using the below scaffold. Remove quantities that don't have a closed form, and comment out lines that you don't know (probably EVI, since that's often not given).
    rlang::exprs(
        mean = FILL_THIS_IN,
        median = FILL_THIS_IN,
        variance = FILL_THIS_IN,
        skewness = FILL_THIS_IN,
        kurtosis_exc = FILL_THIS_IN,
        range = c(FILL_THIS_IN, FILL_THIS_IN),
        evi = FILL_THIS_IN
    )
  3. Make a new dst_<name> function for that distribution, and put it in its own file, with the same name as the function name.
    • You should check that the inputted parameters are valid first, and throw an error if not.
    • The last line should be a call to dst_parametric.
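As an illustration of step 2, here is what a filled-in entry might look like for the exponential distribution (named "exp" in R, parameterized by rate). To keep the sketch free of dependencies it uses base R's alist() as a stand-in for rlang::exprs(); both keep their arguments unevaluated. The variable name exp_quantities_sketch is hypothetical:

```r
# Filled-in scaffold for "exp"; every quantity has a closed form here.
exp_quantities_sketch <- alist(
  mean = 1 / rate,
  median = log(2) / rate,
  variance = 1 / rate^2,
  skewness = 2,
  kurtosis_exc = 6,
  range = c(0, Inf),
  evi = 0
)

# Plug in a parameter value and compare two entries against R's own functions.
params <- list(rate = 2)
eval(exp_quantities_sketch$median, params)  # should agree with qexp(0.5, rate = 2)
eval(exp_quantities_sketch$range, params)   # should agree with qexp(c(0, 1), rate = 2)
```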

Extreme Value Distribution

Can you also add the generalized extreme value distribution to the list? It doesn't "come with" R, so it requires more work. Its name should be "gev".

dst_ functions for each entry of .quantities

For example, binom:

dst_binom <- ...

Perhaps just look at other examples, like dst_unif(), to get an idea of how to make such a function. General structure:

  1. Check that the parameters are valid.
  2. Plug in the dst_parametric() function.
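A rough sketch of that two-step structure for binom. The exact validation rules and error messages are assumptions, and dst_parametric_stub() is a stand-in defined only so the example runs without the package; in the package the last line would call dst_parametric() itself:

```r
# Stand-in for dst_parametric(), so the sketch runs on its own.
dst_parametric_stub <- function(.name, ...) {
  structure(list(name = .name, params = list(...)), class = c("parametric", "dst"))
}

dst_binom_sketch <- function(size, prob) {
  # 1. Check that the parameters are valid.
  if (length(size) != 1 || !is.numeric(size) || is.na(size) ||
      size < 0 || size != round(size)) {
    stop("`size` must be a single non-negative whole number.")
  }
  if (length(prob) != 1 || !is.numeric(prob) || is.na(prob) ||
      prob < 0 || prob > 1) {
    stop("`prob` must be a single number between 0 and 1.")
  }
  # 2. Hand off to the (stubbed) parametric constructor.
  dst_parametric_stub("binom", size = size, prob = prob)
}

d <- dst_binom_sketch(size = 10, prob = 0.3)
```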

Testing quantities

We will next need to check that the formulas were inputted correctly. We can automate this. If you have time and are up for a challenge, give this a try.

The idea is to check each quantity against its manual calculation using the distributional representations (like quantile function, cdf, etc.).

I'm thinking something like this:

  1. Loop along each distribution in .quantities.

  2. Get the "formula version" of the quantity, by executing the function directly on an example distribution (such as mu1 <- mean(distribution)).

  3. Calculate each quantity present for that distribution using the distributional representations, by accessing the .dst method. So, mu2 <- mean.dst(distribution).

  4. Compare the two: expect_equal(mu1, mu2). Hopefully numerical precision isn't a problem here.
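The loop above could look roughly like this for the mean. This is a self-contained sketch: the two-entry quantities_sketch list, the example parameters, and compare_means_sketch() are all hypothetical, the "representation version" is computed by integrating x times the density rather than by calling mean.dst(), and base R's all.equal() stands in for expect_equal():

```r
quantities_sketch <- list(
  unif = alist(mean = (min + max) / 2),
  exp  = alist(mean = 1 / rate)
)
example_params <- list(
  unif = list(min = 0, max = 1),
  exp  = list(rate = 2)
)

compare_means_sketch <- function(name) {
  params <- example_params[[name]]
  # "Formula version" of the mean.
  mu1 <- eval(quantities_sketch[[name]]$mean, params)
  # "Representation version": integrate x * density over the support, with
  # the support endpoints taken from the quantile function at 0 and 1.
  support <- do.call(paste0("q", name), c(list(c(0, 1)), params))
  integrand <- function(x) x * do.call(paste0("d", name), c(list(x), params))
  mu2 <- integrate(integrand, support[1], support[2])$value
  c(formula = mu1, numerical = mu2)
}

results <- lapply(names(quantities_sketch), compare_means_sketch)
for (r in results) {
  stopifnot(isTRUE(all.equal(r[["formula"]], r[["numerical"]], tolerance = 1e-4)))
}
```

Passing an explicit tolerance is one way to keep numerical integration error from causing spurious failures.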

Testing manually coded distributions, like GPD and GEV

For distributions that we had to code manually, like the GPD and GEV, we should also check that the representations are coded correctly. So, if we coded the cdf, density, and quantile function, we could do this by:

  1. calculate numerical derivative of the cdf at a bunch of points, and compare against the density (shows that cdf and density align);
  2. check that eval_cdf(distribution, eval_quantile(distribution, 1:9/10)) returns 1:9/10 (shows that cdf and quantile function align); and
  3. check that the density integrates to 1 (shows that the distribution itself is valid).
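These three checks can be sketched on a hand-coded GPD. The functions below are written for this example only (location 0, shape > 0, evaluated at x >= 0) and are not the package's implementation; the step size and tolerances are arbitrary choices:

```r
# Hand-coded GPD representations, valid for shape > 0 and x >= 0.
gpd_cdf <- function(x, scale, shape) 1 - (1 + shape * x / scale)^(-1 / shape)
gpd_density <- function(x, scale, shape) {
  (1 / scale) * (1 + shape * x / scale)^(-1 / shape - 1)
}
gpd_quantile <- function(p, scale, shape) (scale / shape) * ((1 - p)^(-shape) - 1)

scale <- 1
shape <- 0.5

# 1. A central-difference derivative of the cdf should match the density.
x <- seq(0.1, 5, by = 0.1)
h <- 1e-5
num_deriv <- (gpd_cdf(x + h, scale, shape) - gpd_cdf(x - h, scale, shape)) / (2 * h)
check1 <- all.equal(num_deriv, gpd_density(x, scale, shape), tolerance = 1e-6)

# 2. The cdf evaluated at the quantiles should return the probabilities.
p <- 1:9 / 10
check2 <- all.equal(gpd_cdf(gpd_quantile(p, scale, shape), scale, shape), p)

# 3. The density should integrate to 1 over the support.
total <- integrate(gpd_density, 0, Inf, scale = scale, shape = shape)$value
check3 <- all.equal(total, 1, tolerance = 1e-4)
```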

yelselmiao commented 3 years ago

  1. Make a new entry in the .quantities list, named according to its name in R (example: rweibull, dweibull, etc would translate to the name "weibull").

Added the following distribution:

I am not sure about the EVIs, and some of the ranges and medians

@zhuzp98

vincenzocoia commented 3 years ago

Excellent! I couldn't find Laplace and Fatigue life in R. Do you think these are important distributions to include in the package? If so, we should talk about to what extent we want to populate specific parametric distributions, as opposed to letting someone load a package, giving them access to a distribution's r, p, d, q functions, and accessing it through dst_parametric().

(I made this comment in another pull request, but I think it goes better here).

yelselmiao commented 3 years ago

Do you think these are important distributions to include in the package?

I was referring to the distribution gallery here

vincenzocoia commented 3 years ago

Throwing an error when no such distribution exists

An error should be thrown right away when someone tries to make a parametric distribution that does not exist. For example:

dst_parametric("foobar", location = 1)

...should throw an error.

Idea: use the exists() function to see whether or not pfoobar, qfoobar, and dfoobar all exist. If not, throw an informative error message.
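A sketch of that exists() check in base R. The helper name check_distribution_exists_sketch() and the wording of the error message are assumptions; in the package this logic would presumably sit at the top of dst_parametric():

```r
check_distribution_exists_sketch <- function(name) {
  # Build the names of the cdf, quantile, and density functions.
  fns <- paste0(c("p", "q", "d"), name)
  found <- vapply(
    fns,
    function(f) exists(f, mode = "function"),
    logical(1)
  )
  if (!all(found)) {
    stop(
      "Distribution '", name, "' is not available; could not find: ",
      paste(fns[!found], collapse = ", "), "."
    )
  }
  invisible(TRUE)
}

check_distribution_exists_sketch("unif")  # passes silently
```

Because exists() searches the attached packages, a distribution becomes available as soon as a package providing its p, q, and d functions is loaded.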

vincenzocoia commented 3 years ago

Important: changes should be made in the distionary package now!

vincenzocoia commented 3 years ago

Distributions to include in .quantities

Named in R by:

beta
binom
cauchy
chisq
exp
f
gamma
geom
hyper
lnorm
nbinom  # parameterized by size and prob only, not by mu.
norm
pois
signrank
t
unif
weibull
wilcox

Not included in R:

gev        # Generalized Extreme Value Distribution
gpd        # Generalized Pareto Distribution
bernoulli

yelselmiao commented 3 years ago

Task list for easy tracking

Commits are made in distionary.

Quantities_list

dst_ function for that distribution

Representation dst_

vincenzocoia commented 3 years ago

Let's continue this discussion in the issue in distionary of the same name.

yelselmiao commented 2 years ago

Throwing an error when no such distribution exists

An error should be thrown right away when someone tries to make a parametric distribution that does not exist. For example:

dst_parametric("foobar", location = 1)

...should throw an error.

Idea: use the exists() function to see whether or not pfoobar, qfoobar, and dfoobar all exist. If not, throw an informative error message.

SOLVED:

dst_parametric("foobar", location = 1)
Error in dst_parametric("foobar", location = 1) : This distribution is not available

vincenzocoia commented 2 years ago

Closing this Issue (see the eponymous Issue in distionary for an explanation).