vincenzocoia commented 3 years ago

I've deliberately been delaying creating a collection of parametric distributions to be made available in distplyr, so that package evolution/design could be more nimble, and in case we came up with an idea for automating such a thing. This issue elaborates on the latter: somehow automating their creation. We've discussed two ideas so far:

Idea 1: brute force creation of files

The idea is to make one or more R scripts per distribution that's made available through R (like geom, gamma, beta, binom, etc.), containing all the functions needed for each distribution.

Problem: requires brute force and lots more lines of code to test; does not address the issue of allowing a user to add their own parametric distribution.

Ideas for execution:

Make template R files that can be copied for each distribution, with something like FILL_THIS_IN everywhere that needs a manual substitution.
Make a function (not unlike those found in the usethis package) that creates said R files, but with everything filled in.
Because this option would explode the number of files and functions in distplyr, perhaps store these distributions in a separate package, called something like "distionary".

I'm not digging this option.

Idea 2: make a function that gives access to a distribution

The idea is that, if a user has r, p, q, d functions available for a distribution (say "unif"), then they would be able to access it as a distplyr distribution by doing something like:

d <- dst_parametric("unif", min = 0, max = 1)

It (or related functions) would then do the work of appending the appropriate prefix whenever the user calls representations like eval_cdf(): perhaps eval_cdf.parametric() is dispatched, which grabs the "unif", appends p as a prefix, plugs in the parameters, and evaluates at the at argument.
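A minimal sketch of how that dispatch could work, assuming the distribution object stores nothing but its name and parameters. The internals of `dst_parametric()` and the field names `name` and `params` are assumptions made for illustration, not distplyr's actual implementation:

```r
# Hypothetical constructor: stores only the name and the parameters,
# and keeps the name as a subclass so S3 dispatch can use it.
dst_parametric <- function(.name, ...) {
  structure(
    list(name = .name, params = list(...)),
    class = c(.name, "parametric", "dst")
  )
}

eval_cdf <- function(distribution, at) UseMethod("eval_cdf")

# Build the function name by prefixing "p", look it up, and call it with
# the evaluation points followed by the stored parameters.
eval_cdf.parametric <- function(distribution, at) {
  p_fun <- get(paste0("p", distribution$name), mode = "function")
  do.call(p_fun, c(list(at), distribution$params))
}

d <- dst_parametric("unif", min = 0, max = 1)
cdf_values <- eval_cdf(d, at = c(0.25, 0.5))
```

The same pattern would apply to the other representations by swapping the prefix ("d", "q", "r").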

If we do have this functionality, we should use it as developers, too. It would cut down our code dramatically. For example, we could code up dst_unif like so:

dst_unif <- function(min, max) dst_parametric("unif", min = min, max = max)

There's one problem remaining: the r, p, q, d functions don't carry the simple formulas that exist for quantities like the mean, variance, range, etc. We also need to specify somehow whether the distribution is continuous or discrete. I'm thinking we can have functions like set_*() to specify these. So, maybe something like this:

User-facing creation of a Uniform distribution (in the hypothetical situation that distplyr::dst_unif does not exist):

d <- dst_parametric("unif", min = 0, max = 1) %>%
    set_mean((min + max) / 2) %>%
    set_range(min, max)
...etc

Developer-facing creation of a Uniform distribution for use as distplyr::dst_unif():

dst_unif <- function(min, max) {
    dst_parametric("unif", min = min, max = max) %>%
        set_mean((min + max) / 2) %>%
        set_range(min, max)
}

One big problem with this is that these additional pieces of information would have to be stored with the distribution itself, and we're trying to keep the things stored by a distribution at a minimum. Perhaps storing these things is OK on the user-facing side, but it's not when it comes to the developer-facing side of things.

Where would this information be stored for developers including a distribution like "unif"? Currently, we're storing this info in a verbose way: as specific methods (like mean.unif()), which have the overhead of needing to wrap the expression (like (min + max) / 2) with a bunch of function verbiage. An alternative? Maybe a JSON file, or simply a text file, containing information about these quantities, which could then be read by the specific method: mean.parametric() could look for the appropriate JSON/text file (for, say, "unif") and grab the relevant piece. If we go this route, we wouldn't use the set_*() functions as developers. This "data file" might look something like:

mean: (min + max) / 2
range: min, max
...

This approach is sounding far more robust and trustworthy than Idea 1. It would be great if we could get something to work here.
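To make the "data file" idea concrete, here is a rough sketch that keeps the formulas as unevaluated R expressions in a list and evaluates them against the stored parameters. The list layout and the helper `eval_quantity()` are hypothetical, and base R's `quote()` is used so the sketch has no dependencies:

```r
# Hypothetical lookup table: one entry per distribution, each holding
# unevaluated expressions written in terms of the parameter names.
.quantities <- list(
  unif = list(
    mean  = quote((min + max) / 2),
    range = quote(c(min, max))
  )
)

# Evaluate a stored expression using the parameters as the data.
# Names not found among the parameters (like `+` or `c`) resolve in base R.
eval_quantity <- function(name, quantity, params) {
  expr <- .quantities[[name]][[quantity]]
  if (is.null(expr)) {
    stop("No formula stored for '", quantity, "' of '", name, "'.")
  }
  eval(expr, params, baseenv())
}

unif_mean  <- eval_quantity("unif", "mean", list(min = 0, max = 4))
unif_range <- eval_quantity("unif", "range", list(min = 0, max = 4))
```

A method like mean.parametric() could then be a thin wrapper around such a lookup, so no per-distribution method is needed.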

vincenzocoia commented 3 years ago

One last thing that might be tricky with Idea 2: we'd also need a way to specify what the discrete values are in the distributions. More specifically, how to "navigate" them with functions like next_discrete() and prev_discrete().

One rudimentary idea: specify it in an argument, like .discretes. Options would be specifying each and every discrete value, like .discretes = 0:10 (say, for a binomial distribution with n = 10), or .discretes = "natural" for natural numbers. Then, depending on what's specified, we can have pre-defined discrete value navigators like next_discrete_finite() and next_discrete_natural() as we have now. This idea does not sound user-friendly, though, nor robust.
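As a rough illustration of the `.discretes` idea, the sketch below picks a navigator based on what was supplied. The function signatures are assumptions (they are not taken from the existing distplyr code), and "natural" is assumed here to mean 0, 1, 2, ...:

```r
# Hypothetical navigator for a finite, explicitly listed set of values.
next_discrete_finite <- function(from, discretes) {
  candidates <- discretes[discretes > from]
  if (length(candidates) == 0) NA_real_ else min(candidates)
}

# Hypothetical navigator for the natural numbers, assumed to start at 0.
next_discrete_natural <- function(from) {
  max(floor(from) + 1, 0)
}

# Choose the navigator from the value of `.discretes`.
next_discrete <- function(from, .discretes) {
  if (identical(.discretes, "natural")) {
    next_discrete_natural(from)
  } else {
    next_discrete_finite(from, .discretes)
  }
}

a <- next_discrete(2.5, .discretes = 0:10)      # next value in 0:10 above 2.5
b <- next_discrete(10, .discretes = 0:10)       # nothing above 10: NA
n <- next_discrete(-3, .discretes = "natural")  # first natural number
```

The branching on a string versus a vector is exactly the part that feels fragile, which matches the concern raised above.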

vincenzocoia commented 3 years ago

One new challenge with the new structure is that distplyr verbs sometimes have specific methods for specific types of parametric distributions.

Example: flip(dst_norm(0, 1)) dispatches flip.norm(), which just negates the mean, returning another normal distribution.

Should we keep a subclass? Probably yes, even for user-defined distributions (in case someone wants to add S3 methods to a special distribution of theirs).
Should we have an is_* function for each type of parametric distribution? I'm thinking not -- that would be a lot of is_ functions to program, and these is_ functions won't exist if a user creates their own parametric distribution. Perhaps instead of is_norm(), we can just have is_parametric(object, type = "norm"). Plus, I don't think these functions would be used very much, anyway.
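Both points can be sketched together: the distribution name is kept as a subclass so that methods like flip.norm() still dispatch, and a single predicate replaces the family of is_* functions. The constructor internals below are a stand-in written for this example, not the package's actual code:

```r
# Stand-in constructor (hypothetical): the name doubles as a subclass.
dst_parametric <- function(.name, ...) {
  structure(
    list(name = .name, params = list(...)),
    class = c(.name, "parametric", "dst")
  )
}

# One predicate for every parametric family, including user-defined ones.
is_parametric <- function(object, type = NULL) {
  if (!inherits(object, "parametric")) return(FALSE)
  is.null(type) || identical(object$name, type)
}

# The subclass still allows family-specific methods.
flip <- function(distribution) UseMethod("flip")
flip.norm <- function(distribution) {
  distribution$params$mean <- -distribution$params$mean
  distribution
}

d <- dst_parametric("norm", mean = 2, sd = 1)
flipped <- flip(d)
```

Here `is_parametric(d, type = "norm")` is TRUE, `is_parametric(d, type = "unif")` is FALSE, and `flipped` is still a "norm" distribution with its mean negated.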

vincenzocoia commented 3 years ago

@yelselmiao @zhuzp98 The next task is to populate the parametric distributions found within R.

Distributions that come with R

Distributions to add: I think it would make sense to include all distributions that "come with" R -- if you think this is excessive (for example, maybe some distributions are too "niche" to be useful), let's have that conversation. Probably the best way to find a list of these distributions is to type q and then press tab: few functions start with q other than the quantile functions of R's distributions.

The workflow would be:

Make a new entry in the .quantities list, named according to its name in R (example: rweibull, dweibull, etc. would translate to the name "weibull").

Populate the distribution's quantities, using the below scaffold. Remove quantities that don't have a closed form, and comment out lines that you don't know (probably EVI, since that's often not given).

rlang::exprs(
    mean = FILL_THIS_IN,
    median = FILL_THIS_IN,
    variance = FILL_THIS_IN,
    skewness = FILL_THIS_IN,
    kurtosis_exc = FILL_THIS_IN,
    range = c(FILL_THIS_IN, FILL_THIS_IN),
    evi = FILL_THIS_IN
)

Make a new dst_<name> function for that distribution, and put it in its own file, with the same name as the function name.
- You should check that the inputted parameters are valid first, and throw an error if not.
- The last line should be a call to dst_parametric.

Extreme Value Distribution

Can you also add the generalized extreme value distribution to the list? It doesn't "come with" R, so it requires more work. Its name should be "gev".

Its moments depend on the shape parameter, meaning you'll have two or more cases. Use {} to write code spanning multiple lines, like I did with the "gpd" distribution (which also does not "come with" R).
Because the GEV doesn't come with R, you'll need to program its distributional representations. I had to do this for the GPD, which can be found at representations-dst_gpd.R. Like the GPD, I suggest including the cdf, density, and quantile function -- although you'll have to derive the quantile function by inverting the cdf.
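For reference, a sketch of what the GEV cdf and its inverse could look like. The function names `pgev` and `qgev` and the argument names are assumptions following R's p/q naming convention. The quantile function comes from solving p = F(x) for x: with z = (x - location) / scale, p = exp(-(1 + shape * z)^(-1 / shape)) gives z = ((-log(p))^(-shape) - 1) / shape, and the shape = 0 (Gumbel) case gives z = -log(-log(p)).

```r
# GEV cdf (scalar shape); shape = 0 is the Gumbel limit.
pgev <- function(q, location, scale, shape) {
  z <- (q - location) / scale
  if (shape == 0) {
    exp(-exp(-z))
  } else {
    # pmax() sends points outside the support to 0 (shape > 0) or 1 (shape < 0).
    exp(-pmax(1 + shape * z, 0)^(-1 / shape))
  }
}

# GEV quantile function, obtained by inverting the cdf above.
qgev <- function(p, location, scale, shape) {
  if (shape == 0) {
    location - scale * log(-log(p))
  } else {
    location + scale * ((-log(p))^(-shape) - 1) / shape
  }
}

# Round trip on a few probabilities for positive, negative, and zero shape.
p <- 1:9 / 10
round_trips <- sapply(c(0.3, -0.3, 0), function(shape) {
  isTRUE(all.equal(pgev(qgev(p, 1, 2, shape), 1, 2, shape), p))
})
```

The density would follow by differentiating the cdf, with the same case split on the shape parameter.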

`dst_` functions for each entry of `.quantities`

For example, binom:

dst_binom <- ...

Perhaps just look at other examples, like dst_unif(), to get an idea of how to make such a function. General structure:

Check that the parameters are valid.
Plug in the dst_parametric() function.
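As an illustration of that two-step structure, here is a sketch for the exponential distribution. The `dst_parametric()` defined here is only a stand-in so the example runs by itself, and the exact checks and error wording are assumptions:

```r
# Stand-in for the real constructor (hypothetical internals).
dst_parametric <- function(.name, ...) {
  structure(
    list(name = .name, params = list(...)),
    class = c(.name, "parametric", "dst")
  )
}

dst_exp <- function(rate) {
  # Step 1: validate the parameters and fail early.
  if (!is.numeric(rate) || length(rate) != 1 || is.na(rate) || rate <= 0) {
    stop("'rate' must be a single positive number.")
  }
  # Step 2: hand off to dst_parametric() as the last line.
  dst_parametric("exp", rate = rate)
}

d <- dst_exp(2)
bad <- try(dst_exp(-1), silent = TRUE)  # captured error, not a distribution
```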

Testing quantities

We will next need to check that the formulas were inputted correctly. We can automate this. If you have time and are up for a challenge, give this a try.

The idea is to check each quantity against its manual calculation using the distributional representations (like quantile function, cdf, etc.).

I'm thinking something like this:

Loop along each distribution in .quantities.
Get the "formula version" of the quantity by executing the function directly on an example distribution (such as mu1 <- mean(distribution)).
Calculate each quantity present for that distribution using the distributional representations, by accessing the .dst method. So, mu2 <- mean.dst(distribution).
Compare the two: expect_equal(mu1, mu2). Hopefully numerical precision isn't a problem here.
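A sketch of that loop for the mean, using base R only. The `cases` list is a hypothetical stand-in for `.quantities` plus example parameters, and the "manual" mean is computed here by numerically integrating x times the density rather than through the package's mean.dst():

```r
# Hypothetical test cases: example parameters, support, and the formula.
cases <- list(
  unif = list(params = list(min = 0, max = 4), range = c(0, 4),
              mean = quote((min + max) / 2)),
  exp  = list(params = list(rate = 2), range = c(0, Inf),
              mean = quote(1 / rate)),
  norm = list(params = list(mean = 1, sd = 3), range = c(-Inf, Inf),
              mean = quote(mean))
)

# Mean from the density representation: integral of x * f(x) over the support.
numeric_mean <- function(name, params, range) {
  d_fun <- get(paste0("d", name), mode = "function")
  integrand <- function(x) x * do.call(d_fun, c(list(x), params))
  integrate(integrand, lower = range[1], upper = range[2])$value
}

# Compare formula and numerical versions for every distribution.
agree <- vapply(names(cases), function(name) {
  case <- cases[[name]]
  mu1 <- eval(case$mean, case$params, baseenv())
  mu2 <- numeric_mean(name, case$params, case$range)
  isTRUE(all.equal(mu1, mu2, tolerance = 1e-3))
}, logical(1))
```

On the numerical precision worry: comparing with an explicit tolerance (testthat's expect_equal() also accepts a `tolerance` argument) avoids failures caused by quadrature error rather than by a wrong formula.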

Testing manually coded distributions, like GPD and GEV

For distributions that we had to code manually, like the GPD and GEV, we should also check that the representations are coded correctly. So, if we coded the cdf, density, and quantile function, we could do this by:

calculate the numerical derivative of the cdf at a bunch of points, and compare against the density (shows that cdf and density align);
check that eval_cdf(distribution, eval_quantile(distribution, 1:9/10)) returns 1:9/10 (shows that cdf and quantile function align); and
check that the density integrates to 1 (shows that the distribution itself is valid).
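The three checks could look roughly like this for a GPD with nonzero shape and location 0. The functions `pgpd`, `dgpd`, and `qgpd` are written from the standard GPD formulas for this sketch only and are not the code in representations-dst_gpd.R:

```r
# Hand-coded GPD representations (shape != 0, location = 0).
pgpd <- function(q, scale, shape) 1 - (1 + shape * q / scale)^(-1 / shape)
dgpd <- function(x, scale, shape) {
  (1 / scale) * (1 + shape * x / scale)^(-1 / shape - 1)
}
qgpd <- function(p, scale, shape) scale * ((1 - p)^(-shape) - 1) / shape

scale <- 2
shape <- 0.5

# 1. A central-difference derivative of the cdf should match the density.
x <- seq(0.5, 10, by = 0.5)
h <- 1e-5
num_deriv <- (pgpd(x + h, scale, shape) - pgpd(x - h, scale, shape)) / (2 * h)
check_density <- isTRUE(
  all.equal(num_deriv, dgpd(x, scale, shape), tolerance = 1e-6)
)

# 2. The cdf and quantile function should be inverses of each other.
p <- 1:9 / 10
check_quantile <- isTRUE(
  all.equal(pgpd(qgpd(p, scale, shape), scale, shape), p)
)

# 3. The density should integrate to 1 over the support.
total <- integrate(dgpd, lower = 0, upper = Inf,
                   scale = scale, shape = shape)$value
check_total <- isTRUE(all.equal(total, 1, tolerance = 1e-3))
```

Running the same checks over a small grid of parameter values (including negative shape, where the support is bounded above) would give more coverage.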

yelselmiao commented 3 years ago

Make a new entry in the .quantities list, named according to its name in R (example: rweibull, dweibull, etc. would translate to the name "weibull").

Added the following distributions:

binomial
geometric
exponential
Weibull
Gamma
Laplace
Fatigue life
Chi-square

I am not sure about the EVIs, or about some of the ranges and medians.

@zhuzp98

vincenzocoia commented 3 years ago

Excellent! I couldn't find Laplace and Fatigue life in R. Do you think these are important distributions to include in the package? If so, we should talk about to what extent we want to populate specific parametric distributions, as opposed to letting someone load a package, giving them access to a distribution's r, p, d, q functions, and accessing it through dst_parametric().

(I made this comment in another pull request, but I think it goes better here).

yelselmiao commented 3 years ago

Do you think these are important distributions to include in the package?

I was referring to the distribution gallery here

vincenzocoia commented 3 years ago

Throwing an error when no such distribution exists

An error should be thrown right away when someone tries to make a parametric distribution that does not exist. For example:

dst_parametric("foobar", location = 1)

...should throw an error.

Idea: use the exists() function to see whether or not pfoobar, qfoobar, and dfoobar all exist. If not, throw an informative error message.
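A possible shape for that check, using base R's exists() with mode = "function". The helper name `check_distribution_exists()` and the message wording are made up for illustration:

```r
# Hypothetical helper: error out if any of the p/q/d functions is missing.
check_distribution_exists <- function(name) {
  fun_names <- paste0(c("p", "q", "d"), name)
  found <- vapply(
    fun_names,
    function(f) exists(f, mode = "function"),
    logical(1)
  )
  if (!all(found)) {
    stop(
      "Can't find function(s) ", paste(fun_names[!found], collapse = ", "),
      " for distribution '", name, "'.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

ok  <- check_distribution_exists("unif")                        # passes silently
bad <- try(check_distribution_exists("foobar"), silent = TRUE)  # captured error
```

Calling this at the top of dst_parametric() would make the failure happen at construction time rather than when a representation is first evaluated.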

vincenzocoia commented 3 years ago

Important: changes should be made in the distionary package now!

vincenzocoia commented 3 years ago

Distributions to include in `.quantities`

Named in R by:

beta
binom
cauchy
chisq
exp
f
gamma
geom
hyper
lnorm
nbinom  # parameterized by size and prob only, not by mu.
norm
pois
signrank
t
unif
weibull
wilcox

Not included in R:

gev    # Generalized Extreme Value Distribution
gpd   # Generalized Pareto Distribution
bernoulli

yelselmiao commented 3 years ago

Task list for easy tracking

Commits are made in distionary.

Quantities_list

[x] beta
[x] binom
[x] cauchy
[x] chisq
[x] exp
[x] f
[x] gamma
[x] geom
[x] hyper
[x] lnorm
[x] nbinom # parameterized by size and prob only, not by mu.
[x] norm
[x] pois
[ ] signrank
[x] t
[x] unif
[x] weibull
[ ] wilcox
[x] gev # When shape = 0, the mean equals location + scale * $\gamma$. I am not sure how to deal with Euler's constant $\gamma$ here.
[x] gpd # Generalized Pareto Distribution
[x] bernoulli
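On the Euler's constant question in the gev entry: base R has no named constant for it, but it can be computed as `-digamma(1)`. A sketch of the GEV mean with its cases follows; the function name `gev_mean` and the argument names are assumptions, and the formulas are the standard ones for the GEV:

```r
# Euler's constant from base R's digamma(): gamma = -digamma(1).
euler_gamma <- -digamma(1)

# GEV mean, split by the shape parameter.
gev_mean <- function(location, scale, shape) {
  if (shape >= 1) {
    # The mean is infinite when shape >= 1.
    Inf
  } else if (shape == 0) {
    location + scale * euler_gamma
  } else {
    # gamma() here is base R's gamma function, not Euler's constant.
    location + scale * (gamma(1 - shape) - 1) / shape
  }
}

gumbel_mean <- gev_mean(location = 0, scale = 1, shape = 0)
```

Inside a `.quantities` entry, the same case split could be written within {} as suggested earlier in the thread.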

dst_ function for that distribution

[x] beta
[x] binom
[x] cauchy
[x] chisq
[x] exp
[x] f
[x] gamma
[x] geom
[x] hyper
[x] lnorm
[x] nbinom # parameterized by size and prob only, not by mu.
[x] norm
[x] pois
[ ] signrank
[x] t
[x] unif
[x] weibull
[ ] wilcox
[ ] gev # Generalized Extreme Value Distribution
[x] gpd # Generalized Pareto Distribution
[x] bernoulli

Representation dst_

[ ] gev # Generalized Extreme Value Distribution
[ ] gpd # Generalized Pareto Distribution
[ ] bernoulli

vincenzocoia commented 3 years ago

Let's continue this discussion in the issue in distionary of the same name.

yelselmiao commented 2 years ago

Throwing an error when no such distribution exists

An error should be thrown right away when someone tries to make a parametric distribution that does not exist. For example:
dst_parametric("foobar", location = 1)
...should throw an error.

Idea: use the exists() function to see whether or not pfoobar, qfoobar, and dfoobar all exist. If not, throw an informative error message.

SOLVED:

dst_parametric("foobar", location = 1)
Error in dst_parametric("foobar", location = 1) : This distribution is not available

vincenzocoia commented 2 years ago

Closing this Issue (see the eponymous Issue in distionary for an explanation).

vincenzocoia / distplyr

Suite of parametric distributions #16
