Closed vincenzocoia closed 2 years ago
One last thing that might be tricky with Idea 2: we'd also need a way to specify what the discrete values are in the distributions. More specifically, how to "navigate" them with functions like next_discrete() and prev_discrete().
One rudimentary idea: specify it in an argument, like .discretes. Options would be specifying each and every discrete value, like .discretes = 0:10 (say, for a binomial distribution with n = 10), or .discretes = "natural" for natural numbers. Then, depending on what's specified, we can have pre-defined discrete value navigators like next_discrete_finite() and next_discrete_natural(), as we have now. This idea does not sound user-friendly, though, nor robust.
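To make the .discretes idea concrete, here is a minimal sketch. The signatures below are assumptions for illustration only and are not distplyr's actual navigators; "natural" is taken to mean 0, 1, 2, ...

```r
# Hypothetical helpers: these signatures are assumptions, not distplyr's API.

# Next discrete value strictly above each x, given an explicit finite support.
next_discrete_finite <- function(x, discretes) {
  discretes <- sort(unique(discretes))
  vapply(x, function(xi) {
    above <- discretes[discretes > xi]
    if (length(above) == 0) NA_real_ else as.numeric(above[1])
  }, numeric(1))
}

# Next discrete value strictly above each x when the support is 0, 1, 2, ...
next_discrete_natural <- function(x) {
  pmax(floor(x) + 1, 0)
}

# Choose a navigator based on what was passed as `.discretes`.
next_discrete <- function(x, .discretes) {
  if (identical(.discretes, "natural")) {
    next_discrete_natural(x)
  } else {
    next_discrete_finite(x, .discretes)
  }
}

next_discrete(c(-1, 2.5, 10), .discretes = 0:10)       # 0 3 NA
next_discrete(c(-1, 2.5, 10), .discretes = "natural")  # 0 3 11
```

The finite case returns NA past the largest value, which is one possible convention among several.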
One new challenge with the new structure is that distplyr verbs sometimes have specific methods for specific types of parametric distributions.
Example: flip(dst_norm(0, 1)) dispatches flip.norm(), which just negates the mean, returning another normal distribution.
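A rough sketch of that kind of S3 dispatch is below. The dst_norm() here is a minimal stand-in written for this example (its second argument is assumed to be the variance), not the package's constructor.

```r
# Minimal stand-in for a distribution object; not the real dst_norm().
dst_norm <- function(mean, variance) {
  structure(list(mean = mean, variance = variance), class = c("norm", "dst"))
}

flip <- function(distribution) UseMethod("flip")

# Fallback for distributions without a special method: just wrap the original.
flip.dst <- function(distribution) {
  structure(list(original = distribution), class = c("flipped", "dst"))
}

# Normal-specific method: if X ~ N(mean, variance), then -X ~ N(-mean, variance).
flip.norm <- function(distribution) {
  dst_norm(-distribution$mean, distribution$variance)
}

flipped <- flip(dst_norm(2, 1))
flipped$mean  # -2
```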
Should we have an is_* function for each type of parametric distribution? I'm thinking not -- that would be a lot of is_ functions to program, and these is_ functions won't exist if a user creates their own parametric distribution. Perhaps instead of is_norm(), we can just have is_parametric(object, type = "norm"). Plus, I don't think these functions would be used very much, anyway.
@yelselmiao @zhuzp98 The next task is to populate the parametric distributions found within R.
Distributions to add: I think it would make sense to include all distributions that "come with" R -- if you think this is excessive (for example, maybe some distributions are too "niche" to be useful), let's have that conversation. Probably the best way to find a list of these distributions is to type q and then press tab -- not many functions start with q, except R's distributions do.
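As a programmatic alternative to tab completion, something like the following could list the candidates. It assumes a distribution "comes with" R when the stats package exports all four of its q, p, d, and r functions.

```r
# All names exported by the stats package.
exports <- getNamespaceExports("stats")

# Strip the leading "q" from every export that starts with it.
candidates <- sub("^q", "", grep("^q", exports, value = TRUE))

# Keep only names whose p, d, and r counterparts are exported too.
has_all <- vapply(candidates, function(nm) {
  all(paste0(c("p", "d", "r"), nm) %in% exports)
}, logical(1))

found <- sort(unname(candidates[has_all]))
found
```

The filter drops names like quantile, which start with q but are not distribution functions.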
The workflow would be:
- Make a new entry in the .quantities list, named according to its name in R (example: rweibull, dweibull, etc would translate to the name "weibull").
rlang::exprs(
mean = FILL_THIS_IN,
median = FILL_THIS_IN,
variance = FILL_THIS_IN,
skewness = FILL_THIS_IN,
kurtosis_exc = FILL_THIS_IN,
range = c(FILL_THIS_IN, FILL_THIS_IN),
evi = FILL_THIS_IN
)
dst_<name> function for that distribution, and put it in its own file, with the same name as the function name.
dst_parametric.
Can you also add the generalized extreme value distribution to the list? It doesn't "come with" R, so it requires more work. Its name should be "gev".
Use {} to write code spanning multiple lines, like I did with the "gpd" distribution (which also does not "come with" R).
representations-dst_gpd.R. Like the GPD, I suggest including the cdf, density, and quantile function -- although you'll have to derive the quantile function by inverting the cdf.
dst_ functions for each entry of .quantities
For example, binom:
dst_binom <- ...
Perhaps just look at other examples, like dst_unif(), to get an idea of how to make such a function. General structure:
dst_parametric() function.
We will next need to check that the formulas were entered correctly. We can automate this. If you have time and are up for a challenge, give this a try.
The idea is to check each quantity against its manual calculation using the distributional representations (like quantile function, cdf, etc.).
I'm thinking something like this:
- Loop along each distribution in .quantities.
- Get the "formula version" of the quantity, by executing the function directly on an example distribution (such as mu1 <- mean(distribution)).
- Calculate each quantity present for that distribution using the distributional representations, by accessing the .dst method. So, mu2 <- mean.dst(distribution).
- Compare the two: expect_equal(mu1, mu2). Hopefully numerical precision isn't a problem here.
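The loop above might be sketched as follows in base R. This is only an illustration: the quantities list is a small stand-in for .quantities (using quote() rather than rlang::exprs()), the example parameters are arbitrary, and the "manual calculation" integrates the quantile function rather than calling the package's .dst methods.

```r
# Stand-in for `.quantities`: formulas stored as unevaluated expressions.
quantities <- list(
  unif = list(mean = quote((min + max) / 2),
              variance = quote((max - min)^2 / 12)),
  beta = list(mean = quote(shape1 / (shape1 + shape2)),
              variance = quote(shape1 * shape2 /
                                 ((shape1 + shape2)^2 * (shape1 + shape2 + 1))))
)

# Example parameters for each distribution (arbitrary choices).
examples <- list(unif = list(min = 1, max = 4),
                 beta = list(shape1 = 2, shape2 = 3))

# E[g(X)] computed from the quantile function Q: integral of g(Q(p)) over (0, 1).
moment_from_quantile <- function(name, params, g) {
  qfun <- get(paste0("q", name), mode = "function")
  integrand <- function(p) g(do.call(qfun, c(list(p), params)))
  integrate(integrand, lower = 0, upper = 1)$value
}

for (name in names(quantities)) {
  params <- examples[[name]]
  mu1 <- eval(quantities[[name]]$mean, params)       # formula version
  mu2 <- moment_from_quantile(name, params, identity) # numerical version
  var1 <- eval(quantities[[name]]$variance, params)
  var2 <- moment_from_quantile(name, params, function(x) (x - mu2)^2)
  # A loose tolerance allows for numerical integration error.
  stopifnot(isTRUE(all.equal(mu1, mu2, tolerance = 1e-3)),
            isTRUE(all.equal(var1, var2, tolerance = 1e-3)))
}
```

In a test suite, the stopifnot() calls would presumably become expect_equal() with a suitable tolerance.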
For distributions that we had to code manually, like the GPD and GEV, we should also check that the representations are coded correctly. So, if we coded the cdf, density, and quantile function, we could do this by:
- eval_cdf(distribution, eval_quantile(distribution, 1:9/10)) returns 1:9/10 (shows that cdf and quantile function align); and
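For the GEV, the cdf/quantile round trip could be sketched as below. The function names pgev() and qgev() and the location/scale/shape parameterisation are assumptions for this example. The quantile function comes from solving F(x) = p: with z = (x - location) / scale, the cdf is exp(-(1 + shape * z)^(-1 / shape)) for nonzero shape, so z = ((-log p)^(-shape) - 1) / shape; for shape 0 the cdf is exp(-exp(-z)), so z = -log(-log p).

```r
# Hypothetical GEV cdf; `shape` is a single number, `scale` must be positive.
pgev <- function(q, location, scale, shape) {
  z <- (q - location) / scale
  if (shape == 0) {
    exp(-exp(-z))
  } else {
    t <- pmax(1 + shape * z, 0)  # clamp to 0 outside the support
    exp(-t^(-1 / shape))
  }
}

# Hypothetical GEV quantile function, obtained by inverting the cdf above.
qgev <- function(p, location, scale, shape) {
  if (shape == 0) {
    location - scale * log(-log(p))
  } else {
    location + scale * ((-log(p))^(-shape) - 1) / shape
  }
}

# Round trip: the cdf evaluated at the quantiles should return the probabilities.
p <- 1:9 / 10
for (shape in c(-0.5, 0, 0.5)) {
  x <- qgev(p, location = 1, scale = 2, shape = shape)
  stopifnot(isTRUE(all.equal(pgev(x, location = 1, scale = 2, shape = shape), p)))
}
```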
- Make a new entry in the .quantities list, named according to its name in R (example: rweibull, dweibull, etc would translate to the name "weibull").
Added the following distribution:
I am not sure about the EVIs, and some of the ranges and medians
@zhuzp98
Excellent! I couldn't find Laplace and Fatigue life in R. Do you think these are important distributions to include in the package? If so, we should talk about to what extent we want to populate specific parametric distributions, as opposed to letting someone load a package, giving them access to a distribution's r, p, d, q functions, and accessing it through dst_parametric().
(I made this comment in another pull request, but I think it goes better here).
Do you think these are important distributions to include in the package?
I was referring to the distribution gallery here
An error should be thrown right away when someone tries to make a parametric distribution that does not exist. For example:
dst_parametric("foobar", location = 1)
...should throw an error.
Idea: use the exists() function to see whether or not pfoobar, qfoobar, and dfoobar all exist. If not, throw an informative error message.
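A minimal sketch of such a check, assuming a standalone helper (the name check_distribution_exists() and the wording of the message are made up for this example and are not taken from the package):

```r
# Hypothetical helper: error early if any of the p/q/d functions is missing.
check_distribution_exists <- function(name) {
  needed <- paste0(c("p", "q", "d"), name)
  found <- vapply(needed, function(f) exists(f, mode = "function"), logical(1))
  if (!all(found)) {
    stop("Cannot find function(s) ", paste(needed[!found], collapse = ", "),
         " for distribution \"", name, "\".", call. = FALSE)
  }
  invisible(TRUE)
}

check_distribution_exists("norm")  # passes silently
result <- tryCatch(check_distribution_exists("foobar"),
                   error = function(e) conditionMessage(e))
result
```

Naming the missing functions in the message tells the user which of the three pieces still needs to be defined or loaded.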
Important: changes should be made in the distionary package now!
.quantities
Named in R by:
beta
binom
cauchy
chisq
exp
f
gamma
geom
hyper
lnorm
nbinom # parameterized by size and prob only, not by mu.
norm
pois
signrank
t
unif
weibull
wilcox
Not included in R:
gev # Generalized Extreme Value Distribution
gpd # Generalized Pareto Distribution
bernoulli
Commits are made in distionary.
Let's continue this discussion in the issue in distionary of the same name.
Throwing an error when no such distribution exists
An error should be thrown right away when someone tries to make a parametric distribution that does not exist. For example:
dst_parametric("foobar", location = 1)
...should throw an error.
Idea: use the exists() function to see whether or not pfoobar, qfoobar, and dfoobar all exist. If not, throw an informative error message.
SOLVED:
dst_parametric("foobar", location = 1)
Error in dst_parametric("foobar", location = 1) :
This distribution is not available
Closing this Issue (see the eponymous Issue in distionary for an explanation).
I've deliberately been delaying creating a collection of parametric distributions to be made available in distplyr, so that package evolution/design could be more nimble, and in case we came up with an idea for automating such a thing. This issue elaborates on the latter: somehow automating their creation. We've discussed two ideas so far:
Idea 1: brute force creation of files
The idea is to make one or more R scripts per distribution that's made available through R (like geom, gamma, beta, binom, etc.), containing all the functions needed for each distribution.
Problem: requires brute force and lots more lines of code to test; does not address the issue of allowing a user to add their own parametric distribution.
Ideas for execution:
FILL_THIS_IN everywhere that needs a manual substitution.
I'm not digging this option.
Idea 2: make a function that gives access to a distribution
The idea is that, if a user has r, p, q, d functions available for a distribution (say "unif"), then they would be able to access it as a distplyr distribution by doing something like:
It (or related functions) would then do the work of appending the appropriate prefix whenever the user calls representations like eval_cdf(): perhaps eval_cdf.parametric() is dispatched, which grabs the "unif", appends p as a prefix, plugs in the parameters, and evaluates at the at argument.
If we do have this functionality, we should use it as developers, too. It would cut down our code dramatically. For example, we could code up dst_unif like so:
There's one problem remaining: there are simple formulas for quantities like the mean, variance, range, etc. We also need to specify if the distribution is continuous or discrete somehow. I'm thinking we can have functions like set_*() to specify these. So, maybe something like this:
distplyr::dst_unif does not exist):
distplyr::dst_unif():
One big problem with this is that these additional pieces of information would have to be stored with the distribution itself, and we're trying to keep the things stored by a distribution at a minimum. Perhaps storing these things is OK on the user-facing side, but it's not when it comes to the developer-facing side of things.
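The prefix-appending mechanism described for Idea 2 could look roughly like this. Everything here is a sketch under assumptions: the object layout, the argument names, and the dst_unif() wrapper are illustrative and not the package's implementation.

```r
# Hypothetical constructor: stores only the name and the parameters.
dst_parametric <- function(.name, ...) {
  structure(list(name = .name, params = list(...)),
            class = c("parametric", "dst"))
}

eval_cdf <- function(distribution, at) UseMethod("eval_cdf")

# Prepend "p" to the stored name, look the function up, and call it
# with `at` followed by the stored parameters.
eval_cdf.parametric <- function(distribution, at) {
  pfun <- get(paste0("p", distribution$name), mode = "function")
  do.call(pfun, c(list(at), distribution$params))
}

# A developer-side wrapper then becomes a one-liner.
dst_unif <- function(min = 0, max = 1) {
  dst_parametric("unif", min = min, max = max)
}

eval_cdf(dst_unif(0, 4), at = c(1, 2))  # 0.25 0.50
```

The same pattern would presumably apply to the density and quantile function with the d and q prefixes.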
Where would this information be stored for developers including a distribution like "unif"? Currently, we're storing this info in a verbose way: as specific methods (like mean.unif()), which have the overhead of needing to wrap the expression (like (min + max) / 2) with a bunch of function verbiage. An alternative? Maybe a JSON file, or simply a text file, containing information about these quantities, which could then be read by the specific method: mean.parametric() could look for the appropriate JSON/text file (for, say, "unif") and grab the relevant piece. If we go this route, we wouldn't use the set_*() functions as developers. This "data file" might look something like:
This approach is sounding far more robust and trustworthy than Idea 1. It would be great if we could get something to work here.
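As a rough illustration of the lookup idea, here is a sketch that uses an in-memory R list as a stand-in for the JSON/text "data file" (reading real JSON would need a parser, which is left out). The eval_quantity() helper and the object layout are assumptions for this example.

```r
# Stand-in for the "data file": one entry per distribution,
# one unevaluated expression per quantity.
quantities <- list(
  unif = list(mean = quote((min + max) / 2),
              variance = quote((max - min)^2 / 12),
              range = quote(c(min, max)))
)

# Hypothetical lookup: evaluate a stored expression using the
# distribution's own parameters as variables.
eval_quantity <- function(distribution, quantity) {
  entry <- quantities[[distribution$name]][[quantity]]
  if (is.null(entry)) {
    stop("No formula stored for '", quantity, "'.", call. = FALSE)
  }
  eval(entry, distribution$params)
}

# A single method serves every parametric distribution.
mean.parametric <- function(x, ...) eval_quantity(x, "mean")

d <- structure(list(name = "unif", params = list(min = 0, max = 4)),
               class = c("parametric", "dst"))
mean(d)                    # 2
eval_quantity(d, "range")  # 0 4
```

With this layout, adding a distribution means adding one list entry instead of writing one method per quantity.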