v0.1.3 remaining - Githubissues

vincenzocoia commented 3 years ago

@toast682 I refined the to-do list we were looking at today. Here's what's relevant:

Poisson distribution:

[x] @toast682 Use pois instead of poisson to match with base R.
[x] @toast682 Tests for the Poisson distribution.

Baseline functions:

[x] @toast682 Throw an error when trying to evaluate pmf of a non-discrete random variable, and density of non-continuous. Use the variable() function. This should be for the eval_pmf.dst() and eval_density.dst() methods.
[ ] @vincenzocoia Fix the quantile function algorithm (as computed from the cdf):
- [ ] First, check where the solution lies with respect to discontinuities.
- [ ] Second, if the inverse does not fall directly on a discontinuity, implement a Newton-Raphson algorithm to find the inverse.

Graft distributions:

[x] Need to reprogram change their definition via graft_left() and graft_right().
[x] Then, need to change how .graft() methods operate (or ensure the existing ones work).

More insight about graft_left() and graft_right():

Perhaps class should be: c("graft", "dst"). The object should store (1) the two distributions, (2) the cutoff point at which the distribution will be grafted, and (3) some indication as to whether a distribution has been grafted to the left or the right (or perhaps allow for both possibilities to be stored). As with mix(), grafting two finite distributions should result in a finite distribution. But, unlike mix(), if a graft distribution is further grafted, there is no simplification that can happen around the grafting of the original distributions.

What is a graft distribution?

Imagine taking a cdf, and cutting off its right tail (after some specified point). Replace the cut-out section with another cdf (cdf2), just rescaling cdf2 vertically so that it connects with the original cdf. (We only use cdf2 beyond the cutoff point). This is a right-graft; similar concept applies to a left-graft. Useful for extreme value models.

vincenzocoia commented 3 years ago

Demonstrations of distplyr

Here are a few use cases that should be demonstrated in the paper.

1. Predictive Distributions

Machine learning models tend to produce a single value, typically a mean, to use as a prediction. With distplyr, we can easily produce a predictive distribution. By way of demonstration, consider a kernel smoothing method with one predictor -- although the concept extends to multiple predictors and other machine learning methods.

Here is a predictive distribution of a penguin's flipper length if we know its bill length is 30mm. A Gaussian kernel with standard deviation of 2.5mm is used. Here's the cdf and a 90% prediction interval:

library(palmerpenguins)
library(distplyr)
yhat <- dst_empirical(flipper_length_mm, data = penguins, weights = dnorm(bill_length_mm - 30, sd = 2.5))
plot(yhat, "cdf")

eval_quantile(yhat, at = c(0.05, 0.95))
#> [1] 178 198

^{Created on 2021-07-02 by the reprex package (v0.3.0)}

TO DO:

Perhaps use a different dataset that isn't clustered by species like this one is. There should be a decent amount of data available (not like mtcars).
To further demonstrate the point, it might be worthwhile comparing a prediction at a different X value. But otherwise, this is enough of a demonstration here.
Later, if it makes sense with the application, I'd like to graft a GPD onto the upper tail.

2. Predictive Distributions as a secondary model

I've encountered a situation where a company's first priority was to predict median house price as best as they could, so they fit a machine learning model that predicted the median. By way of demonstration, we can do the same thing, but just with a simple machine learning model, like kNN (which works by taking the k nearest neighbours to some x value of interest, and then taking the median y value). We don't have to worry about optimizing k, because it's just a demo.

TO DO (this part does not require distplyr):

Find a dataset to make this prediction. Even one X variable is enough to demonstrate concepts. Note that the Y value should be positive (see next).
Fit a kNN model, predicting the median. Plot the resulting smooth curve through the data representing the median predictions. Don't formally optimize the hyperparameter k -- just make sure the choice results in a reasonably smooth curve through the "middle" of the data. We'll also need to make sure we have predictions for each point in the dataset (for the next step).

As a secondary goal, the company wanted a prediction interval, and it was determined that fitting a lognormal distribution for Y (given X) was desirable. Although distplyr currently isn't fully up to this task, it can handle a simple version of this task, in the case where the second parameter of the lognormal distribution ("log variance") is constant -- we can calculate the variance of the residuals on the log scale. Here's how:

TO DO:

Calculate the residuals on the log scale: take the log of Y, then subtract the log of the median for each observation to get residuals.
Ideally, the spread/variance of these residuals stays pretty stable across X. If so, just ignore X and calculate the variance of these log-scale residuals -- this is the second "log variance" parameter of the lognormal distribution we're after.
Now to form a predictive distribution at a particular X value, we just fill in the two parameters of the lognormal distribution: (1) the log mean equals the log of the median prediction; (2) the log variance is the constant calculated in the previous step.
To actually show these predictive distributions, it would be nice to plot a few of them somehow, maybe somehow on top of a scatterplot of the data. Perhaps also using geom_ribbon() to plot a 90% prediction interval.

3. Predictive distribution from a GLM (like Poisson Regression)

There's a technique called Poisson regression, useful whenever the Y variable is a count variable. It fits the mean of Y as an exponential function of X, while also assuming that the distribution of Y (given X) is a Poisson distribution. To demonstrate a predictive distribution here, we'd just have to use a Poisson regression model to make predictions of the mean, and plug those mean predictions into dst_pois().

TO DO:

It might be enough to use the example of Poisson regression that's given in the documentation of the glm() function:
```
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
data.frame(treatment, outcome, counts) # showing data
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
```
We'd just have to call the predict() function with type = "response" as an argument to get mean predictions, which can then go into dst_pois(). Again, demonstrate maybe two predictive distributions (maybe just by plotting their PMF's).

Others

Some other ideas:

Maybe graft a GPD onto the predictive distribution tails, in one of these examples that predicts a finite distribution (maybe Example 1).
Maybe demonstrate binning a numeric X variable, by mix()ing the predictive distributions contained in the bin being predicted on. (I encountered this in a house price prediction problem: training data had exact X values -- something like square footage or list price -- but sometimes only a "bin" is available for prediction, such as "square footage between 1000 and 2000").

vincenzocoia commented 3 years ago

enframing multiple distributions

I've come across wanting to enframe many distributions at once, but the suite of enframe_ functions only allow one distribution.

Here's an idea: enframe(d1, d2, d3, at = 1:10), making the first argument of the enframe_ functions ....

NOTE: allow for !!! input by first quoting: ellipsis <- rlang::quos(...); then evaluating: rlang::eval_tidy(ellipsis).

vincenzocoia commented 2 years ago

This thread has become a dumping ground, but also, a lot of the things on here have been taken care of.

Demonstration has been moved to the "other-rmd" folder temporarily.
The quantile algorithm task has been moved to an issue in distionary.

vincenzocoia / distplyr

v0.1.3 remaining #11

Poisson distribution:

Baseline functions:

Graft distributions:

Demonstrations of distplyr

1. Predictive Distributions

2. Predictive Distributions as a secondary model

3. Predictive distribution from a GLM (like Poisson Regression)

Others

enframing multiple distributions