greta-dev / greta

simple and scalable statistical modelling in R
https://greta-stats.org

More model examples #44

Closed goldingn closed 6 years ago

goldingn commented 7 years ago

Especially hierarchical models, as in the BUGS examples and Stan's manual.

goldingn commented 7 years ago

In a vignette and/or on a pkgdown-generated website would be best

goldingn commented 7 years ago

Please comment here if there's a particular model or model type you think there should be an example for.

LVG77 commented 7 years ago

Hello Nick,

Can you please give me some guidance as to whether I can fit the following autoregressive time series Stan model with greta?

Here is the Stan code:

// Stochastic Unit Root

data {
  int<lower=1> T;
  vector[T] y;
}

parameters {
  real sigma_1;
  real sigma_2;

  vector[T] omega;
  real mu_omega;
  real phi;
}

transformed parameters {
  vector[T] rho;
  for (t in 1:T) {
    rho[t] <- exp(omega[t]);
  }
}

model {
  // priors
  mu_omega ~ normal(0, 1);
  phi ~ normal(0, 1);
  sigma_1 ~ beta(1, 1);
  sigma_2 ~ beta(1, 1);

  // initial conditions as latent data: diffuse prior
  omega[1] ~ normal(0, 1);

  for (t in 2:T) {
    y[t] ~ normal(rho[t] * y[t-1], sigma_1);
    omega[t] ~ normal(mu_omega + phi * (omega[t-1] - mu_omega), sigma_2);
  }
}

The example is taken from here: http://tharte.github.io/mbt/mbt.html#sec-4-0-2

Thank you for your help. Cheers, LG

dnbarron commented 7 years ago

It would be great to have examples that show how to implement BUGS and Stan examples. So far, my attempts haven't been very successful.

goldingn commented 7 years ago

Hi,

Thanks for posting this! This is the first time series model I've fitted in greta, so it was pretty useful for me to think about. You definitely want to implement it differently than you would in Stan, so I'll add some time series examples to the greta docs soon!

First up, here's a pure R function to generate fake data from the model:

# function to simulate fake data from the process, with known parameters
fake_data <- function (time = 100,
                       phi = 0,
                       mu_omega = 0,
                       sigma_1 = 0.5,
                       sigma_2 = 0.01) {

  # empty vectors
  omega <- y <- rep(0, time)

  # initial conditions
  omega[1] <- rnorm(1, mu_omega, 1)
  y[1] <- 0

  # iterate state
  for (t in 2:time) {
    omega[t] <- rnorm(1, mu_omega + phi * (omega[t-1] - mu_omega), sigma_2)
    y[t] <- rnorm(1, exp(omega[t]) * y[t-1], sigma_1)
  }

  # return time series
  y

}
set.seed(2017-05-03)
y <- fake_data()
plot(y, type = 'l')

[figure: simulated time series y from fake_data()]

Now we can fit the model using greta. Like R code (and unlike Stan code), greta is much more efficient when the code is vectorised, and for-loops slow things down considerably. So in the code below, I have firstly decentred the main ar1 process: I work with a zero-mean, unit-variance version, then add the mean and multiply by the standard deviation after the fact. Secondly, I state the ar1 process via its mean function (zeros) and covariance function representation (i.e. as a Gaussian process), rather than iterating through the process as in the Stan version. That means writing the covariance function as an R function, then sampling the ar1 process via the multivariate normal distribution. When I've implemented the Gaussian process module, specifying this sort of model should be much easier.

time <- length(y)

# matrix of absolute difference in time between observations
time_diff <- abs(outer(2:time, 2:time, `-`))

# covariance function of the ar1 process, with marginal variance parameter 1
ar1_cov <- function (phi, time_diff)
  (phi ^ time_diff) / (1 - phi ^ 2)
library(greta)

# define parameters and their priors
sigma_1 = beta(1, 1)
sigma_2 = beta(1, 1)
mu_omega = normal(0, 1)
phi = normal(0, 1)

# # iterate the process (zero-mean, variance 1, ar1 process)
# omega_raw <- zeros(time)
# omega_raw[1] = normal(0, 1)
# for (t in 2:time)
#   omega_raw[t] = normal(phi * omega_raw[t-1], 1)

# Gaussian process version
omega_raw = multivariate_normal(zeros(time - 1), ar1_cov(phi, time_diff))

# calculate the likelihood in vector fashion
rho <- exp(mu_omega + omega_raw * sigma_2)
likelihood(y[-1]) = normal(t(rho) * y[-time], sigma_1)
model <- define_model(mu_omega, phi, sigma_1, sigma_2)
draws <- mcmc(model, n_samples = 5000)
MCMCvis::MCMCplot(draws)

[figure: MCMCvis::MCMCplot summary of the posterior draws]
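
As a rough sanity check of the covariance-function representation (not part of the model fit, and with purely illustrative values), you can simulate a stationary, unit-innovation ar1 process in plain R and compare its empirical covariance with ar1_cov():

# Monte Carlo check that ar1_cov() matches the covariance of a stationary,
# unit-innovation-variance ar1 process
phi_true <- 0.6
len <- 5
sims <- replicate(5000, {
  x <- numeric(len)
  x[1] <- rnorm(1, 0, sqrt(1 / (1 - phi_true ^ 2)))  # draw from the stationary distribution
  for (t in 2:len)
    x[t] <- rnorm(1, phi_true * x[t - 1], 1)
  x
})
round(cov(t(sims)), 2)                                      # empirical covariance
round(ar1_cov(phi_true, abs(outer(1:len, 1:len, `-`))), 2)  # analytic covariance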

goldingn commented 7 years ago

@dnbarron I agree! I have plans to write some of them up. Are there any in particular I should have a go at first?

dnbarron commented 7 years ago

Any of the (relatively) simple hierarchical ones, like Seeds (https://github.com/stan-dev/example-models/tree/master/bugs_examples/vol1/seeds), would be great.

dnbarron commented 7 years ago

Actually, I think I've figured out the Seeds example:

library(coda)
library(greta)
#> 
#> Attaching package: 'greta'
#> The following objects are masked from 'package:stats':
#> 
#>     binomial, poisson
#> The following objects are masked from 'package:base':
#> 
#>     %*%, beta, diag, gamma, sweep

seeds <- list(r = c(10, 23, 23, 26, 17, 5, 53, 55, 32, 46, 10, 8, 10, 8, 23, 0, 3, 22, 15, 32, 3),
     n = c(39, 62, 81, 51, 39, 6, 74, 72, 51, 79, 13, 16, 30, 28, 45, 4, 12, 41, 30, 51, 7),
     x1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
     x2 = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
     N = 21)

x1 <- seeds$x1
x2 <- seeds$x2
x12 <- x1 * x2
r <- as_data(seeds$r)
n <- as_data(seeds$n)
x1 <- as_data(x1)
x2 <- as_data(x2)
x12 <- as_data(x12)
N <- seeds$N

alpha0 <- normal(0, 1000)
alpha1 <- normal(0, 1000)
alpha2 <- normal(0, 1000)
alpha12 <- normal(0, 1000)
sigma <- flat(c(0, 1000))

b <- normal(0, sigma, dim = N)
p <- ilogit(alpha0 + (alpha1 * x1) + (alpha2 * x2) + (alpha12 * x12) + b)

likelihood(r) <- binomial(n, p)
model <- define_model(alpha0, alpha1, alpha2, alpha12, sigma)

draws <- greta::mcmc(model, n_samples = 10000)
summary(draws)

#> 
#> Iterations = 1:10000
#> Thinning interval = 1 
#> Number of chains = 1 
#> Sample size per chain = 10000 
#> 
#> 1. Empirical mean and standard deviation for each variable,
#>    plus standard error of the mean:
#> 
#>            Mean     SD Naive SE Time-series SE
#> alpha0  -0.5524 0.1632 0.001632        0.01629
#> alpha1   0.1463 0.2792 0.002792        0.02806
#> alpha2   1.3358 0.2058 0.002058        0.01809
#> alpha12 -0.8114 0.3688 0.003688        0.04338
#> sigma    0.1527 0.2257 0.002257        0.03393
#> 
#> 2. Quantiles for each variable:
#> 
#>              2.5%       25%      50%     75%   97.5%
#> alpha0  -0.881855 -0.658773 -0.55795 -0.4489 -0.2126
#> alpha1  -0.430793 -0.035016  0.14785  0.3346  0.6871
#> alpha2   0.937344  1.201583  1.32940  1.4622  1.7536
#> alpha12 -1.571751 -1.046592 -0.79672 -0.5677 -0.1036
#> sigma    0.003485  0.009258  0.03873  0.2883  0.5501

goldingn commented 7 years ago

This is great, thanks for sharing!

One word of warning - I removed the flat() function in the most recent version of greta and merged its functionality with free(). So if you update greta, you'll need to change the line that uses flat() to:

sigma <- free(0, 1000)

Other than more examples, is there anything you think could be better explained in the greta README or documentation?

dnbarron commented 7 years ago

One thing I haven't been able to figure out is how to put constraints, like upper and lower bounds, on parameters (other than via flat). If that could be explained it would be very helpful.

goldingn commented 7 years ago

Sure thing, here's an attempt to explain:

For random variables with priors, the constraints are implicit in the prior. E.g. variables with a lognormal prior are always in (0, Inf). That's also how random variables are defined in BUGS/JAGS.

Like Stan (but unlike BUGS/JAGS), greta also allows you to specify random variables that don't have priors. Those are specified using free(). You can add constraints to these random variables like: free(lower=0), free(lower = 1.2, upper = 2.2) etc.

In the previous version of greta flat() created the constrained, prior-free variables and free() the unconstrained prior-free variables. It made sense to me to change that so that they all go in one function. flat() was also easy to confuse with a uniform distribution. I'm just about to add a uniform() distribution in the version of greta on the development branch.
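
To make that concrete, here's a minimal sketch of both approaches (illustrative bounds; free() is the API discussed above, and later versions use variable() instead):

library(greta)

# constraint implied by a prior: a lognormal prior keeps sigma in (0, Inf)
sigma <- lognormal(0, 1)

# explicit constraints on prior-free variables
tau <- free(lower = 0)                  # positive half-line
rho <- free(lower = 1.2, upper = 2.2)   # bounded interval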

pontusps commented 7 years ago

Issue

Hi, I was going to ask for help on how to specify a common hierarchical Gaussian regression model (a.k.a. random effects model or linear mixed effects model) in greta. But while writing down my issue I might have managed to solve it myself. However, I'm far from sure that the code is correct, so I thought I'd post it here to (1) see if it can be improved and (2) offer it in case you want to make an example out of it.

Objective

I want to model an intercept for each species in the iris dataset (3 levels) when performing a simple linear regression: Sepal.Length is predicted by Sepal.Width, while allowing a freely varying intercept for each Species level.

Code

In lme4 the code would look like this: lme4::lmer(data = iris, formula = (Sepal.Length ~ Sepal.Width + 1 + (1|Species)))

In brms (an R wrapper for Stan) the code would look like: brms::brm(data = iris, formula = (Sepal.Length ~ Sepal.Width + 1 + (1|Species)), family = gaussian()) (using some default priors specified by brms). The Stan code can be obtained by adding the argument save_model = 'StanCode.txt'.

In BUGS or JAGS code it would look something like:

#Data
#y <- iris$Sepal.Length
#x <- iris$Sepal.Width
#groups <- as.numeric(iris$Species) 
#ngroups <- length(unique(iris$Species)) # n species 
#n <- length(iris$Sepal.Length) # n observations

model {
  #Priors
  sigma_int~dunif(0,100) 
  lambda_int <- 1/(sigma_int^2) # Convert sigma into precision

  for (g in 1:ngroups) {
    alpha[g]~dnorm(0,lambda_int) # Random intercepts
  }

  grandalpha~dnorm(0,10) #grand intercepts (or hyperparameter for random intercepts)
  beta ~ dnorm(0,10) #coefficient
  sigma ~ dunif(0,100) # SD 
  lambda <- 1/(sigma^2) # convert sigma into precision

  # Likelihood
  for (i in 1:n) {
    mu[i] <- grandalpha + alpha[groups[i]] + beta*x[i] # linear model
    y[i] ~ dnorm(mu[i],lambda)  # likelihood function
  }
}

I think grandalpha could also be used as the mean parameter in the distribution of the random intercepts and removed from the linear model.

My attempt to do this in greta:

library(greta)

Species <- as.numeric(iris$Species)  # make Species into a numeric vector

# create variables with prior distributions
coefficient = normal(0, 10)
sd = free(0, 10)
intercept = normal(0, 10)

# create variables for random intercepts
sd_intc = free(0, 20)
spec_intercept = variable(dim = length(levels(iris$Species)))
distribution(spec_intercept) = normal(intercept, sd_intc)

# equation for mean sepal length
mean <- spec_intercept[Species] + coefficient * iris$Sepal.Width

# likelihood of the observed data
distribution(iris$Sepal.Length) = normal(mean, sd)

# model creation
m <- model(spec_intercept, sd_intc, intercept, coefficient, sd)

# sample from model
draws <- mcmc(m, n_samples = 2000, warmup = 100, thin = 4)

# plot fit and trace
MCMCvis::MCMCtrace(draws)

I'm far from sure that this is correct. greta does produce similar-looking posteriors to brms and JAGS, so I assume I'm doing something right. It does seem to generally have more drift in the traceplots, but maybe this can be fixed with thinning.

The next step for me is to add random slopes (spec_coefficient[Species] * iris$Sepal.Width) using the multivariate normal distribution, but I want to make sure the basics are correct first.
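
For reference, an uncorrelated version of those random slopes could be sketched like this, extending the code above (just a sketch; the multivariate-normal version would additionally need a prior on the intercept-slope covariance):

# independent random slopes per species (no intercept-slope correlation yet)
sd_coef = free(0, 10)
spec_coefficient = variable(dim = length(levels(iris$Species)))
distribution(spec_coefficient) = normal(coefficient, sd_coef)

# mean with varying intercepts and varying slopes
mean <- spec_intercept[Species] + spec_coefficient[Species] * iris$Sepal.Width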

goldingn commented 7 years ago

Wow, thanks for such a detailed post! Multilevel regression is definitely something I need to cover in the examples, so this is really helpful.

What you have there definitely makes sense! There are just two things you could change, though neither is critical...


This:

n_species <- length(levels(iris$Species))
spec_intercept = variable(dim = n_species)
distribution(spec_intercept) = normal(intercept, sd_intc)

is equivalent to:

spec_intercept = normal(intercept, sd_intc, dim = n_species)

Though the following, which is also equivalent, will sample more nicely in general:

spec_intercept = intercept + normal(0, 1, dim = n_species) * sd_intc

See here for an explanation of why that's the case, and here for my plans to handle this automatically in a future version of greta.


Your greta version uses a mix of Bayesian and frequentist parameters. free() and variable() both return a parameter without a prior (the frequentist way), so sd and sd_intc don't have priors, but intercept and coefficient do. Incidentally, free() is deprecated (it was confusingly similar to uniform()) and will disappear in v0.2; variable() does exactly the same thing.

Assuming you want a Bayesian analysis, there are various options for priors on those parameters. The ones in the BUGS model above are kind of extreme, and happen to work OK for Gibbs samplers (what BUGS/JAGS do) but not HMC (what Stan/greta do). I like lognormal(), and exponential() also makes sense. I wasn't able to install brms due to some dependency issue I'll have to solve another day, but it would be worth seeing what they do. The Stan folks like half-normals for these parameters. You can make one of those in greta like this:

sd = variable(lower = 0)
distribution(sd) = normal(0, 10)

(that's pretty clunky; I'll probably add a truncation option to these distributions in a future version so it becomes a one-liner like: normal(0, 10, lower = 0))

If you want a frequentist analysis, like in lme4, you could define the parameters without priors, like this:

sd_species = variable(lower = 0)
sd = variable(lower = 0)
intercept = variable()
coefficient = variable()

To put that all together, here's what I would do (including my new favourite trace plotting method):

# book-keeping
species <- as.numeric(iris$Species) 
n_species <- length(unique(iris$Species))

# data for greta
x <- as_data(iris$Sepal.Width)
y <- as_data(iris$Sepal.Length)

# random intercepts
sd_species = lognormal(0, 1)
species_intercept = normal(0, 1, dim = n_species) * sd_species

# global intercept and slope
intercept = normal(0, 10)
coefficient = normal(0, 10)

# observation uncertainty
sd = lognormal(0, 1)

# likelihood
mu <- intercept + species_intercept[species] + coefficient * x
distribution(y) = normal(mu, sd)

m <- model(species_intercept, sd_species, intercept, coefficient, sd)

draws <- mcmc(m, n_samples = 2000, warmup = 1000)
bayesplot::mcmc_trace(draws)

not perfect sampling, but passable:

[screenshot: MCMC trace plots from bayesplot::mcmc_trace(draws)]

In case you haven't checked it out yet, greta now does schematic plots of your model, which may help in interpreting the code:

plot(m)
[screenshot: schematic plot of the model from plot(m)]

goldingn commented 7 years ago

I think the poor sampling is due to the fact that this type of model is poorly identified with so few groups. You can see, by the fact that intercept is moving in the opposite direction to the species intercepts, that the model can't distinguish whether one level should be high and the other low, or vice-versa.

pontusps commented 7 years ago

Hey, thanks for the input! The intercept reparameterization really worked well when playing around with the iris and mtcars datasets. I also think plot(model) is awesome. 😃

I was (obviously) not really trying to model the iris data, but my own, freakishly large, dataset. I was having trouble getting sensible-looking chains for the intercept and the sd_intercept. This trick does not fully solve it for me, but I realized that setting good initial_values and decreasing epsilon was absolutely crucial in order to get my model to work (greta rejected all samples before doing this). Z-transforming the data also helped a lot. All parameters come out with acceptable-looking chains (after some heavy thinning) except the intercept (slow drift), with or without the reparameterization.

I'm not sure, but I think Paul Bürkner (@paul-buerkner) has a trick in his brms::brm() model (at least I'm guessing this from reading the help for set_prior and parsing through the generated Stan code) that deals differently with the grand intercept. It does seem to generally output a very good chain for both the intercept and the sd_intercept parameter. The code is a bit long to post here, but I'll see if I can figure out what he is doing and whether I can mimic it in greta. It's more than possible that it's exactly the same idea of using non-centered distributions and it's just me who doesn't get it. Gibbs also seems to produce nice chains for iris, but I don't think either JAGS or brms is practical to use with the amount of data I have (n > 700,000).

I have to admit that I was too lazy to read your post just above mine and made the faulty assumption that free() was the same as uniform() when setting my variance priors. I've changed them to half-Cauchys now (I think Gelman recommended these as priors on variances in hierarchical models). I think brms uses a weakly regularizing student_t with df = 3, i.e. student_t(df = 3, mu = 0, sigma >= 10), by default, which should be pretty close to a Cauchy with a large scale.
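
(For what it's worth, a half-Cauchy can be sketched in greta with the same truncation trick shown above, assuming a cauchy() distribution is available in the version you're running; the scale of 2.5 is just an illustrative choice:)

# half-Cauchy prior on a hierarchical standard deviation
sd_intc = variable(lower = 0)
distribution(sd_intc) = cauchy(0, 2.5)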

I'll try to implement random slopes as well. I hope you find time to implement the lkj(eta) prior (for the correlation matrix of the variance-covariance structure) in the future! For now, I'll see if I can wrap my head around wishart() instead.

Thanks a lot for making greta. The sampling speed is more than impressive - I can't stop watching the progress bar when she speeds through my >700k dataset. 😃

goldingn commented 7 years ago

Great, thanks for the update and I'm really glad greta is helping!

Yeah, I'd be really interested to hear if there is another trick to add in. I was thinking the easiest thing would be to use a reference level for the random effects, a la fixed factor models.
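
A rough sketch of that idea, reusing the iris example above and assuming rbind() works on greta arrays: pin the first species at zero so the global intercept becomes the reference level.

# fix the first species effect at zero and estimate offsets for the others
species_offset = normal(0, 1, dim = n_species - 1) * sd_species
species_intercept <- rbind(zeros(1), species_offset)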

I'd not seen the lkj distribution. Looks great and should be easy to implement! Will let you know when it's done.

If you get the latest dev version of greta (devtools::install_github('goldingn/greta@dev')) there should be a slight speedup on the sampler, automatic tuning of epsilon during warmup, and a better chance of greta finding feasible initial values.

paul-buerkner commented 7 years ago

In brms or rstanarm, predictors are centered around zero and the real intercept is recovered later on. This is done explicitly in the transformed data / generated quantities blocks.
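
To illustrate the same idea in greta terms, here's a rough sketch based on the iris example above (hypothetical names, and not the code brms actually generates):

# center the predictor, fit on the centered scale, then recover the original intercept
x_bar <- mean(iris$Sepal.Width)
x_c <- as_data(iris$Sepal.Width - x_bar)

intercept_c = normal(0, 10)
mu <- intercept_c + species_intercept[species] + coefficient * x_c
distribution(y) = normal(mu, sd)

# after sampling, the intercept on the original scale is
# intercept = intercept_c - coefficient * x_bar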

I haven't seen comparisons between Stan (called directly or via brms) and greta yet, but I know people have used brms for datasets with a few million observations, so that is definitely possible.

goldingn commented 7 years ago

Ah, that makes sense - thanks! It doesn't make much sense for that to be implemented in greta, but I'll point to and recommend that approach in docs, tutorials etc.

Speed comparisons coming soon; I only just got access to a GPU. But to get a rough idea, table 2 here compares Stan with Edward (a Python library that also uses TensorFlow) on logistic regression with a ~600,000 x 50 design matrix. Parallelising the matrix multiplication gives a 20x (on 12 CPUs) to 35x (on a GPU) speed-up over Stan (on 1 CPU). That's obviously a pretty specific use case; Stan/brms/rstanarm will be much quicker on smaller datasets, and possibly for models that aren't dominated by linear algebra.

paul-buerkner commented 7 years ago

Very interesting, thanks! As far as I understand, this is a runtime comparison of one chain, right? For a more thorough comparison, I think one has to take into account multiple chains, let's say 12 for 12 CPUs if one chain runs on only one CPU. Also, the measure of efficiency should be something like (minimum) effective sample size per unit time. For simple models like logistic regression, the effective sample size will be very close to the actual number of posterior samples for most MCMC methods, but for more complicated models, differences in the algorithms and their implementations will come into play, leading to possibly substantial differences in the effective sample size. This should be reflected in more advanced benchmarks.

goldingn commented 6 years ago

More examples are on the website now and staged for the next website release, so closing this issue.