Basis function approximation to Gaussian processes

paul-buerkner commented 6 years ago

GPs in brms or Stan in general are currently terribly slow for larger data sets. The GP approximations discussed in https://github.com/gabriuma/basis_functions_approach_to_GP seem to scale much better with data and I think would make a nice feature for brms.

paul-buerkner commented 6 years ago

@avehtari Approximate Gaussian processes for a single covariate are now implemented in brms on the github branch gp-approx to be installed via

devtools::install_github("paul-buerkner/brms", ref = "gp-approx")

Here is an example from the case-study implemented in brms using 5 basis functions:

Fun <- function(x) 1/10*(x/10+1.7)^3 - 2/5*(x/10+1.7)^2 + 1/5*sin((x/10+1.7)^3)
x <- seq(-5,5,0.1) 
N <- length(x)
set.seed(4321)
y <- Fun(x) + rnorm(N, 0, 0.08)
dat <- data.frame(y, x)

fit1 <- brm(y ~ gp(x, k = 5), dat)
summary(fit1)
plot(marginal_effects(fit1), points = TRUE)

Convergence is not ideal, but this may be a problem of the example data set and not necessarily related to my implementation, which mirrors that of the case-study.

A few things to discuss:

Default of L is currently 5/4 * (max(x) - min(x)) corresponding to the choice in the case study. Is there any theoretical reason for this choise?
I named the number of basis function k in gp() for consistency with splines in brms. Any objection to this naming choice?

paul-buerkner commented 6 years ago

A prototype of multidimensional approximate GPs should now work as well. Here is a small example:

dat <- mgcv::gamSim(2, n = 120, scale = 0.3)$data
fit <- brm(y ~ gp(x, z, k = 5), data = dat, chains = 1)
marginal_effects(fit, "x:z")
marginal_effects(fit, "x:z", surface = TRUE, resolution = 30)

paul-buerkner commented 6 years ago

I have a question for @gabriuma about multidimensional approximate GPs:

In your stan code you write M_nB is M * D. Does this imply that the basis functions of the multidimensional GP are just the basis functions of the univariate GPs bound together into a matrix with more columns? To what kind of multidimensional splines does that correspond (addtive, multiplicative, or something else)?

gabriuma commented 6 years ago

Hi @paul-buerkner,

@avehtari @MichaelRiis,

We are working now on the study for the suitable values of L and its implications. Meanwhile, the default of L as |5/4 * (max(x) - min(x))| seems to be appropiate. Inputs x has to enter into the model always individually centered; they need to meet x \in [-L,L].

About your last comment, the basis functions in the multidimensional case are not just bounding together the univariate basis functions. M^D new basis functions have to be created. Each new one is the product of D univariate basis functions. One example:

M=2 (two basis functions) D=3 (three dimensions)

Univariate basis functions: bf11, bf12, bf13, bf21, bf22, bf23

New (multivariate) basis functions: bf1 = bf11 . bf12 . bf13 bf2 = bf11 . bf12 . bf23 bf3 = bf11 . bf22 . bf13 bf4 = bf11 . bf22 . bf23 bf5 = bf21 . bf12 . bf13 bf6 = bf21 . bf12 . bf23 bf7 = bf21 . bf22 . bf13 bf8 = bf21 . bf22 . bf23

Also, there are some changes in the spectral density and the lambda expressions. You have all of these defined as functions in the Stan code:

spd_nD: is the expression for the spectral density in the n-dimensional case. lambda_nD: is the expression for the lambda in the n-dimensional case. phi_nD: computes a new single multivariate basis function, as a function of the required combination.

In the data block of the Stan code, there is a matrix called 'indices'. This matrix contains the indices of the different combinations. Each row contains the combination to form a new basis function.

I put in github R-code with examples in 2D and 3D. This will be, I think, easy to run. https://github.com/gabriuma/basis_functions_approach_to_GP/blob/master/multi_dimensional/R-code_example.r https://github.com/gabriuma/basis_functions_approach_to_GP/blob/master/multi_dimensional/diabetes_dataset/diabetes.csv

I have the intention to do a notebook with the multivariate case soon, but not immediately.

The computational expenses of the basis function approach is in the order of O(M^D * (N-1)), where N is the number of observations. More than 10 basis functions with 3 dimensions is starting to be a bit slow.

If I did not express myself something well, please tell me:)

paul-buerkner commented 6 years ago

Thanks, @gabriuma for the detailed explanation! That makes a lot of sense and is actually how I implemented it in brms based on your Stan code. I was just slightly confused by your comment in the Stan code that M_nD is M * D not M^D and so I wanted to make sure I didn't completely misunderstand the approach.

I will add automatic centering of covariates to brms, because we can't expect users to always do it themselves.

gabriuma commented 6 years ago

Oh sorry! Maybe I wrote you more than needed. Anyway, I suppose it can be useful.

Yes, I agree automatic centering of covariates will be recommendable.

El mié., 31 oct. 2018 a las 9:31, Paul-Christian Bürkner (< notifications@github.com>) escribió:

Thanks, @gabriuma https://github.com/gabriuma for the detailed explanation! That makes a lot of sense and is actually how I implemented it in brms based on your Stan code. I was just slightly confused by your comment in the Stan code that M_nD is M * D not M^D and so I wanted to make sure I didn't completely misunderstand the approach.

I will add automatic centering of covariates to brms, because we can't expect users to always do it themselves.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/540#issuecomment-434588216, or mute the thread https://github.com/notifications/unsubscribe-auth/Acl8mphTkod-bN9XNlVnrTibvAKS1hIdks5uqVHAgaJpZM4X7yIG .

paul-buerkner commented 6 years ago

No worries! My understanding of all of this is still limited so I appreciate detailed explanations.

Automatic centering is now done as well.

paul-buerkner commented 6 years ago

There is one more issue, I want to discuss with you (@gabriuma, @avehtari), which is the isotropy of multivariate Gaussian processes (i.e. same smoothing of all covariates). This basically translates to having only a single length-scale parameter for all covariates within the same GP. Currently, brms makes this assumption in both exact and approximate GPs, but I think it's worth generalizing that.

In your code, I have seen you allowing for different length-scale parameters per covariate in the approximate GPs, but it was not completely clear to me to which exact GP parameterization this corresponds (i.e. how the covariance kernel looks like). Can you help me out with this?

avehtari commented 6 years ago

In terms of GPML sigmam^2 exp(- 1/2 \sum{d=1}^D r_d^2 / l_d^2) where l_d is the length scale for dimension d

paul-buerkner commented 6 years ago

Thanks! That looks reasonable. I hope to have non-isotropic (exact and approximate) GPs working at some point next week.

gabriuma commented 6 years ago

In function 'gp_123' in the Stan code file https://github.com/gabriuma/basis_functions_approach_to_GP/blob/master/multi_dimensional/diabetes_dataset/model_GP_multi.stan there are two ways to compute the multivariate squared exponential covariance function with different lengthsales: (1) Through the use of the function 'cov_exp_quad' (Line 47). (2) Writting the full expression of the covariance function (Lines 50-59). These both are for an specific and known number of dimensions (lengthscales). A proposal to generalize (1) to any number D of dimensions could be: cov = cov_exp_quad(X[,1], sdgp, lscale[1]) for (i in 1:D) cov = cov .* cov_exp_quad(X[,d], 1, lscale[d])

And for (2): for (i in 1:N) for(j in 1:N) cov[i,j] = sdgp^2 exp(-1/2 (X[i,]-X[j,])^2 /* lscale)

where X is the NxD matrix of inputs and lscale is the D-vector of lengthscales

paul-buerkner commented 5 years ago

Thanks for your input! Non-isotropic exact and approximate GPs should now be working. I would greatly appreciate if you could run some tests as well. As a start, here is a simple data set with several ways to fit splines and GPs:

set.seed(1245)
dat <- mgcv::gamSim(4, n = 90, scale = 2)

# isotropic spline
fit1 <- brm(y ~ s(x1, x2, by = fac), dat, chains = 2)
summary(fit1)
bayes_R2(fit1)
marginal_effects(
  fit1, "x2:x1", conditions = data.frame(fac = unique(dat$fac)),
  resolution = 20, surface = TRUE
)

# non-isotropic spline
fit2 <- brm(y ~ t2(x1, x2, by = fac), dat, chains = 2)
summary(fit2)
bayes_R2(fit2)
marginal_effects(
  fit2, "x2:x1", conditions = data.frame(fac = unique(dat$fac)),
  resolution = 20, surface = TRUE
)

# isotropic exact GP
fit3 <- brm(y ~ gp(x1, x2, by = fac), dat, chains = 2)
summary(fit3)
bayes_R2(fit3)
marginal_effects(
  fit3, "x2:x1", conditions = data.frame(fac = unique(dat$fac)),
  nsamples = 300, resolution = 20, surface = TRUE
)

# non-isotropic exact GP
fit4 <- brm(y ~ gp(x1, x2, by = fac, iso = FALSE), dat, chains = 2)
summary(fit4)
bayes_R2(fit4)
marginal_effects(
  fit4, "x2:x1", conditions = data.frame(fac = unique(dat$fac)),
  nsamples = 300, resolution = 20, surface = TRUE
)

# isotropic approximate GP
fit5 <- brm(y ~ gp(x1, x2, by = fac, k = 10), dat, chains = 2)
summary(fit5)
bayes_R2(fit5)
marginal_effects(
  fit5, "x2:x1", conditions = data.frame(fac = unique(dat$fac)),
  resolution = 20, surface = TRUE
)

# non isotropic approximate GP
fit6 <- brm(y ~ gp(x1, x2, by = fac, k = 10, iso = FALSE), dat, chains = 2)
summary(fit6)
bayes_R2(fit6)
marginal_effects(
  fit6, "x2:x1", conditions = data.frame(fac = unique(dat$fac)),
  resolution = 20, surface = TRUE
)

avehtari commented 5 years ago

After re-installing with

devtools::install_github("paul-buerkner/brms", ref = "gp-approx")

I get an error

No matches for: 

  gp(vector[], real, real, vector)

Available argument signatures for gp:

  gp(vector[], real, vector, vector)

Do I need to update something else?

paul-buerkner commented 5 years ago

This should be working. How does the formula of your model look like?

Am Di., 13. Nov. 2018 um 11:36 Uhr schrieb Aki Vehtari < notifications@github.com>:

After re-installing with

devtools::install_github("paul-buerkner/brms", ref = "gp-approx")

I get an error

No matches for:

gp(vector[], real, real, vector)

Available argument signatures for gp:

gp(vector[], real, vector, vector)

Do I need to update something else?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/540#issuecomment-438218018, or mute the thread https://github.com/notifications/unsubscribe-auth/AMVtAMWe8OZbif8UkQKuPL5-9YjHOWpHks5uuqC2gaJpZM4X7yIG .

avehtari commented 5 years ago

I'm trying to run your example

fit3 <- brm(y ~ gp(x1, x2, by = fac), dat, chains = 2)

avehtari commented 5 years ago

Might have been a problem due not restarting R after re-installing brms

paul-buerkner commented 5 years ago

Seems likely, because it works for me in that branch.

avehtari commented 5 years ago

With by = fac are GPs for different factors completely independent or is there hierarchical model?

paul-buerkner commented 5 years ago

They are completely independent for now, but at some point, we should add hierarchical GPs as well. There is even an issue for it, I believe: #412

avehtari commented 5 years ago

waic(fit1, fit2, fit3, fit4, fit5, fit6) That comparison is not reliable. Waic is failing badly, but when called with several fit objects it's not reporting the diagnostic warnings...

paul-buerkner commented 5 years ago

I know. I did accidentally copy it over from my test script and I have now removed it from the above example.

gabriuma commented 5 years ago

Hi @paul-buerkner I added the Leukemia example where we use hierarchical GP and hierarchical Basis functions. https://github.com/gabriuma/basis_functions_approach_to_GP/tree/master/multi_dimensional/Leukemia_dataset I hope it will be useful for you to implement hierarchical GP and approximate GP in brms package. If there are too many things in the script and Stan-codes, I can simplifly them for you. Just tell me.

I also added two graphs (both are the same just with different zoom) https://github.com/gabriuma/basis_functions_approach_to_GP/blob/master/Graph_relation_M_lengthscale_L.png https://github.com/gabriuma/basis_functions_approach_to_GP/blob/master/Graph_zoom_relation_M_lengthscale_L.png that show the expected relation between the lengthscale to half-data-range ratio (rho/S), the number of basis functions M, and the boundary condition L (L= c·S) (S= half-data-range). It has been obtained in the one-dimensional case, although it can be useful as well in the multi-dimensional case. This graph seems to be useful although remains some further discussion and possible further studies.

paul-buerkner commented 5 years ago

Thanks! Can you explain the plots to me once more? The title says "difference between GP and BF ..." but I see no such difference in the plot. Apparently, for smaller rho (relative to S) we need more basis functions (to approximate GPs appropriately?), but I don't see the "criterion" that tells me for what I need those basis functions. Perhaps I am misunderstanding something

MichaelRiis commented 5 years ago

The graphs indicate when the area between the true covariance function and the approximate covariance function are smaller than some threshold. That is, how many basis functions do we need in order satisfy the condition: int |cov_true(x) - cov_approx(x)| dx < threshold.

Hope this clarifies things a bit.

paul-buerkner commented 5 years ago

That clarifies it, thanks!

paul-buerkner commented 5 years ago

I took a look at the hierarchical example and it doesn't seem so hierarchical to me, but maybe our definition of hierarchical varies. How I read your code is as follows:

gp(sex1) = gp_all + gp_1
gp(sex2) = gp_all

Naively, I would expect something along the lines of

gp(sex1) = gp_all + gp_1
gp(sex2) = gp_all + gp_2

With some penalty on (the parameters of) gp_1 and gp_2 to make them identified together with the "mean" Gaussian process gp_all. At least, this is how most hierarchical splines are working to my understanding.

gabriuma commented 5 years ago

Hi, If you put the constraint gp_2=0, you obtain our code

El vie., 16 nov. 2018 a las 16:28, Paul-Christian Bürkner (< notifications@github.com>) escribió:

I took a look at the hierarchical example and it doesn't seem so hierarchical to me, but maybe our definition of hierarchical varies. How I read your code is as follows:

gp(sex1) = gp_all + gp_1 gp(sex2) = gp_all

Naively, I would expect something along the lines of

gp(sex1) = gp_all + gp_1 gp(sex2) = gp_all + gp_2

With some penalty on (the parameters of) gp_1 and gp_2 to make them identified together with the "mean" Gaussian process gp_all. At least, this is how most hierarchical splines are working to my understanding.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/540#issuecomment-439409829, or mute the thread https://github.com/notifications/unsubscribe-auth/Acl8mo7RDSYfTrcIvyK96HwjyeHbF-Xrks5uvsuZgaJpZM4X7yIG .

avehtari commented 5 years ago

but maybe our definition of hierarchical varies.

The way we have defined needs less parameters and follows the approach used for factor valued covariates, e.g. in stan_glm. One of the factors is chosen to be the baseline. Having levels+1 GPs would require additional constraint which can make sampling more difficult.

paul-buerkner commented 5 years ago

It's clear that we can achieve equality if we set gp_2 = 0, but that's not what I had in mind. It's more of what Aki pointed to of having levels + 1 GPs. The other thing is not really "hierarchical" to me, but I guess it's indeed just a matter of definition. will think of this more thoroughly, but since it doesn't need to be part of the PR, we are not in a hurry I guess.

The weekend, I will round-up the approximate and non-isotropic GP stuff and make a PR to master so we got this going for the next release of brms. Thank you both so much for your help!

gabriuma commented 5 years ago

Thank you so much to you, Paul!

El vie., 16 nov. 2018 a las 18:28, Paul-Christian Bürkner (< notifications@github.com>) escribió:

It's clear that we can achieve equality if we set gp_2 = 0, but that's not what I had in mind. It's more of what Aki pointed to of having levels + 1 GPs. The other thing is not really "hierarchical" to me, but I guess it's indeed just a matter of definition. will think of this more thoroughly, but since it doesn't need to be part of the PR, we are not in a hurry I guess.

The weekend, I will round-up the approximate and non-isotropic GP stuff and make a PR to master so we got this going for the next release of brms. Thank you both so much for your help!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/540#issuecomment-439448287, or mute the thread https://github.com/notifications/unsubscribe-auth/Acl8moI_i6N89hJdovWl5SotROuhZuItks5uvuewgaJpZM4X7yIG .

MichaelRiis commented 5 years ago

Rather than "constraining gp_2=0", I think about it like this:

Our implementation is just a re-parametrization of the "standard hierarchical model",

gp(sex1) = gp_all + gp_1 gp(sex2) = gp_all + gp_2

in the following way

gp(sex1) = gp_all + gp_1 = f1 gp(sex2) = gp_all + gp_2 = (gp(sex1) - gp_1) + gp_2 = gp(sex1) + (gp_2 - gp_1) = f1 + f_diff,

where we estiamte f1 and f_diff rather than gp_all, gp_1 and gp_2.

paul-buerkner commented 5 years ago

Yeah, I see that idea, but I am not convinced of calling it hierarchical in the sense of how we usually use that term in statistics. It may lead to confusion if we sell it like this.

Anyway, one can already achieve this kind of GP model in brms, although perhaps not most efficiently:

brm(y ~ gp(x) + gp(x, by = z), ...)

where z is a numeric variable with 0 for the reference level and 1 for the other level. Here is an example:

set.seed(1245)
dat <- mgcv::gamSim(4, n = 90, scale = 2)
dat$z2 <- as.numeric(dat$fac == 2)
dat$z3 <- as.numeric(dat$fac == 3)

library(brms)
fit <- brm(
  y ~ gp(x2, k = 10) + gp(x2, by = z2, k = 10) + gp(x2, by = z3, k = 10), 
  dat, chains = 2
)
summary(fit)
conds <- data.frame(z2 = c(0, 1, 0), z3 = c(0, 0, 1))
marginal_effects(fit, "x2", conditions = conds)

paul-buerkner commented 5 years ago

Ok, so here is a proposal to bring in and generalize your contrast GP approach.

The default behavior of handling factor by-variables in GPs is basically cell-mean coding, that is each level gets its own GP. The approach you call 'hierarchical' is basically contrast/dummy coding, but of course, we could use any other coding as well, for instance, sum/effect coding, which is just a hard sum-to-zero constraint.

Each factor variable in R can have a contrast attribute to define the contrasts being used and if not specified, default coding (i.e., dummy coding) will be used.

So my proposal is to add an argument cmc to gp() defaulting to TRUE and indicating whether cell-mean coding (TRUE) or other coding based on the specified contrasts (FALSE) should be used.

To improve efficiency, we could try to deselect zero-entries before computing GPs (as you do in your Stan code).

The above example with dummy coding would be

fit <- brm(y ~ gp(x2, by = fac, k = 10, cmc = FALSE), dat, chains = 2)

The above example with effect coding would be

contrasts(dat$fac) <- contr.sum(3)
fit <- brm(y ~ gp(x2, by = fac, k = 10, cmc = FALSE), dat, chains = 2)

What would you think of this approach?

paul-buerkner commented 5 years ago

The contrast GP approach is now working. So basically the models in my former post should now be working.

paul-buerkner commented 5 years ago

The PR is merged so I am closing this issue now. Feel free to continue using it for discussion around GPs in brms. If we need to make bigger changes, we may also just open it again.

paul-buerkner / brms

Basis function approximation to Gaussian processes #540