probsys / AutoGP.jl

Automated Bayesian model discovery for time series data
https://probsys.github.io/AutoGP.jl/
Apache License 2.0

Simple examples #8

Closed lukego closed 10 months ago

lukego commented 10 months ago

Cool project!

I'm kicking the tires and thought I'd collect some notes here. Apologies in advance if this is a misuse of the repo issues.

I'm using this simple test function to look at the posterior distributions of some simple synthetic data:

import AutoGP

# Fit an AutoGP model to (xs, ys) and print the top-n kernel structures by particle weight.
function top_models(xs, ys; n=3, verbose=false)
    model = AutoGP.GPModel(xs, ys; n_particles=16);
    AutoGP.seed!(6)
    AutoGP.fit_smc!(model; schedule=AutoGP.Schedule.linear_schedule(length(xs), .10), n_mcmc=75, n_hmc=10, verbose=verbose)
    weights = AutoGP.particle_weights(model)
    kernels = AutoGP.covariance_kernels(model)
    logml = AutoGP.log_marginal_likelihood_estimate(model)
    println("Log-ML: $(logml)\n")
    best = sort(collect(zip(kernels, weights)), rev=true, by=x->x[2])
    for (i, (k, w)) in enumerate(best[1:n])
        println("Model $(i), Weight $(w)")
        Base.display(k)
    end
end

y=x

A line without noise is inferred pretty quickly. The log-ML estimate should have domain (-∞, 0] though, right?

(I haven't learned to read the parameters in the kernels yet. I'm assuming they are somehow standardized and make sense for now.)

julia> @time top_models(Vector(1:100), Vector(1:100))
Log-ML: 282.2694850370343

Model 1, Weight 0.14821784601368354
LIN(0.37; 0.03, 0.58)

Model 2, Weight 0.08312684024116766
LIN(0.23; 0.20, 0.53)

Model 3, Weight 0.08073108954280904
LIN(0.38; 0.18, 1.48)

 56.960039 seconds (659.07 M allocations: 132.846 GiB, 24.89% gc time)

y=x^2, y=x^3

Polynomials likewise.

Should it be a surprise to see models with extra sum terms so heavily weighted at n=100 and without noise? Maybe not: I need to read about the default priors on model complexity.

julia> @time top_models(Vector(1:100), Vector(1:100).^2)
Log-ML: 271.9224429110566

Model 1, Weight 0.10509916636060361
×
├── +
│   ├── LIN(0.66; 0.38, 0.31)
│   └── LIN(0.16; 0.05, 0.24)
└── LIN(0.12; 0.21, 1.24)

Model 2, Weight 0.10355590097846587
×
├── LIN(0.46; 0.08, 0.13)
└── LIN(0.03; 0.33, 1.80)

Model 3, Weight 0.09826188261078113
×
├── LIN(0.09; 0.77, 0.61)
└── LIN(0.25; 0.11, 1.06)

160.866469 seconds (1.76 G allocations: 425.998 GiB, 26.03% gc time)
julia> @time top_models(Vector(1:100), Vector(1:100).^3)
Log-ML: 214.82244718394827

Model 1, Weight 0.20730654884622823
×
├── LIN(0.04; 0.26, 1.07)
└── ×
    ├── LIN(0.17; 1.12, 0.48)
    └── LIN(0.13; 0.14, 1.36)

Model 2, Weight 0.12859851725732885
×
├── LIN(0.10; 0.39, 0.54)
└── +
    ├── LIN(1.28; 0.38, 0.07)
    └── ×
        ├── LIN(0.17; 0.59, 0.28)
        └── ×
            ├── LIN(0.02; 0.62, 0.85)
            └── LIN(0.22; 0.07, 0.79)

Model 3, Weight 0.11513375122907542
×
├── LIN(0.20; 0.19, 0.35)
└── ×
    ├── LIN(0.19; 0.25, 0.66)
    └── LIN(0.06; 0.62, 0.53)

140.634843 seconds (1.42 G allocations: 383.902 GiB, 26.66% gc time, 0.05% compilation time)

Uniform random noise

Inference on uniform random noise takes longer and ultimately there doesn't seem to be any suitable model on the GP menu. It would be nice to have a diagnostic for flagging this posterior as being a lousy fit.

julia> @time top_models(Vector(1:100), rand(100))
Log-ML: -12.138325663870692

Model 1, Weight 0.20464732968472485
×
├── GE(1.09, 0.92; 0.23)
└── ×
    ├── ×
    │   ├── PER(0.43, 0.16; 0.10)
    │   └── GE(0.05, 0.57; 0.13)
    └── PER(0.08, 0.11; 0.09)

Model 2, Weight 0.15086263315639292
PER(0.02, 0.51; 0.05)

Model 3, Weight 0.1454059350595094
PER(0.23, 0.90; 0.01)

746.373053 seconds (4.08 G allocations: 1.519 TiB, 18.19% gc time, 0.00% compilation time)

y=k

Constant y values seem to be an awkward edge case that propagates a NaN through the simulation:

julia> @time top_models(Vector(1:100), ones(100))
Log-ML: 0.0

Model 1, Weight NaN
GE(0.05, 0.39; 0.97)

Model 2, Weight NaN
LIN(0.13; 0.78, 0.25)

Model 3, Weight NaN
GE(0.08, 1.18; 0.18)

  3.341694 seconds (15.98 M allocations: 6.689 GiB, 14.88% gc time)

big data

Finally, the default data annealing sampler seems to choke on larger datasets, somewhere in the thousands:

julia> @time top_models(Vector(1:10000), rand(10000), verbose=true)
Running SMC round 1000/10000
Particle Weights: [3.75e-224, 4.75e-205, 1.09e-109, 0.00e+00, 1.00e+00, 1.25e-88, 0.00e+00, 3.31e-320, 0.00e+00, 7.74e-21, 3.03e-47, 0.00e+00, 3.78e-91, 1.33e-91, 3.15e-94, 2.45e-29]
Particle ESS: 0.0625
resampled true
^C

I wonder how to approach this case, where compute is scarcer than data. Maybe it is a textbook case for Bayesian optimization, with a Kriging-like strategy choosing which data points to include in each batch to maximize the information gained from one SMC step to the next?

(Thanks for indulging my thinking aloud here.)

fsaad commented 10 months ago

Thanks for sharing these interesting findings. Plotting the mean predictions and 95% prediction intervals (as in the tutorials) for each mini-benchmark would also help debugging.
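
For instance, a minimal sketch of generating those forecasts, assuming the AutoGP.predict interface from the tutorials and that top_models is modified to return its fitted model (the exact return format of predict is not shown here):

model = top_models(Vector(1:100), rand(100))    # assumes `top_models` ends with `return model`
ds_query = Vector(1:150)                        # extend past the observed range to see extrapolation
forecasts = AutoGP.predict(model, ds_query; quantiles=[0.025, 0.975])
# `forecasts` holds the predictive means and requested quantiles, ready for plotting.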

The log-ML estimate should have domain (-∞, 0] though, right?

The log marginal likelihood for a continuous random variable (i.e., the observed time series data) can be positive, because we are taking the log of probability density values.  For example,

>>> import scipy.stats
>>> scipy.stats.norm.logpdf(0, loc=0, scale=1e-5)
10.593986931765556
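
The same sanity check in Julia, using Distributions.jl rather than AutoGP:

julia> using Distributions
julia> logpdf(Normal(0, 1e-5), 0)  # ≈ 10.594, a positive log-density, as above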

(I haven't learned to read the parameters in the kernels yet. I'm assuming they are somehow standardized and make sense for now.)

That is correct: the time and variable axes are scaled to [0, 1]. You can see the scaled values used to model the data using model.ds_transform and model.y_transform. The kernel parameters should then be interpreted with respect to these scaled values.
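
For example (a sketch using only the field names mentioned above; their concrete types are an implementation detail):

model = AutoGP.GPModel(Vector(1:100), Vector(1:100).^2; n_particles=16)
model.ds_transform   # rescaling applied to the time axis
model.y_transform    # rescaling applied to the observations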

Should it be a surprise to see models with extra sum terms so heavily weighted at n=100 and without noise?

The extra sum does not really matter, because the sum of two linear kernels is itself a linear kernel.  The space of kernel structures being sampled over is highly overparameterized (there are multiple structures that induce the same probability distribution).
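
One way to see this, assuming linear kernels parameterized roughly as $k_i(x, x') = c_i + s_i (x - a_i)(x' - a_i)$ (which may not be AutoGP's exact parameterization):

$$
k_1(x,x') + k_2(x,x') = (s_1 + s_2)\,xx' - (s_1 a_1 + s_2 a_2)(x + x') + \text{const} = c' + s'(x - a')(x' - a'),
$$

with $s' = s_1 + s_2$ and $a' = (s_1 a_1 + s_2 a_2) / s'$, i.e. another linear kernel up to a constant offset.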

Inference on uniform random noise takes longer and ultimately there doesn't seem to be any suitable model on the GP menu.

That is a strange case. We should expect a constant kernel plus noise. Perhaps we do not find it because the default prior over kernels skips the "GP.Constant" structure:

https://github.com/probsys/AutoGP.jl/blob/e81ce37b395f228be29ea444efc2f9ccd2b1a817/src/GP.jl#L594-L597

See what happens if you set node_dist_leaf etc. to have 1 for GP.Constant in those three vectors. You can provide a custom GP.GPConfig to the GPModel constructor:

using LinearAlgebra: normalize  # `normalize` comes from the LinearAlgebra stdlib
gp_config = AutoGP.GPConfig(node_dist_leaf=normalize([1., 1, 0, 1, 1]), ...)
model = AutoGP.GPModel(ds, y; n_particles=10, config=gp_config)

Constant y values seem to be an awkward edge case that propagates a NaN through the simulation:

Thanks for catching this case---it seems like a numerical issue is being triggered.

Finally, the default data annealing sampler seems to choke on larger datasets, somewhere in the thousands

Try using a logarithmic annealing schedule, which does more of the computation at fewer data-annealing steps:

https://github.com/probsys/AutoGP.jl/blob/e81ce37b395f228be29ea444efc2f9ccd2b1a817/src/Schedule.jl#L46
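
A hedged sketch of swapping the schedule in the top_models harness above (the exact constructor name and arguments are defined in the linked Schedule.jl; the call below is a guess at its shape, not a verified signature):

# Hypothetical call: the logarithmic counterpart of `linear_schedule`; check Schedule.jl
# for the real name and arguments before using.
schedule = AutoGP.Schedule.logarithmic_schedule(length(xs))
AutoGP.fit_smc!(model; schedule=schedule, n_mcmc=75, n_hmc=10, verbose=verbose)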

Since the current implementation uses dense Gaussian processes that scale as O(n^3), we do not really expect it to handle more than a few thousand data points. The final paragraph on Page 9 of the paper discusses some modeling trade-offs we could impose to help scale up the method, e.g., variational GPs and/or state-space model representations.

lukego commented 10 months ago

The space of kernel structures being sampled over is highly overparameterized

I suppose the main use case for AutoGP is forecasting and in this context the posterior-predictive density is what matters. Maybe in practice the individual processes in the posterior would tend to be marginalized out anyway?

I have been reading about "Automatic Model Description" in David Kristjanson Duvenaud's thesis and I find myself trying to interpret the processes in the posterior as concrete hypotheses about the data-generating process.

The log marginal likelihood for a continuous random variable can be positive

Thanks for pointing that out. Even so my mind wasn't at all prepared to see huge marginal likelihood values like 3.8e122. In retrospect that makes sense for high-dimensional noise-free synthetic data. "TIL."

Plotting the mean predictions and 95% prediction intervals (as in the tutorials) for each mini-benchmark would also help debugging.

Indeed. The "surprising" fit for uniform noise does have a pretty reasonable posterior predictive density:

[figure "uniform1": posterior predictive for the uniform-noise fit]

See what happens if you set node_dist_leaf etc. to have 1 for GP.Constant in those three vectors

Setting it to 1 didn't have much impact but higher values did. For example with prior weight of 10 the posterior is mostly constants with some lines:

[figure "uniform_prior10": posterior with prior weight 10 for Constant kernels]

Is it interesting that the Periodic processes were squeezed out by Lines here when only the prior weight for Constants was changed? I wonder if that's a Monte Carlo artefact where the MCMC steps are more likely to convert Constants from the prior into Lines than Periodic processes?

I'm surprised that the non-constant models are so competitive. They would seem to be badly handicapped by spreading their prior probability mass thinly across extra parameters. Maybe this makes sense in context though, like the astronomical log-ml estimates. Have to consider that the Constant process is a lousy model of non-Gaussian uniform noise anyway.

Thanks for the stimulating discussion!

I'm also wondering: Have you considered generalizing AutoGP to multivariate models? I know that you considered this problem in your thesis so the feature seems conspicuous in its absence :)

lukego commented 10 months ago

Excuse my replying twice but I had a few extra moments over here in CEST...

I suppose that "Monte Carlo artefact" is not an accurate diagnosis because this is the prior distribution and not just an initial proposal distribution. Boosting the prior probability of Constant models should rightfully also boost the prior probability of models close to them in trans-dimensional RJMCMC space. So no funny business is being observed there.

Maybe the reason that the Constant kernels aren't more competitive is because they don't put much prior probability mass on the appropriate parameter value?

If I understand correctly this is the prior distribution over most kernel parameters:

import Plots
using Distributions: Normal
Plots.density([x for x in AutoGP.Model.transform_log_normal.(rand(Normal(), 100000), -1.5, 1.) if x <= 1])

and this might tend to favor more complex models that can find a fit using parameter values close to the peak in the prior:

[figure: density plot of the prior over kernel parameters]

lukego commented 10 months ago

Boosting the prior probability of Constant models should rightfully also boost the prior probability of models close to them in trans-dimensional RJMCMC space

My mistake. The MCMC proposals are simpler than this. The mechanism to convert a Constant kernel to something else (Line, Periodic, etc) would be a Subtree-Replace move sampling the new replacement kernel from the prior. (The MCMC proposals don't fuss about details like a Constant being similar to a Line with a small gradient.)

The update code is here https://github.com/probsys/AutoGP.jl/blob/main/src/inference_smc_anneal_data.jl#L89-L109

I'm surprised the MCMC achieves such high acceptance rates. Maybe it's that sampling the prior works reasonably well with univariate models and that new proposals often represent small tweaks to the original kernel? (It might be interesting to take a "longitudinal" look at the evolution of a particle.)

I imagined the structure accept/reject decision would need to at least be deferred until the HMC moves had a chance to propose improved parameters. Not so. HMC moves happen after (and conditional upon) a structure change being accepted by MH.

Quite an education to read this code!

fsaad commented 10 months ago

Have to consider that the Constant process is a lousy model of non-Gaussian uniform noise anyway.

Good point; I mistakenly thought the noise was Gaussian, not uniform. I agree that we don't have a great model for Constant + Uniform noise in the DSL, which perhaps explains the complex approximations trying to fit this data.

In your plots, extrapolate the forecasts beyond time = 100 on the x-axis to see the long-term trend in each case.

I'm also wondering: Have you considered generalizing AutoGP to multivariate models?

Multivariate could mean a few things here:

One could also consider non-time varying features in each of the above cases too...

All of these settings would have somewhat different modeling approaches for using GPs (multi-input / multi-output, etc.). The most immediate challenge is scalability. The fully-dense AutoGP does not do well beyond a few thousand observations, so we would need some type of modeling approximation (nearest-neighbor GP, state-space models, inducing points, etc.). The second challenge is designing a kernel DSL that works for richer types of data than 1D time series.

I imagined the structure accept/reject decision would need to at least be deferred until the HMC moves had a chance to propose improved parameters. Not so. HMC moves happen after (and conditional upon) a structure change being accepted by MH.

Yes. However, note that it is possible to improve parameters before doing the accept/reject decision (see Section 3.2.3 of the paper and cited works therein). The current Julia implementation does not have this strategy enabled, but in general there is a non-trivial trade-off between accuracy and computational cost, as shown in Figure 4 of that section.

lukego commented 10 months ago

we don't have a great model for Constant + Uniform noise in the DSL

I suppose that the DSL has been designed for modelling aggregate quantities like central tendencies. Uniform random noise is probably not very relevant in that context. I'm more focused on idiosyncratic engineering data, so assumptions of normality tend to bite me in the posterior, and I always want to start with awkward cases like uniform or skewed noise.

Anyway: here's how a longer forecast looks with n=200 uniform points and a prior weight of 5 for constants:

[figure "fig-uniform200": long-range forecast for n=200 uniform points with prior weight 5 for constants]

The posterior is a mixture of PER, CONST, LIN, and GE. Looks like polynomials are the fly in the ointment here and otherwise they all work pretty well.

All of these settings would have somewhat different modeling approaches

If I may digress, the problem I'm really interested in is structure discovery with respect to hyperparameter optimization surfaces of a dozen or so dimensions.

For example, if we were modelling AutoGP.jl, we would be interested in predicting measured variables like:

based on the values of fixed variables like:

The fitted model would help inform user decisions like

and also developer/release-manager inference like

I'm viewing this as a structure discovery problem where we'll have to settle for a lousy approximation that (automatically) excludes or marginalizes over lots of important variables to find compromise models that are interpretable and tractable for different aspects of the problem. For example, a model might just tell you which annealing schedules to consider based on available time versus quantity of data.

Even simple models might be competitive with baseline user behavior like "I'd better not touch the defaults in case it runs forever or gives me the wrong answer." :grin:

This is not the same problem that AutoGP solves but it's not an entirely different genre either, right? It seems to me like a solution to this problem would have a very similar anatomy to AutoGP. Maybe it could even be approached by comparing multiple AutoGP runs for model selection over one-dimensional projections.

If you have any thoughts on that problem then I am all ears.