greta-dev / greta

simple and scalable statistical modelling in R
https://greta-stats.org

greta goes NUTS #314

Open jeffreypullin opened 5 years ago

jeffreypullin commented 5 years ago

TensorFlow Probability has just open-sourced its unrolled* NUTS implementation here

Personally, I think having a NUTS implementation would be a big thing for greta.

*I assume this means that it's very parallel(isable)

goldingn commented 5 years ago

Yeah, this'll be great. I have a suspicion it'll be a fair bit slower than the HMC implementation, but it will hopefully make up for that in sampling efficiency.

We'll need to wait for the next TFP release before we can put this in a greta release or the master branch, but it should be fairly straightforward to implement in a branch for now. We just need to copy and modify the hmc_sampler class to create a nuts_sampler class, provide a nuts() constructor like the hmc() constructor, and probably modify Travis and DESCRIPTION to use the TFP nightly builds.
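
For reference, a rough sketch (not greta's actual code; the argument names are guesses) of what a nuts() constructor could look like, by analogy with hmc():

# sketch only: a nuts() constructor by analogy with greta's hmc();
# max_doublings, epsilon and diag_sd are assumed tuning parameters
nuts <- function(max_doublings = 5, epsilon = 0.1, diag_sd = 1) {
  obj <- list(
    parameters = list(
      max_doublings = as.integer(max_doublings),
      epsilon = epsilon,
      diag_sd = diag_sd
    ),
    class = nuts_sampler  # a sampler class copied and modified from hmc_sampler
  )
  class(obj) <- c("nuts sampler", "sampler")
  obj
}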

goldingn commented 5 years ago

I just pushed up a nuts branch which implements an interface to the new sampler. You can install the required nightly TF stuff with some variant of:

install_tensorflow(version = "nightly")

It looks like the API might be different (or buggy), as it's erroring for me (when I run mcmc(m, nuts())) with this message:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Cannot convert a partially known TensorShape to a Tensor: (11, ?, 1)

Will try to take a look at that soon.

I haven't changed any of the Travis instructions, so expecting it to fail there too.

jeffreypullin commented 5 years ago

I’m happy to look into that error/the interface if you would like - I should have some time this weekend. (I was actually going to offer to submit a pull request to add the nuts sampler - but you beat me to that! :blush:)

goldingn commented 5 years ago

Great, please do! And let me know if you need help navigating the batch dimension stuff. It isn't very intuitive.
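
Roughly: the leading dimension of the free state is the batch (chain) dimension, and the target log-density function has to return one value per chain. A toy sketch (shapes arbitrary):

library(tensorflow)

# free state: 4 chains (batch dimension) x 100 parameters
x <- tf$ones(shape(4, 100))

# the log density reduces over the parameter (event) dimension only,
# returning one value per chain
log_prob <- tf$reduce_sum(-x - x ^ 2, axis = 1L)
log_prob$shape  # one value per chain: (4)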

goldingn commented 5 years ago

I couldn't help myself, so here's a reprex of the bug in R tensorflow.

The error doesn't occur when the initial values have known dimensions, so it looks like either some unaccounted-for broadcasting issue, or maybe the error is an artefact of using placeholders (a TF 1 concept) rather than eager execution and tf_function. If it's the latter, it'll require a significant amount of work to make this work with greta (though that's something that will need doing eventually anyway).

library(tensorflow)
library(tfprobability)

# Target distribution is proportional to: `exp(-x (1 + x))`.
unnormalized_log_prob <- function (x) {
  tf$reduce_sum(-x - x ^ 2, axis = 1L)
}

run_chain <- function (kernel) {

  # Run the chain (with burn-in).
  tfp$mcmc$sample_chain(
    num_results = 10L,
    current_state = vals,
    kernel = kernel,
    trace_fn = function (current_state, kernel_results) {kernel_results}
  )

}

# set up kernels
hmc <- tfp$mcmc$HamiltonianMonteCarlo(
    target_log_prob_fn = unnormalized_log_prob,
    num_leapfrog_steps = 3L,
    step_size = 1
)

nuts <- tfp$mcmc$NoUTurnSampler(
  target_log_prob_fn = unnormalized_log_prob,
  max_tree_depth = 3L,
  step_size = 1
)

vals <- tf$compat$v1$placeholder(dtype = tf$float32, shape = shape(NULL, 100))

hmc_output <- run_chain(hmc)

# this errors because of the unknown dimension!
nuts_output <- run_chain(nuts)
 Error in py_call_impl(callable, dots$args, dots$keywords) : 
 ValueError: Cannot convert a partially known TensorShape to a Tensor: (4, ?, 100) 

For completeness, here's the code to execute the inference for HMC:

# defining the number of chains now, via initial values
n_chains <- 3
init <- matrix(rnorm(100 * n_chains), n_chains, 100)
feed_dict <- dict(vals = init)

sess <- tf$compat$v1$Session()
system.time(res <- sess$run(hmc_output, feed_dict = feed_dict))
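
As noted above, the graph builds without error when the initial values have a fully known shape; for example (a sketch, with an arbitrary fixed shape in place of the placeholder):

# sketch: with fully known dimensions the NUTS chain builds (and runs) fine
vals <- tf$zeros(shape(3, 100))
nuts_output <- run_chain(nuts)
sess$run(nuts_output)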

junpenglao commented 5 years ago

I think this is an error with dynamic shape, which we have never tested in the unrolled NUTS - I just filed a bug on the TFP side: https://github.com/tensorflow/probability/issues/530. I will prepare a fix.

goldingn commented 5 years ago

Awesome, thanks @junpenglao!

goldingn commented 5 years ago

Other things we still need to do on the greta side:

A speed test of NUTS vs vanilla HMC (using the above code but with fixed shape) shows a fairly major slowdown on a small dataset with few chains. That's to be expected given how much more complicated the unrolled algorithm must be in order to run NUTS as a static, vectorised graph! Though I would expect there are many cases (large numbers of chains, computationally intensive objective functions, complex posterior geometry) where NUTS is more efficient. I'd be keen to try this method of tuning the marginal distribution of L and see how it compares with vanilla HMC and NUTS.
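
A comparison along these lines can be set up by reusing the reprex definitions above with an arbitrary fixed shape in place of the placeholder, e.g. this sketch:

# sketch of the timing comparison described above, reusing the kernels and
# run_chain() from the reprex, with a fixed shape so the dynamic-shape bug
# is avoided (4 chains x 100 parameters, chosen arbitrarily)
vals <- tf$zeros(shape(4, 100))
hmc_output  <- run_chain(hmc)
nuts_output <- run_chain(nuts)
sess <- tf$compat$v1$Session()
system.time(sess$run(hmc_output))   # vanilla HMC
system.time(sess$run(nuts_output))  # unrolled NUTS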

junpenglao commented 5 years ago

In our internal tests NUTS is at most 2x slower than HMC (note that to compare the two fairly, you need to make sure both take roughly the same number of leapfrog steps). You can narrow the gap by using a large number of chains and running under XLA.

Note that you can already do something similar to eHMC: first run NUTS, then use the histogram of leapfrog_computed (you can get it from the previous_kernel_results) as the empirical distribution of num_leapfrog in HMC (you will need to modify tfp.mcmc.hmc to accept a callable as num_leapfrog).

goldingn commented 5 years ago

That's really useful, thanks!

Yes, I was thinking about how to modify hmc to jitter the number of leapfrog steps. Is it also possible to do this using the trace_fn argument to sample_chain, modifying num_leapfrog before returning it for the next step?

junpenglao commented 5 years ago

Yes, but you will need to write another transition kernel that wraps around hmc, similar to https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/mcmc/simple_step_size_adaptation.py but modifying the number of leapfrog steps instead.

junpenglao commented 5 years ago

Actually, a slightly hacky way to do it is to modify hmc.py, changing the num_leapfrog_steps arg to a callable num_leapfrog_steps_fn, and passing a random function to it:

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# note: HamiltonianMonteCarlo here is the locally modified kernel, which
# accepts a callable num_leapfrog_steps_fn instead of num_leapfrog_steps
kernel = HamiltonianMonteCarlo(
    target_log_prob_fn=tfd.Independent(tfd.Normal(tf.zeros(10), 1.), 1).log_prob,
    num_leapfrog_steps_fn=lambda _: tfd.Categorical(
        # here we have our empirical distribution for L
        probs=np.asarray([.15, .15, .5, .15, .05])).sample() + tf.constant(3, dtype=tf.int32),
    step_size=1.)

@tf.function
def run_chain():
  # Run the chain (with burn-in).
  samples, is_accepted = tfp.mcmc.sample_chain(
      num_results=1000,
      num_burnin_steps=100,
      current_state=tf.zeros(10),
      kernel=kernel,
      trace_fn=lambda _, pkr: pkr.is_accepted)

  sample_stddev = tf.math.reduce_std(samples)
  is_accepted = tf.reduce_mean(tf.cast(is_accepted, dtype=tf.float32))
  return samples, sample_stddev, is_accepted

samples, sample_stddev, is_accepted = run_chain()

You can also do a dynamic number of steps with a fixed path length, which is what PyMC3's current HMC does:

path_length = 5.
kernel = HamiltonianMonteCarlo(
    target_log_prob_fn=tfd.Independent(tfd.Normal(tf.zeros(10), 1.), 1).log_prob,
    num_leapfrog_steps_fn=lambda step_size: tf.cast(path_length/step_size, tf.int32),
    step_size=1.)

junpenglao commented 5 years ago

FYI, dynamic shape should be fixed now: https://github.com/tensorflow/probability/commit/23573a10f203bf740e69f52387bbe8070703eda5

goldingn commented 5 years ago

Awesome, thanks for fixing that so quickly!

I get an error about shape invariants when using this in greta with the nuts branch:

library(greta)
x <- normal(0, 1)
m <- model(x)
draws <- mcmc(m, nuts())
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Input tensor 'mcmc_sample_chain/trace_scan/while/smart_for_loop/while/NoUTurnSampler/.one_step/mul:0' enters the loop with shape (2, ?, 1), but has shape (2, ?, ?) after one iteration. To allow the shape to vary across iterations, use the `shape_invariants` argument of tf.while_loop to specify a less-specific shape. 

But I don't get an error when doing this with the R tensorflow example above, so there's something funky going on and it may be a problem on the greta side. I should be able to take a look at it this week.

goldingn commented 5 years ago

Oh, it looks like this is an issue (in the R tensorflow code also) when the input dimensions are (?, 1), so probably still a problem with dynamic shape @junpenglao?

This code is as before, but I changed shape(NULL, 100) to shape(NULL, 1) when defining the placeholder for the parameters:

library(tensorflow)
library(tfprobability)

# Target distribution is proportional to: `exp(-x (1 + x))`.
unnormalized_log_prob <- function (x) {
  tf$reduce_sum(-x - x ^ 2, axis = 1L)
}

run_chain <- function (kernel) {

  # Run the chain (with burn-in).
  tfp$mcmc$sample_chain(
    num_results = 10L,
    current_state = vals,
    kernel = kernel,
    trace_fn = function (current_state, kernel_results) {kernel_results}
  )

}

# set up kernels
hmc <- tfp$mcmc$HamiltonianMonteCarlo(
  target_log_prob_fn = unnormalized_log_prob,
  num_leapfrog_steps = 3L,
  step_size = 1
)

nuts <- tfp$mcmc$NoUTurnSampler(
  target_log_prob_fn = unnormalized_log_prob,
  max_tree_depth = 3L,
  step_size = 1
)

vals <- tf$compat$v1$placeholder(dtype = tf$float32, shape = shape(NULL, 1))

unnormalized_log_prob(vals)
hmc_output <- run_chain(hmc)

# this errors because of the unknown dimension!
nuts_output <- run_chain(nuts)
 Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Input tensor 'mcmc_sample_chain_1/trace_scan/while/smart_for_loop/while/NoUTurnSampler/.one_step/mul:0' enters the loop with shape (2, ?, 1), but has shape (2, ?, ?) after one iteration. To allow the shape to vary across iterations, use the `shape_invariants` argument of tf.while_loop to specify a less-specific shape. 

junpenglao commented 5 years ago

Thanks for reporting, I will take a look.

goldingn commented 5 years ago

For now, it looks like the nuts branch works with the nightlies, at least provided the parameter space has more than one dimension! Here's a sampler efficiency comparison on my laptop with a 1000D standard normal:

set.seed(2019-08-27)

library(greta)
x <- normal(0, 1, dim = 1000)
m <- model(x, precision = "single")

# get (minimum) effective samples per second during sampling (don't time warmup period)
efficiency <- function(sampler) {
  draws <- mcmc(m, sampler, n_samples = 1, verbose = FALSE)
  time <- system.time(
    draws <- extra_samples(draws, 1000, verbose = FALSE)
  )
  size <- coda::effectiveSize(draws)
  min(size) / time["elapsed"]
}

hmc_efficiency <- efficiency(hmc())
nuts_efficiency <- efficiency(nuts())
rwmh_efficiency <- efficiency(rwmh())
slice_efficiency <- efficiency(slice())

efficiencies <- round(c(hmc_efficiency, nuts_efficiency, rwmh_efficiency, slice_efficiency), 1)
names(efficiencies) <- c("hmc", "nuts", "rwmh", "slice")
efficiencies
  hmc  nuts  rwmh slice 
279.8 137.2   2.8   0.8 

So we're still paying enough of a cost in the NUTS code to halve the number of effective samples per second in this example. It could well be worth that cost in an example with more variable geometry.

junpenglao commented 5 years ago

Hmmm, I can't seem to reproduce the error in Python - is the code you are running in R spelled out correctly like this in Python? https://colab.research.google.com/drive/16Fu1kRdf11PeRy3kqjgdMsWqDo7FQPtV

I tested in tf1 and tf2

junpenglao commented 5 years ago

What TF version are you currently using?

goldingn commented 5 years ago

I just left the office and won't be able to have a play until tomorrow unfortunately.

But reading from my phone, the main difference is that you are using a placeholder with a default, whereas I was running without a default.

Otherwise the only difference I see is the objective using integer 2 instead of float 2 (2 in R is 2. in Python).

goldingn commented 5 years ago

I was running with the current TF1 nightly (cpu, 1.15.?)

junpenglao commented 5 years ago

Thanks! Let me try the one without a default.

junpenglao commented 5 years ago

Yep can confirm the bug using:

import numpy as np
import tensorflow as tf
import tensorflow.compat.v1 as tf1
import tensorflow_probability as tfp

batch_shape = 10

def unnormalized_log_prob(x):
  return tf.reduce_sum(-x-x**2, axis=-1)

for shape in [1, 100]:
#   x_init = tf1.placeholder_with_default(np.zeros((batch_shape, shape), dtype=np.float32), shape=None)
  x_init = tf1.placeholder(dtype=tf.float32, shape=tf.TensorShape([None, 1]))
  unnormalized_log_prob(x_init)

  def run_chain():
    kernel1 = tfp.mcmc.NoUTurnSampler(
        unnormalized_log_prob,
        step_size=0.3)
    trace = tfp.mcmc.sample_chain(num_results=7,
                                  current_state=x_init,
                                  kernel=kernel1,
                                  trace_fn=None)
    return trace

  if tf.executing_eagerly():
    trace = tf.function(run_chain, autograph=False)()
  else:
    with tf1.Session() as sess:
      trace = sess.run(run_chain(), feed_dict={x_init: np.zeros((batch_shape, shape), dtype=np.float32)})
  print(trace.shape)

junpenglao commented 5 years ago

OK, I see what happens - the tensors lose their shape in loop_tree_doubling of NUTS, and it could be fixed by wrapping the output of the function with tf.ensure_shape. However, I am not sure this is worth it, as tf.placeholder will be deprecated in TF2. @goldingn How difficult would it be to update your codebase from tf.placeholder to tf.placeholder_with_default? You can use a zeros tensor and give the batch shape an arbitrary number.

skeydan commented 5 years ago

Hi @goldingn I know I'm late to the party but I'm adding NUTS here: https://github.com/rstudio/tfprobability/pull/75

(((just in case this makes your testing easier)))

goldingn commented 5 years ago

Yeah, it should be fine to use placeholder-with-default in future versions of greta. I'll implement that on the nuts branch soon to make sure. Thanks @junpenglao!

goldingn commented 5 years ago

Thanks @skeydan! Looking forward to tfprobability hitting CRAN so we can depend on it in future releases.

skeydan commented 5 years ago

We'll work on it!

goldingn commented 5 years ago

@junpenglao this works if I set the placeholder-with-default with a completely undefined shape, like:

vals <- tf$compat$v1$placeholder_with_default(
  tf$zeros(shape(1, 1), dtype = tf$float32),
  shape = NULL
)

but not if I provide a default with a partially-defined shape:

tf$compat$v1$placeholder_with_default(
  tf$zeros(shape(1, 1), dtype = tf$float32),
  shape = shape(NULL, 1)
)

Same error as before. It only errors when the second dimension is 1; it's fine if it's anything else.

Equivalent code in your Python example would be:

tf1.placeholder_with_default(np.zeros((batch_shape, shape), dtype=np.float32), shape=tf.TensorShape([None, 1]))

Using placeholder-with-default is fine in greta, but having the tensors have completely undefined shape would definitely be a problem - there are a lot of functions that operate based on the dimensions of the input.

How difficult do you think it would be to get NUTS working with partially-known shape like the other samplers?

junpenglao commented 5 years ago

I see, yep this is a valid point and definitely needs a fix - thank you for digging in! I will send out a fix shortly :-)

goldingn commented 5 years ago

Thanks!

junpenglao commented 5 years ago

BTW, I am curious about:

Using placeholder-with-default is fine in greta, but having the tensors have completely undefined shape would definitely be a problem - there are a lot of functions that operate based on the dimensions of the input.

I would imagine that the dimensions of the input in this case are static, as they are the event_shape of the random variable - maybe it is better practice to pass the static value around instead of inferring the shape at run time? That's also the main idea behind https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/internal/prefer_static.py: basically, you try to move as many of the static values as possible outside of the graph, so that your graph is smaller (faster to compile and potentially faster to execute as well).
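
In R tensorflow terms, the static vs run-time shape distinction looks roughly like this (a sketch; the shapes are arbitrary):

library(tensorflow)

x <- tf$compat$v1$placeholder(tf$float32, shape = shape(NULL, 3))
dim(x)       # static shape, available while building the graph (NA for the unknown dim)
tf$shape(x)  # dynamic shape: a tensor, only known when the graph is run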

goldingn commented 5 years ago

Right, in greta the shapes are generally (always?) known before run time. The only thing that is dynamic is the first dimension of the variable, which we use to set either the number of chains, or the number of samples when computing posteriors over downstream variables. So all dimensions other than one are static in the graph. That unknown dimension is set in one placeholder, and then propagated through the graph.

But greta uses a bunch of R functions to build up a TF graph automatically for the user based on their model. These R functions are mostly to translate existing mathematical operations in R into TensorFlow. There are quite often situations in those functions (e.g. here) where the dimensions of the tensors are needed to work out how to do the operation. But then those dimensions are static in the graph.

So we use the shapes of the tensors in several places to build the graph, but that's problematic when the placeholder has unknown shape, since everything downstream of that has unknown shape and we can't use these functions that are shape-dependent.
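
For example, a hypothetical sketch (not greta's actual code) of the kind of shape-dependent translation function meant here:

# sketch: the static shape is read at graph-building time to decide how
# to perform the operation
tf_rowmeans <- function(x) {
  n_col <- dim(x)[[2]]               # static number of columns, must be known
  tf$reduce_sum(x, axis = 1L) / n_col
}

With a completely unknown placeholder shape there is no static dimension to read, so functions like this can't be built.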

junpenglao commented 5 years ago

I see. Yeah, to have it work with a None shape you would probably need to make those functions take an additional shape argument, which could be quite annoying to specify everywhere. In any case, the PR is merged now so you should see no more errors.

goldingn commented 5 years ago

Thanks @junpenglao! Took me a while to get to this, but the placeholder fix works just fine.

Unfortunately the new sampler is tripping up on greta's more stringent sampler tests: the variances and correlation of a bivariate normal appear to be biased. I'll open an issue on the TFP repo to see if this is real, or if my code is wrong somehow.

goldingn commented 5 years ago

Thanks for the fix @junpenglao! Can confirm it's working as expected with greta's posterior tests.

So the nuts sampler looks like it's good to go, and can be merged into master as soon as NUTS is in a TensorFlow Probability release.

junpenglao commented 5 years ago

WOOHOO!!! We are preparing the 0.8 release, so 🔜