Custom gradients in HMC?

janosh commented 4 years ago

When passing a target_log_prob_fn that's not built from TF primitives (and hence doesn't allow for automatic differentiation) to the HamiltonianMonteCarlo kernel, is there a way to also pass a custom gradient function? Of course, you lose all the performance benefits of autodif, but this would significantly increase the potential areas of application for TFP's HMC.

SiegeLordEx commented 4 years ago

You can use TF's tf.custom_gradient functionality for this. See, e.g. here https://groups.google.com/a/tensorflow.org/g/tfprobability/c/Vr2605ZuBEY/m/m1abXSxmBwAJ for how to implement gradient clipping (not that you necessarily should clip gradients for HMC).

janosh commented 4 years ago

@SiegeLordEx Thanks for the quick reply. The Google group you linked seems to be private. At least I get a 404.

Regarding tf.custom_gradient, the docs state

Args: f: function f(*x) that returns a tuple (y, grad_fn) where:

x is a sequence of Tensor inputs to the function. y is a Tensor or sequence of Tensor outputs of applying TensorFlow operations in f to x.

That last sentence sounds like it might be a problem. In our case y is just a scalar value which can of course be made into a TF tensor but it's not the result of TF operations. Instead it's the output of a Fortran binary. Does tf.custom_gradient still help us in that case?

csuter commented 4 years ago

Here's the public link: https://groups.google.com/a/tensorflow.org/d/msg/tfprobability/Vr2605ZuBEY/m1abXSxmBwAJ

SiegeLordEx commented 4 years ago

@janosh yes, it should be fine.

Here's another example:

def foo_no_grad(x):
  y = np.square(x)
  return tf.constant(y)

@tf.custom_gradient
def foo_custom_grad(x):
  y = np.square(x)
  def grad_fn(dy):
    grad = 2 * np.array(x)
    return grad * dy
  return y, grad_fn

def foo_autodiff(x):
  y = tf.square(x)
  return y

with tf.GradientTape(persistent=True) as tape:
  x = tf.constant(2., dtype=tf.float64)
  tape.watch(x)
  y1 = foo_no_grad(x)**2
  y2 = foo_custom_grad(x)**2
  y3 = foo_autodiff(x)**2

print(tape.gradient(y1, x))  # => None
print(tape.gradient(y2, x))  # => tf.Tensor(32.0, shape=(), dtype=float32)
print(tape.gradient(y3, x))  # => tf.Tensor(32.0, shape=(), dtype=float32)

Note that both foo_no_grad and foo_custom_grad both use numpy to compute the forward computation (and gradient in the latter case), which is normally not differentiable by TensorFlow's autodiff mechanism. foo_custom_grad, through @tf.custom_gradient, is differentiable, even though both the forward and backwards passes are implemented in numpy. You can naturally replace numpy with arbitrary Python code.

One small final note. The code above is Eager-only. If you want things to work inside a tf.function (which you definitely do want to, for speed reasons), you'll need to wrap your non-TF code in tf.py_function, e.g.:

y2 = tf.py_function(func=foo_custom_grad, inp=[x], Tout=tf.float64)**2

janosh commented 4 years ago

@SiegeLordEx That's great advice! Much appreciated. We tried to implement your approach yesterday and ran into this error:

TypeError: 'tensorflow.python.framework.ops.EagerTensor' object is not callable

Full stack trace

Traceback (most recent call last): File "hmc.py", line 118, in trace_fn=trace_fn, File "/home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 564, in __call__ result = self._call(*args, **kwds) File "/home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call self._initialize(args, kwds, add_initializers_to=initializers) File "/home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize *args, **kwds)) File "/home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected graph_function, _, _ = self._maybe_define_function(args, kwargs) File "/home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function graph_function = self._create_graph_function(args, kwargs) File "/home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function capture_by_value=self._capture_by_value), File "/home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func func_outputs = python_func(*func_args, **func_kwargs) File "/home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn return weak_wrapped_fn().__wrapped__(*args, **kwds) File "/home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 968, in wrapper raise e.ag_error_metadata.to_exception(e) TypeError: in converted code: hmc.py:27 sample_chain * return tfp.mcmc.sample_chain(*args, **kwargs) /home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/sample.py:324 sample_chain previous_kernel_results = kernel.bootstrap_results(current_state) /home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/dual_averaging_step_size_adaptation.py:513 bootstrap_results inner_results = self.inner_kernel.bootstrap_results(init_state) /home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/nuts.py:437 bootstrap_results init_state) /home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/internal/leapfrog_integrator.py:368 process_args target_fn, state_parts) /home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/internal/util.py:265 maybe_call_fn_and_grads result, grads = _value_and_gradients(fn, fn_arg_list, result, grads) /home/reag2/anaconda3/envs/mndo/lib/python3.6/site-packages/tensorflow_probability/python/mcmc/internal/util.py:225 _value_and_gradients result = fn(*fn_arg_list) TypeError: 'tensorflow.python.framework.ops.EagerTensor' object is not callable

We've been trying to find the cause for a while but no luck. Our target_log_prob_fn looks like this

@tf.custom_gradient
def target_log_prob_fn(*param_vals):
    log_likelihood = -penalty(param_vals, param_keys, ref_energies, "_tmp_optimizer")

    def grad_fn(*dys):
        grad = jacobian(param_vals, param_keys, ref_energies, "_tmp_optimizer")
        return dys * grad

    return log_likelihood, grad_fn

param_keys, param_values = prepare_data(mols_atoms, start_params)
param_values = [tf.Variable(x) for x in param_values]

target_log_prob_fn = tf.py_function(
    target_log_prob_fn, inp=param_values, Tout=tf.float64
)

where the penalty function is what calls the Fortran binary:

import numpy as np
from tqdm import trange

import mndo

def penalty(param_vals, param_keys, ref_energies, filename):
    """
    params: dict of params for different atoms
    ref_energies: np.array of ground truth atomic energies
    """
    # mndo expects params to be a dict, constructing that here
    # because TFP' HMC requires param_list to be a list
    params = {key[0]: {} for key in param_keys}
    for key, param in zip(param_keys, param_vals):
        atom_type, prop = key
        params[atom_type][prop] = param

    mndo.set_params(params)
    preds = mndo.calculate(filename)
    pred_energies = np.array([p["energy"] for p in preds])

    diff = ref_energies - pred_energies
    mse = (diff ** 2).mean()

    return mse

jacobian just calls penalty:

def jacobian(param_list, *rest, dh=1e-6):
    grad = np.zeros_like(param_list)

    for i in trange(len(param_list)):
        param_list[i] += dh
        forward = penalty(param_list, *rest)

        param_list[i] -= 2 * dh
        backward = penalty(param_list, *rest)

        de = forward - backward
        grad[i] = de / (2 * dh)

        param_list[i] += dh  # undo in-place changes to params for next iteration

    return grad

The full code is available at https://github.com/janosh/bayes-mndo. If you have any ideas how to troubleshoot this, we'd love to hear them!

SiegeLordEx commented 4 years ago

I see two errors.

First, this line: return dys * grad is probably wrong since dys is a tuple, and grad is a numpy array. You'll need to adjust it as appropriate to your problem. The return value of grad_fn should be the vector-Jacobian product dys @ J.

The actual error you're hitting is that tf.py_function, despite its name, does not return a (decorated) function. What it does instead is it evaluates the function passed to it. I.e. in target_log_prob_fn = tf.py_function, target_log_prob_fn is a Tensor, not a function. What you want to do is:

def real_target_log_prob_fn(*param_vals):
  return tf.py_function(
    target_log_prob_fn, inp=param_vals, Tout=tf.float64
  )
hmc = tfp.mcmc.HamiltonianMontecarlo(real_target_log_prob_fn)

janosh commented 4 years ago

@SiegeLordEx Thanks for your continued help!

The actual error you're hitting is that tf.py_function, despite its name, does not return a (decorated) function. What it does instead is it evaluates the function passed to it.

That explains a few things! I was confused by why we needed to specify inp=param_vals to a decorator. 🤦

We were aware of the potential issue with return dys * grad and had already tried several variations. And as expected, after resolving

TypeError: 'tensorflow.python.framework.ops.EagerTensor' object is not callable

we got a new error saying

ValueError: ('custom_gradient function expected to return', 37, 'gradients but returned', 1, 'instead.')

That was easily fixed by return list(dys * grad). So now the log prob function looks like this:

@tf.custom_gradient
def target_log_prob_fn(*param_vals):
    log_likelihood = -penalty(param_vals, param_keys, ref_energies, "_tmp_optimizer")

    def grad_fn(*dys):
        grad = jacobian(param_vals, param_keys, ref_energies, "_tmp_optimizer")
        return list(dys * grad)

    return log_likelihood, grad_fn

def real_target_log_prob_fn(*param_vals):
    return tf.py_function(target_log_prob_fn, inp=param_vals, Tout=tf.float64)

This will run for two iterations and then throw:

tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute AddV2 as input #1(zero-based) was expected to be a float tensor but is a double tensor [Op:AddV2] name: mcmc_sample_chain/trace_scan/while/smart_for_loop/while/dual_averaging_step_size_adaptation___init__/_one_step/add

If we swap out tfp.mcmc.DualAveragingStepSizeAdaptation for tfp.mcmc.SimpleStepSizeAdaptation the error disappears, so maybe there's a bug in DualAveragingStepSizeAdaptation?

SiegeLordEx commented 4 years ago

Yeah, sounds like a bug. You can work around it by casting the step size argument to tf.float64 when constructing the NUTS kernel.

SiegeLordEx commented 4 years ago

Also, while testing this further inside a tf.function context, I noticed that tf.py_function can lose the output shape, which will trip up sample_chain. To fix that, you'll want to do something like:

def real_target_log_prob_fn(*param_vals):
    res = tf.py_function(target_log_prob_fn, inp=param_vals, Tout=tf.float64)
    res.set_shape(param_vals[0].shape[:-1]) # assumes parameter is vector-valued
    return res

janosh commented 4 years ago

You can work around it by casting the step size argument to tf.float64 when constructing the NUTS kernel.

That solved the problem. Would you accept a PR for that?

SiegeLordEx commented 4 years ago

That solved the problem. Would you accept a PR for that?

I think what's missing is that NUTS should have a _prepare_args function like the one in HMC which defers converting things to a Tensor until the dtype of the state is known.

janosh commented 4 years ago

@SiegeLordEx For some reason, HMC was unable to accept a single step on our actual problem. So we tried to recreate the functionality on a simpler problem which is to guess the parameters of the Branin-Hoo function from a bunch of samples of it. And it looks like we're running into the same problem there. Since this issue is much easier to replicate, could you take another look and let us know if you spot anything that's off?

Again, the full code is public.

`hmc-test.py`

from datetime import datetime
import numpy as np
from functools import partial

import tensorflow as tf

from hmc_utils import sample_chain, trace_fn, get_nuts_kernel
from bo_bench import (
    branin_hoo_params,
    sample_branin_hoo,
    branin_hoo_factory,
    branin_hoo_fn,
)
import plotly.graph_objects as go

# Plot the Branin-Hoo surface
xr = np.linspace(-5, 15, 21)
yr = np.linspace(0, 10, 11)
domain = np.stack(np.meshgrid(xr, yr), -1).reshape(-1, 2).T

surface = go.Surface(x=xr, y=yr, z=branin_hoo_fn(domain).reshape(len(yr), -1))
fig = go.Figure(data=[surface])
fig.update_layout(height=700, title_text="Branin-Hoo function")

# Generate random data set
xy, z_true = sample_branin_hoo(100)

def penalty(params):
    z_pred = branin_hoo_factory(*params)(xy)
    # Normally we'd just return -tf.metrics.mse(z_true, z_pred). But to test if
    # custom gradients are the reason HMC isn't accepting steps on MNDO, we
    # explicitly avoid autodiff.
    se = (z_true - z_pred) ** 2
    return se.mean()

def jacobian(params, dh=1e-5):
    """
    Args:
        params: values for each Branin-Hoo param
        dh: small value for numerical gradients
    """
    grad = np.zeros_like(params)

    for i in range(len(params)):
        params[i] += dh
        forward = penalty(params)

        params[i] -= 2 * dh
        backward = penalty(params)

        de = forward - backward
        grad[i] = de / (2 * dh)

        params[i] += dh  # undo in-place changes to params for next iteration
    return grad

@tf.custom_gradient
def custom_grad_target_log_prob_fn(*params):
    log_likelihood = -penalty([x.numpy() for x in params])

    def grad_fn(*dys):
        grad = jacobian([x.numpy() for x in params])
        return list(dys * grad)

    return log_likelihood, grad_fn

def target_log_prob_fn(params):
    res = tf.py_function(custom_grad_target_log_prob_fn, inp=params, Tout=tf.float64)
    # Avoid tripping up sample_chain due to loss of output shape in tf.py_function
    # when used in a tf.function context. https://tinyurl.com/y9ttqdpt
    res.set_shape(params[0].shape[:-1])  # assumes parameter is vector-valued
    return res

# With this function it works. With the above target_log_prob_fn, we can't accept steps.
# def target_log_prob_fn(param_vals):
#     z_pred = branin_hoo_factory(*param_vals)(xy)
#     return -tf.metrics.mse(z_true, z_pred)

now = datetime.now().strftime("%Y.%m.%d-%H:%M:%S")
log_dir = f"runs/hmc-trace/{now}"
summary_writer = tf.summary.create_file_writer(log_dir)

# Casting step_size and init_state needed due to TFP bug
# https://github.com/tensorflow/probability/issues/904#issuecomment-624272845
step_size = tf.cast(1e-1, tf.float64)
init_state = [v * 1.5 for v in branin_hoo_params.values()]
n_adapt_steps = 20

chain, trace, final_kernel_results = sample_chain(
    num_results=40,
    current_state=tf.constant(init_state, tf.float64),
    kernel=get_nuts_kernel(target_log_prob_fn, step_size, n_adapt_steps),
    return_final_kernel_results=True,
    trace_fn=partial(trace_fn, summary_writer=summary_writer),
)
burnin, samples = chain[:n_adapt_steps], chain[n_adapt_steps:]

plot_funcs = [
    [branin_hoo_fn, "Electric"],
    [branin_hoo_factory(*init_state), "Viridis"],  # default colorscale
    [branin_hoo_factory(*chain[-1].numpy()), "Blues"],
]
surfaces = [
    go.Surface(
        x=xr, y=yr, z=fn(domain).reshape(len(yr), -1), colorscale=cs, showscale=False
    )
    for fn, cs in plot_funcs
]
samples_plot = go.Scatter3d(x=xy[0], y=xy[1], z=z_true, mode="markers")
fig = go.Figure(data=[*surfaces, samples_plot])
title = "Branin-Hoo (bottom), initial surface (top), HMC final surface (middle)"
fig.update_layout(height=700, title_text=title)

`bo_bench.py`

import numpy as np

def branin_hoo_factory(a, b, c, r, s, t):
    def branin_hoo(x):
        # f(x) = a(y - b*x^2 + c*x - r)^2 + s (1 - t) cos(x) + s
        return (
            a * (x[1] - b * x[0] ** 2 + c * x[0] - r) ** 2
            + s * (1 - t) * np.cos(x[0])
            + s
        )

    return branin_hoo

branin_hoo_params = dict(
    a=1, b=5.1 / (4 * np.pi ** 2), c=5 / np.pi, r=6, s=10, t=1 / (8 * np.pi)
)

def branin_hoo_fn(x):
    """The Branin-Hoo function is a popular benchmark for Bayesian optimization.
    """
    z = branin_hoo_factory(**branin_hoo_params)(x)
    return z

def sample_branin_hoo(n_samples, domain=[[-5, 15], [0, 10]]):
    """Take samples from the Branin-Hoo function.

    Args:
        n_samples (int): number of samples to draw

    Returns:
        np.array: 2d array of x, y z points
        np.array: 1d array of z points
    """
    [x_min, x_max], [y_min, y_max] = domain

    xy = np.random.uniform(
        low=[x_min, y_min], high=[x_max, y_max], size=(n_samples, 2)
    ).T
    z = branin_hoo_fn(xy)

    return xy, z

junpenglao commented 4 years ago

Seems the step size is too large and the num_adapt_step is not enough to tune it yet, for example, setting the step size to smaller value seems to generate proposal the the HMC kernel will accept:

step_size = tf.cast(1e-5, tf.float64)
init_state = tf.constant([v * 1.5 for v in branin_hoo_params.values()], tf.float64)
n_adapt_steps = 20

kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
    tfp.mcmc.NoUTurnSampler(
        target_log_prob_fn=target_log_prob_fn,
        step_size=step_size),
    num_adaptation_steps=n_adapt_steps
)

pkr = kernel.bootstrap_results(init_state)
next_state, next_pkr = kernel.one_step(init_state, pkr)
next_pkr.inner_results.is_accepted

janosh commented 4 years ago

@junpenglao Sorry, I was doing a sweep over the step size before posting here and forgot to revert the value. step_size and n_adapt_steps are usually set to this:

step_size = tf.cast(1e-3, tf.float64)
init_state = [v * 1.5 for v in branin_hoo_params.values()]
n_adapt_steps = 200

janosh commented 4 years ago

Just as a visual confirmation that HMC isn't moving, here's the target surface (bottom), initial surface (top) and surface parametrized by the final sample in the chain (in a blue white color scale). The initial and final surface coincide.

Screen Shot 2020-05-25 at 17 03 11

If I don't use tf.custom_gradient and tf.py_function and simply rely on autodiff, the final surface coincides with the true Branin-Hoo surface instead. So there must be something wrong in how I use those functions or how they interact with TFP's mcmc module.

surfaces

junpenglao commented 4 years ago

I think the gradient computation is not quite right - it should be the output of jacobian directly. See this reproducible colab: https://colab.research.google.com/drive/10npUrXjLuZLJMt2229U7lF0B34LBbFqj?usp=sharing

SiegeLordEx commented 4 years ago

In general, if you're concerned about the correctness of your analytical gradients, you can check them numerically. TensorFlow has a utility for that: https://www.tensorflow.org/api_docs/python/tf/test/compute_gradient

tensorflow / probability

Custom gradients in HMC? #904

`hmc-test.py`

`bo_bench.py`