tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow
https://www.tensorflow.org/probability/
Apache License 2.0

TensorBoard in TFP and TF v2 #356

Closed janosh closed 5 years ago

janosh commented 5 years ago

There don't appear to be any docs on how to use TensorBoard with TensorFlow Probability. I'm specifically interested in a guide for the 2.0 release. Is this planned or am I missing something?

csuter commented 5 years ago

We don't have any explicit TB features in TFP, but you should be able to monitor anything you're interested in using tf.summary and friends. You can pass any Tensor you want to those.

Is there something in particular you're trying to do? Maybe we can help a bit with idioms.
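For reference, the TF2-style summary calls look roughly like this (a minimal sketch; the log directory and metric name here are just placeholders):

import tensorflow as tf

# Point a summary writer at a log directory (placeholder path).
writer = tf.compat.v2.summary.create_file_writer("/tmp/tfp_demo")

with writer.as_default():
    # Any Tensor can be logged; `step` sets the x-axis position in TensorBoard.
    tf.compat.v2.summary.scalar("my_metric", 0.5, step=0)

writer.close()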

janosh commented 5 years ago

Yes, I'm trying to monitor the progress and final results of training a Bayesian NN with HMC. I tried writing a trace_fn and passing that to tfp.mcmc.sample_chain, i.e. something like

def trace_fn(weights, kernel_results):
    print("weights", weights)
    print("kernel_results", kernel_results)

@tf.function
def run_hmc(
    num_results=100,
    num_burnin_steps=0,
    step_size=0.01,
    current_state=get_initial_state(),
    num_steps_between_results=0,
):
    hmc_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
        tfp.mcmc.HamiltonianMonteCarlo(
            target_log_prob_fn=joint_log_prob_fn,
            num_leapfrog_steps=2,
            step_size=step_size,
            state_gradients_are_stopped=True,
        ),
        num_adaptation_steps=num_results + num_burnin_steps,
    )
    weights, kernel_results = tfp.mcmc.sample_chain(
        num_results=num_results,
        num_burnin_steps=num_burnin_steps,
        current_state=current_state,
        kernel=hmc_kernel,
        trace_fn=trace_fn,
    )
    print("Acceptance rate:", kernel_results.inner_results.is_accepted.numpy().mean())

but whatever signature I use or action I take in that function, it causes the whole operation to come crashing down. Some docs or guidance on this would be much appreciated!

csuter commented 5 years ago

Ah yeah, maybe this is a documentation bug -- check the docs on trace_fn in sample_chain and let me know if you think we could improve the verbiage there.

Basically, trace_fn gets to look at the current chain states and "kernel results" structures at each step, and decide which values to create traces of. These traces are what are returned in the kernel_results return value from sample_chain. So, e.g. if you wanted to keep track of is_accepted, but throw away everything else, you could do

def trace_fn(current_state, kernel_results):
  return kernel_results.inner_results.is_accepted

weights, kernel_results = tfp.mcmc.sample_chain(...)

kernel_results would then be a single Tensor with shape [num_results], containing the value of is_accepted at each of the num_results steps at which a result was computed.

You can also return more complicated nested structures (tuples, namedtuples, dicts [I think...]) from trace_fn.
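E.g. (a sketch, assuming the SimpleStepSizeAdaptation-wrapped HMC kernel from your snippet, whose results expose new_step_size and inner_results):

def trace_fn(current_state, kernel_results):
    # Each returned entry gets stacked across the num_results steps.
    return {
        "is_accepted": kernel_results.inner_results.is_accepted,
        "log_accept_ratio": kernel_results.inner_results.log_accept_ratio,
        "step_size": kernel_results.new_step_size,
    }

weights, trace = tfp.mcmc.sample_chain(..., trace_fn=trace_fn)
# trace["is_accepted"] then has shape [num_results].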

I guess you could also make calls to tf.summary in that function (I'm not sure this won't badly degrade performance), but you do need to return a valid Tensor, otherwise there'll definitely be some crashiness like you're seeing.

@SiegeLordEx may have something to add to what I've said.

SiegeLordEx commented 5 years ago

What @csuter said is correct. Indeed, if you want to track your weights over time on TensorBoard, you'd place tf.summary calls inside trace_fn, something like this (untested):

def trace_fn(weights, results):
  with tf.compat.v2.summary.record_if(tf.equal(results.step % 100, 0)):
    tf.compat.v2.summary.histogram("weights", weights, step=tf.cast(results.step, tf.int64))
  return ()

Note how I set it up to record every 100 steps, for efficiency, but you can do whatever suits your needs.

It might also make sense to run sample_chain without summaries, and then iterate over the return values of sample_chain (I can imagine this playing nicer on the GPU), but obviously you'd lose the in-progress display of your statistics.
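That post-hoc variant could look roughly like this (just a sketch, untested; it reuses hmc_kernel, current_state and num_results from the snippet above, and the log directory is a placeholder):

# Sample first, with no summary ops inside the kernel's while loop.
weights, is_accepted = tfp.mcmc.sample_chain(
    num_results=num_results,
    current_state=current_state,
    kernel=hmc_kernel,
    trace_fn=lambda _, results: results.inner_results.is_accepted,
)

# Then iterate over the finished chain and write summaries after the fact.
writer = tf.compat.v2.summary.create_file_writer("/tmp/hmc_posthoc")
with writer.as_default():
    for step in range(0, num_results, 100):
        tf.compat.v2.summary.histogram("weights", weights[step], step=step)
writer.close()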

brianwa84 commented 5 years ago

I don't expect summaries inside the trace fn to work because they sit inside a while-loop control flow context. Summaries must be fetchable at the top level of the graph. Are you running a chain for so long that you want summaries out mid-execution? For that I think you would want to run sample_chain for n steps, output a summary, then resume sampling, which IIRC is supported well.

SiegeLordEx commented 5 years ago

That's true only of V1 summaries; V2 summaries are just regular ops with a side effect of writing to a file. Here's a complete working example:

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

dist = tfd.Normal(0., 1.)

kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    tfp.mcmc.HamiltonianMonteCarlo(dist.log_prob, step_size=0.1, num_leapfrog_steps=3),
    num_adaptation_steps=100,
)

summary_writer = tf.compat.v2.summary.create_file_writer('/tmp/summary_chain', flush_millis=10000)

def trace_fn(state, results):
  with tf.compat.v2.summary.record_if(tf.equal(results.step % 10, 1)):
    tf.compat.v2.summary.scalar("state", state, step=tf.cast(results.step, tf.int64))
  return ()

with summary_writer.as_default():
  chain, _ = tfp.mcmc.sample_chain(kernel=kernel, current_state=0., num_results=200, trace_fn=trace_fn)

summary_writer.close()

There is a bit of an annoyance in that the summaries use the name scope of where they are as the name, which leaks a whole bunch of internal implementation details of sample_chain... I don't have a solution for this yet.

janosh commented 5 years ago

@SiegeLordEx I found the same thing: creating summaries in trace_fn seems to work well. I also didn't notice any slow-down, but I'll check that more carefully later. However, both with my own implementation and your code, I'm unable to open the summary in TensorBoard. In both cases tensorboard --logdir ./tmp/summary_chain throws

Exception in thread Reloader:
AttributeError: module 'tensorflow._api.v2.compat.v1' has no attribute 'pywrap_tensorflow'

followed by

W0410 17:26:13.712886 123145489154048 core_plugin.py:172] Unable to get first event timestamp for run .: No event timestamp could be found

and an empty TB dashboard. I'm running the latest tb-nightly. Any ideas what's causing this?

janosh commented 5 years ago

@brianwa84 That's a great suggestion. I'll try that as soon as I have a working implementation.

SiegeLordEx commented 5 years ago

@janosh Not sure, my TensorBoard works okay. I'd try things out without TFP, just:

summary_writer = tf.compat.v2.summary.create_file_writer(...)
with summary_writer.as_default():
    tf.compat.v2.summary.scalar(...)
summary_writer.close()

And make sure that works. Maybe it's just some TF2 incompatibility nonsense which has nothing to do with TFP.

janosh commented 5 years ago

Same problem without tfp. I'll file another issue in the main repo.

janosh commented 5 years ago

@brianwa84 What would be the best way of resuming the calculation? Just pass the last state of the previous run into the next one and then concatenate the results of all runs for final diagnostics? E.g.

hmc_kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn, step_size=step_size, num_leapfrog_steps=num_leapfrog_steps
)
adaptive_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    hmc_kernel, num_adaptation_steps=num_adaptation_steps
)

chain1, (_, kernel_results1) = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=current_state,
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)

# Some mid-execution diagnostics

chain2, (_, kernel_results2) = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=chain1[-1],
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)

chain = tf.concat((chain1, chain2), 0)

But then how do I merge the kernel results kernel_results1 and kernel_results2? They are each SimpleStepSizeAdaptation results structures, and it appears as though I would have to merge their attributes like adaptation_rate, new_step_size, inner_results.is_accepted, inner_results.log_accept_ratio, etc. individually. That seems like a lot of manual work and not so much like "supported well", so I'm guessing I'm doing something wrong?

brianwa84 commented 5 years ago

Something like that:

state, kernel_results = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=current_state,
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)
chain1, (_, kernel_results1) = state, kernel_results

# Some mid-execution diagnostics
state, kernel_results = tfp.mcmc.sample_chain(
    kernel=adaptive_kernel,
    current_state=state[-1],  # or tf.[compat.v2.]nest.map_structure(lambda x: x[-1], state)
    previous_kernel_results=kernel_results,   # This line is new.
    num_results=num_results,
    num_steps_between_results=num_steps_between_results,
    trace_fn=partial(trace_fn, summary_freq=5),
)
chain2, (_, kernel_results2) = state, kernel_results

chain = tf.concat((chain1, chain2), 0)

brianwa84 commented 5 years ago

Re: how to merge the kernel results: you can use tf.nest.map_structure to map tf.concat over everything in there.
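Concretely, something like this (a sketch, assuming kernel_results1 and kernel_results2 are the traced results from the two sample_chain calls above):

# Concatenate every traced field of the two results structures along the
# leading (num_results) axis; the two structures must have matching layouts.
merged_results = tf.nest.map_structure(
    lambda r1, r2: tf.concat([r1, r2], axis=0),
    kernel_results1,
    kernel_results2,
)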

brianwa84 commented 5 years ago

@SiegeLordEx should what I put above work?

SiegeLordEx commented 5 years ago

Thanks @brianwa84. Yes, it's something like that. Here's a 'loop' version of the above:

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
trace_blocks = []
for i in range(num_blocks):
    chain, trace, kernel_results = tfp.mcmc.sample_chain(
        num_results=num_results,
        current_state=current_state,
        previous_kernel_results=kernel_results,
        kernel=kernel,
        trace_fn=...,
        return_final_kernel_results=True,
    )

    # Do your partial analysis here.

    current_state = tf.nest.map_structure(lambda x: x[-1], chain)
    chain_blocks.append(chain)
    trace_blocks.append(trace)

full_chain = tf.nest.map_structure(lambda *parts: tf.concat(parts, axis=0), *chain_blocks)
full_trace = tf.nest.map_structure(lambda *parts: tf.concat(parts, axis=0), *trace_blocks)

# full_trace/full_chain now contain num_blocks * num_results elements

janosh commented 5 years ago

@SiegeLordEx Why do you need kernel_results = kernel.bootstrap_results(current_state)? Wouldn't kernel_results = None work?

Also, what's the advantage of

current_state = tf.nest.map_structure(lambda x: x[-1], chain)

over

current_state = chain[-1]

SiegeLordEx commented 5 years ago

kernel_results = None will work, but I wanted to illustrate the loop such that it had no Python control flow in it. Eschewing Python control flow lets us use tf.function efficiently to speed up that computation. It's a minor point as far as the example goes, but it's just more natural to me to write it that way.
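As a sketch of that point (untested; num_results, kernel, and trace_fn as in the snippets above, and sample_block is just a hypothetical helper name), the sampling block itself can go inside tf.function and be driven by the plain Python loop:

@tf.function
def sample_block(current_state, previous_kernel_results):
    # One resumable block of sampling; no Python control flow inside.
    return tfp.mcmc.sample_chain(
        num_results=num_results,
        current_state=current_state,
        previous_kernel_results=previous_kernel_results,
        kernel=kernel,
        trace_fn=trace_fn,
        return_final_kernel_results=True,
    )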

tfp.mcmc supports list-valued chain states, so current_state might actually be a list of Tensors, each of which needs to be indexed separately. It's just a bit more general that way.
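For example (a toy illustration) with a two-part state:

# A list-valued chain state, e.g. [weights, biases]. sample_chain then returns
# `chain` as a list of Tensors, each with a leading num_results axis.
chain = [tf.zeros([100, 10]), tf.zeros([100])]  # stand-in for sample_chain output

# chain[-1] would just take the last *list element*; map_structure instead
# takes the last sample of each state part.
current_state = tf.nest.map_structure(lambda x: x[-1], chain)
# -> [Tensor with shape [10], Tensor with shape []]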

viotemp1 commented 4 years ago

For loss in TB:

def write_TB_metrics(metric={}, step=0, metrics_file_writer=None):
    with metrics_file_writer.as_default():
        with name_scope(tb_metrics_name_scope):
            for key in metric.keys():
                value = metric[key]
                summary.scalar(key, value, step=step)
    metrics_file_writer.flush()

metrics_file_writer = summary.create_file_writer(LOG_DIR_METRICS)

@tf.function()
def trace_fn(traceable_quantities):
    if write_metrics_tb:
        write_TB_metrics(
            metric={'loss': traceable_quantities.loss},
            step=traceable_quantities.step,
            metrics_file_writer=metrics_file_writer,
        )
    print("step", traceable_quantities.step)
    # print("loss", traceable_quantities.loss)
    return traceable_quantities.loss

# ...
loss_curve = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=target_log_prob_fn,
    surrogate_posterior=variational_posteriors,
    optimizer=optimizer,
    num_steps=num_variational_steps,
    trace_fn=trace_fn,
    seed=42,
)

[Screenshot of the resulting TensorBoard output]

merplumander commented 3 years ago

About resuming:

I had hoped that, when setting random seeds, resuming and running the full chain from the beginning would produce the same results, but it doesn't. Is this expected behavior or am I doing something wrong?

Here's a minimal example building on the code that @SiegeLordEx provided (Python 3.6.5; tensorflow==2.3.1; tensorflow-probability==0.11.1):

def target_log_prob(x):
    return -x - x ** 2.0

current_state = 1.0
tf.random.set_seed(0)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob, step_size=0.01, num_leapfrog_steps=5
)
kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
    kernel, num_adaptation_steps=0
)

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
for i in range(2):
    chain, trace, kernel_results = tfp.mcmc.sample_chain(
        num_results=3,
        current_state=current_state,
        previous_kernel_results=kernel_results,
        trace_fn=trace_fn,
        return_final_kernel_results=True,
        kernel=kernel,
    )

    current_state = tf.nest.map_structure(lambda x: x[-1], chain)
    chain_blocks.append(chain)

full_chain = tf.nest.map_structure(
    lambda *parts: tf.concat(parts, axis=0), *chain_blocks
)
full_chain
==> <tf.Tensor: shape=(6,), dtype=float32, numpy=
array([ 0.95076746,  0.12316042,  0.5397935 , -0.21367444, -0.21657643,
       -1.0244453 ], dtype=float32)>

# Let's do it all again but now without a break in between:

current_state = 1.0
tf.random.set_seed(0)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob, step_size=0.01, num_leapfrog_steps=5
)
kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
    kernel, num_adaptation_steps=0
)

kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
chain, trace, kernel_results = tfp.mcmc.sample_chain(
    num_results=6,
    current_state=current_state,
    previous_kernel_results=kernel_results,
    trace_fn=trace_fn,
    return_final_kernel_results=True,
    kernel=kernel,
)

chain_blocks.append(chain)

full_chain = tf.nest.map_structure(
    lambda *parts: tf.concat(parts, axis=0), *chain_blocks
)
full_chain
==> <tf.Tensor: shape=(6,), dtype=float32, numpy=
array([0.95076746, 0.12316042, 0.5397935 , 1.1745309 , 0.37639475,
       0.19865556], dtype=float32)>

So the two chains produce the same samples up to step three (as they must since I set a random seed), but produce different samples after resuming. Is there a way to make these two produce equivalent results by setting some internal seeds?

I'd appreciate any feedback :)