Closed janosh closed 5 years ago
We don't have any explicit TB features in TFP, but you should be able to monitor anything you're interested in using tf.summary and friends. You can pass any Tensor you want to those.
Is there something in particular you're trying to do? Maybe we can help a bit with idioms.
Yes, I'm trying to monitor the progress and final results of training a Bayesian NN with HMC. I tried writing a trace_fn and passing that to tfp.mcmc.sample_chain, i.e. something like
def trace_fn(weights, kernel_results):
    print("weights", weights)
    print("kernel_results", kernel_results)

@tf.function
def run_hmc(
    num_results=100,
    num_burnin_steps=0,
    step_size=0.01,
    current_state=get_initial_state(),
    num_steps_between_results=0,
):
    hmc_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
        tfp.mcmc.HamiltonianMonteCarlo(
            target_log_prob_fn=joint_log_prob_fn,
            num_leapfrog_steps=2,
            step_size=step_size,
            state_gradients_are_stopped=True,
        ),
        num_adaptation_steps=num_results + num_burnin_steps,
    )
    weights, kernel_results = tfp.mcmc.sample_chain(
        num_results=num_results,
        num_burnin_steps=num_burnin_steps,
        current_state=current_state,
        kernel=hmc_kernel,
        trace_fn=trace_fn,
    )
    print("Acceptance rate:", kernel_results.inner_results.is_accepted.numpy().mean())
but whatever signature I use or action I take in that function, it causes the whole operation to come crashing down. Some docs or guidance on this would be much appreciated!
Ah yeah, maybe this is a documentation bug -- check the docs on trace_fn in sample_chain and let me know if you think we could improve the verbiage there.
Basically, trace_fn gets to look at the current chain state and "kernel results" structure at each step, and decides which values to create traces of. These traces are what is returned in the kernel_results return value from sample_chain. So, e.g., if you wanted to keep track of is_accepted but throw away everything else, you could do
def trace_fn(current_state, kernel_results):
    return kernel_results.inner_results.is_accepted
weights, kernel_results = tfp.mcmc.sample_chain(...)
kernel_results would then be a single Tensor with shape [num_results], containing the value of is_accepted at each of the num_results steps at which a result was computed.
You can also return more complicated nested structures (tuples, namedtuples, dicts [I think...]) from trace_fn.
I guess you could also make calls to tf.summary in that function (I'm not sure whether this will badly degrade performance), but you do need to return a valid Tensor, otherwise there'll definitely be some crashiness like you're seeing.
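The contract described above can be mimicked in plain Python. The sketch below is a toy, not TFP: toy_sample_chain, toy_metropolis_step and ToyKernelResults are hypothetical names made up for illustration, and a random-walk Metropolis step stands in for HMC. It only aims to show that whatever trace_fn returns at each step comes back stacked along the step axis.

```python
import math
import random
from collections import namedtuple

# Hypothetical stand-in for a kernel-results structure (not TFP's class).
ToyKernelResults = namedtuple("ToyKernelResults", ["is_accepted", "log_accept_ratio"])


def target_log_prob(x):
    # Standard normal log-density, up to an additive constant.
    return -0.5 * x * x


def toy_metropolis_step(state, rng, scale=0.5):
    """One random-walk Metropolis step; returns (new_state, results)."""
    proposal = state + rng.gauss(0.0, scale)
    log_accept_ratio = target_log_prob(proposal) - target_log_prob(state)
    # 1 - random() lies in (0, 1], so the log is always defined.
    is_accepted = math.log(1.0 - rng.random()) < log_accept_ratio
    new_state = proposal if is_accepted else state
    return new_state, ToyKernelResults(is_accepted, log_accept_ratio)


def toy_sample_chain(num_results, current_state, trace_fn, seed=0):
    """Calls trace_fn once per step and stacks its outputs key by key."""
    rng = random.Random(seed)
    states, per_step_traces = [], []
    state = current_state
    for _ in range(num_results):
        state, results = toy_metropolis_step(state, rng)
        states.append(state)
        per_step_traces.append(trace_fn(state, results))
    # Only flat dicts are handled in this toy version.
    trace = {key: [t[key] for t in per_step_traces] for key in per_step_traces[0]}
    return states, trace


def trace_fn(current_state, kernel_results):
    # Keep two quantities and drop everything else.
    return {
        "is_accepted": kernel_results.is_accepted,
        "log_accept_ratio": kernel_results.log_accept_ratio,
    }


chain, trace = toy_sample_chain(num_results=50, current_state=0.0, trace_fn=trace_fn)
acceptance_rate = sum(trace["is_accepted"]) / len(trace["is_accepted"])
```

Each entry of trace has one value per step, which mirrors the [num_results] leading dimension mentioned above.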
@SiegeLordEx may have something to add to what I've said.
What @csuter said is correct. Indeed, if you want to track your weights over time on TensorBoard, you'd place tf.summary calls inside trace_fn, something like this (untested):
def trace_fn(weights, results):
    with tf.compat.v2.summary.record_if(tf.equal(results.step % 100, 0)):
        # histogram() takes a name as its first argument, then the data.
        tf.compat.v2.summary.histogram("weights", weights, step=results.step)
    return ()
Note how I set it up to record every 100 steps, for efficiency, but you can do whatever suits your needs.
It might also make sense to run sample_chain without summaries, and then iterate over the return values of sample_chain (I can imagine this playing nicer on the GPU), but obviously you'd lose the in-progress display of your statistics.
I don't expect summaries inside the trace fn to work because they sit inside a while control flow context. Summaries must be fetchable at the top level of the graph. Are you running a chain for so long that you want summaries out mid-execution? For that I think you would want to run sample_chain for n steps, output a summary, then resume sampling, which iirc is supported well.
That's true only of V1 summaries; V2 summaries are just regular ops with the side effect of writing to a file. Here's a complete working example:
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
dist = tfd.Normal(0., 1.)
kernel = tfp.mcmc.SimpleStepSizeAdaptation(tfp.mcmc.HamiltonianMonteCarlo(dist.log_prob, step_size=0.1, num_leapfrog_steps=3), num_adaptation_steps=100)
summary_writer = tf.compat.v2.summary.create_file_writer('/tmp/summary_chain', flush_millis=10000)
def trace_fn(state, results):
    with tf.compat.v2.summary.record_if(tf.equal(results.step % 10, 1)):
        tf.compat.v2.summary.scalar("state", state, step=tf.cast(results.step, tf.int64))
    return ()
with summary_writer.as_default():
    chain, _ = tfp.mcmc.sample_chain(kernel=kernel, current_state=0., num_results=200, trace_fn=trace_fn)
summary_writer.close()
There is a bit of an annoyance in that the summaries take their names from the name scope they are created in, which leaks a whole bunch of internal implementation details of sample_chain... I don't have a solution for this yet.
@SiegeLordEx I found the same thing, creating summaries in trace_fn seems to work well. I also didn't notice any slow-down but I'll check that more carefully later. However, both with my own implementation and your code, I'm unable to open the summary in TensorBoard. In both cases tensorboard --logdir ./tmp/summary_chain throws
Exception in thread Reloader:
AttributeError: module 'tensorflow._api.v2.compat.v1' has no attribute 'pywrap_tensorflow'
followed by
W0410 17:26:13.712886 123145489154048 core_plugin.py:172] Unable to get first event timestamp for run .: No event timestamp could be found
and an empty TB dashboard. I'm running the latest tb-nightly. Any ideas what's causing this?
@brianwa84 That's a great suggestion. I'll try that as soon as I have a working implementation.
@janosh Not sure, my TensorBoard works okay. I'd try things out without TFP, just:
summary_writer =
with summary_writer.as_default():
    tf.compat.v2.summary.scalar(...)
summary_writer.close()
And make sure that works. Maybe it's just some TF2 incompatibility nonsense which has nothing to do with TFP.
Same problem without tfp. I'll file another issue in the main repo.
@brianwa84 What would be the best way of resuming the calculation? Just pass the last state of the previous run into the next one and then concatenate the results of all runs for final diagnostics? E.g.
hmc_kernel = tfp.mcmc.HamiltonianMonteCarlo(
target_log_prob_fn, step_size=step_size, num_leapfrog_steps=num_leapfrog_steps
)
adaptive_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
hmc_kernel, num_adaptation_steps=num_adaptation_steps
)
chain1, (_, kernel_results1) = tfp.mcmc.sample_chain(
kernel=adaptive_kernel,
current_state=current_state,
num_results=num_results,
num_steps_between_results=num_steps_between_results,
trace_fn=partial(trace_fn, summary_freq=5),
)
# Some mid-execution diagnostics
chain2, (_, kernel_results2) = tfp.mcmc.sample_chain(
kernel=adaptive_kernel,
current_state=chain1[-1],
num_results=num_results,
num_steps_between_results=num_steps_between_results,
trace_fn=partial(trace_fn, summary_freq=5),
)
chain = tf.concat((chain1, chain2), 0)
But then how to merge the kernel results kernel_results1 and kernel_results2? They are each classes (SimpleStepSizeAdaptation) and it appears as though I would have to merge their attributes like adaptation_rate, new_step_size, inner_results.is_accepted, inner_results.log_accept_ratio, etc. individually. That seems like a lot of manual work and not so much like "supported well", so I'm guessing I'm doing something wrong?
Something like that:
state, kernel_results = tfp.mcmc.sample_chain(
kernel=adaptive_kernel,
current_state=current_state,
num_results=num_results,
num_steps_between_results=num_steps_between_results,
trace_fn=partial(trace_fn, summary_freq=5),
)
chain1, (_, kernel_results1) = state, kernel_results
# Some mid-execution diagnostics
state, kernel_results = tfp.mcmc.sample_chain(
kernel=adaptive_kernel,
current_state=state[-1], # or tf.[compat.v2.]nest.map_structure(lambda x: x[-1], state)
previous_kernel_results=kernel_results, # This line is new.
num_results=num_results,
num_steps_between_results=num_steps_between_results,
trace_fn=partial(trace_fn, summary_freq=5),
)
chain2, (_, kernel_results2) = state, kernel_results
chain = tf.concat((chain1, chain2), 0)
Re: how to merge the kernel results: you can use tf.nest.map_structure to map tf.concat over everything in there.
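To make the idea concrete without depending on TensorFlow, here is a plain-Python sketch of the same pattern. Everything in it is an illustrative assumption: toy_map_structure is a simplified stand-in for tf.nest.map_structure (unlike the real one, it treats lists as leaves), the namedtuples are hypothetical and only loosely modelled on the fields named above, and Python lists stand in for Tensors.

```python
from collections import namedtuple

# Hypothetical result structures; lists stand in for Tensors.
InnerResults = namedtuple("InnerResults", ["is_accepted", "log_accept_ratio"])
AdaptationResults = namedtuple("AdaptationResults", ["inner_results", "new_step_size"])


def toy_map_structure(fn, *structures):
    """Applies fn to matching leaves of identically shaped namedtuple trees."""
    first = structures[0]
    if isinstance(first, tuple) and hasattr(first, "_fields"):
        # Recurse field by field, then rebuild the same namedtuple type.
        fields = (toy_map_structure(fn, *parts) for parts in zip(*structures))
        return type(first)(*fields)
    return fn(*structures)  # leaf: a list of per-step values


def concat(*parts):
    """Concatenates per-step lists along the step axis."""
    merged = []
    for part in parts:
        merged.extend(part)
    return merged


kernel_results1 = AdaptationResults(
    InnerResults([True, False], [-0.1, -2.0]), [0.010, 0.012]
)
kernel_results2 = AdaptationResults(
    InnerResults([True, True], [0.3, -0.2]), [0.011, 0.013]
)

merged = toy_map_structure(concat, kernel_results1, kernel_results2)
print(merged.inner_results.is_accepted)  # [True, False, True, True]
```

The merge is written once and applies to every field, so none of the attributes has to be concatenated by hand.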
@SiegeLordEx should what I put above work?
Thanks @brianwa84. Yes, it's something like that. Here's a 'loop' version of the above:
kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
trace_blocks = []
for i in range(num_blocks):
    chain, trace, kernel_results = tfp.mcmc.sample_chain(
        kernel=kernel,
        num_results=num_results,
        current_state=current_state,
        previous_kernel_results=kernel_results,
        trace_fn=...,
        return_final_kernel_results=True,
    )
    # Do your partial analysis here.
    # Restart the next block from the last sample of this one.
    current_state = tf.nest.map_structure(lambda x: x[-1], chain)
    chain_blocks.append(chain)
    trace_blocks.append(trace)
full_chain = tf.nest.map_structure(lambda *parts: tf.concat(parts, axis=0), *chain_blocks)
full_trace = tf.nest.map_structure(lambda *parts: tf.concat(parts, axis=0), *trace_blocks)
# full_trace/full_chain now contain num_blocks * num_results elements
@SiegeLordEx Why do you need kernel_results = kernel.bootstrap_results(current_state)? Wouldn't kernel_results = None work?
Also, what's the advantage of
current_state = tf.nest.map_structure(lambda x: x[-1], chain)
over
current_state = chain[-1]
kernel_results = None will work, but I wanted to illustrate the loop such that it had no Python control flow in it. Eschewing Python control flow lets us use tf.function efficiently to speed up that computation. It's a minor point as far as the example goes, but it's just more natural to me to write it that way.
tfp.mcmc supports list-valued chain states, so current_state might actually be a list of Tensors, each of which needs to be indexed separately. It's just a bit more general that way.
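A small plain-Python illustration of that difference, with nested lists standing in for Tensors and a made-up two-part state (the values and the "weights"/"bias" reading are assumptions for the example only). The list comprehension plays the role of tf.nest.map_structure(lambda x: x[-1], chain).

```python
# Hypothetical chain for a two-part state, traced for three steps.
# Each inner list holds one state part's samples, one entry per step.
chain = [
    [0.1, 0.2, 0.3],     # e.g. "weights" at steps 0, 1, 2
    [10.0, 20.0, 30.0],  # e.g. "bias" at steps 0, 1, 2
]

# Indexing the outer list picks the last state PART, not the last step.
naive = chain[-1]

# Indexing each part separately picks the last step of every part.
last_state = [part[-1] for part in chain]

print(naive)       # [10.0, 20.0, 30.0]
print(last_state)  # [0.3, 30.0]
```

For a single-Tensor state the two spellings agree, which is why chain[-1] works in the simple case.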
For loss in TB:
################################################################
def write_TB_metrics(metric={}, step=0, metrics_file_writer=None):
    with metrics_file_writer.as_default():
        with name_scope(tb_metrics_name_scope):
            for key in metric.keys():
                value = metric[key]
                summary.scalar(key, value, step=step)
        metrics_file_writer.flush()
metrics_file_writer = summary.create_file_writer(LOG_DIR_METRICS)
################################################################
def trace_fn(traceable_quantities):
    if write_metrics_tb:
        write_TB_metrics(
            metric={'loss': traceable_quantities.loss},
            step=traceable_quantities.step,
            metrics_file_writer=metrics_file_writer,
        )
    #print("loss", traceable_quantities.loss)
    return traceable_quantities.loss
################################################################
...
loss_curve = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=target_log_prob_fn,
    surrogate_posterior=variational_posteriors,
    optimizer=optimizer,
    num_steps=num_variational_steps,
    trace_fn=trace_fn,
    seed=42
)
About resuming:
I had hoped that, when setting random seeds, resuming and running the full chain from the beginning would produce the same results, but it doesn't. Is this expected behavior or am I doing something wrong?
Here's a minimal example building on the code that @SiegeLordEx provided (Python 3.6.5; tensorflow==2.3.1; tensorflow-probability==0.11.1):
def target_log_prob(x):
    return -x - x ** 2.0
current_state = 1.0
tf.random.set_seed(0)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
target_log_prob_fn=target_log_prob, step_size=0.01, num_leapfrog_steps=5
)
kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
kernel, num_adaptation_steps=0
)
kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
for i in range(2):
    chain, trace, kernel_results = tfp.mcmc.sample_chain(
        num_results=3,
        current_state=current_state,
        previous_kernel_results=kernel_results,
        trace_fn=trace_fn,
        return_final_kernel_results=True,
        kernel=kernel,
    )
    current_state = tf.nest.map_structure(lambda x: x[-1], chain)
    chain_blocks.append(chain)
full_chain = tf.nest.map_structure(
lambda *parts: tf.concat(parts, axis=0), *chain_blocks
)
full_chain
==> <tf.Tensor: shape=(6,), dtype=float32, numpy=
array([ 0.95076746, 0.12316042, 0.5397935 , -0.21367444, -0.21657643,
-1.0244453 ], dtype=float32)>
# Let's do it all again but now without a break in between:
current_state = 1.0
tf.random.set_seed(0)
kernel = tfp.mcmc.HamiltonianMonteCarlo(
target_log_prob_fn=target_log_prob, step_size=0.01, num_leapfrog_steps=5
)
kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
kernel, num_adaptation_steps=0
)
kernel_results = kernel.bootstrap_results(current_state)
chain_blocks = []
chain, trace, kernel_results = tfp.mcmc.sample_chain(
num_results=6,
current_state=current_state,
previous_kernel_results=kernel_results,
trace_fn=trace_fn,
return_final_kernel_results=True,
kernel=kernel,
)
chain_blocks.append(chain)
full_chain = tf.nest.map_structure(
lambda *parts: tf.concat(parts, axis=0), *chain_blocks
)
full_chain
==> <tf.Tensor: shape=(6,), dtype=float32, numpy=
array([0.95076746, 0.12316042, 0.5397935 , 1.1745309 , 0.37639475,
0.19865556], dtype=float32)>
So the two chains produce the same samples up to step three (as they must since I set a random seed), but produce different samples after resuming. Is there a way to make these two produce equivalent results by setting some internal seeds?
I'd appreciate any feedback :)
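The thread does not settle this question, and the sketch below is not a diagnosis of what TFP does internally. It is a library-independent analogy in plain Python (run_block is a hypothetical helper, and a Gaussian random walk stands in for the sampler): a resumed chain reproduces the unbroken one only if the second block continues the same random stream, and diverges if the second block draws from a different stream.

```python
import random


def run_block(num_results, state, rng):
    """Toy 'sampler': a Gaussian random walk driven by the given generator."""
    samples = []
    for _ in range(num_results):
        state = state + rng.gauss(0.0, 1.0)
        samples.append(state)
    return samples


# Reference: six steps without a break.
unbroken = run_block(6, 1.0, random.Random(0))

# Resumed, carrying the SAME generator object across the break.
rng = random.Random(0)
first = run_block(3, 1.0, rng)
resumed_same_stream = first + run_block(3, first[-1], rng)

# Resumed, but the second block draws from a different stream
# (seed 1 is an arbitrary choice for this illustration).
first = run_block(3, 1.0, random.Random(0))
resumed_new_stream = first + run_block(3, first[-1], random.Random(1))

print(unbroken == resumed_same_stream)  # True
print(unbroken == resumed_new_stream)   # False
```

The third case shows the same pattern as the output above: identical samples up to the break and different ones afterwards. Whether the two sample_chain calls actually draw from different streams would need to be checked against the TFP version in use.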
There don't appear to be any docs on how to use TensorBoard with TensorFlow Probability. I'm specifically interested in a guide for the 2.0 release. Is this planned or am I missing something?