foxik opened this issue 2 years ago
Adding the author of the commit removing documented functionality: @jonycgn.
Alternatively, some variant of `write_scalar_summaries` could be moved to the `on_train_batch_end` method of the `TensorBoard` callback.
The mentioned commit didn't remove functionality from the `TensorBoard` callback, but from `Model`. The problem with these summaries was that they would be written unconditionally after each training step, regardless of whether the `TensorBoard` callback was even used. So reverting wouldn't make the code behave correctly.
The functionality should be corrected in the callback. It is unclear to me why it doesn't exist. It seems the `update_freq` parameter is simply ignored unless it is set to `'epoch'`.
Adding @rchao who might be more knowledgeable on this.
Hi,
thanks for the answer! The commit actually did remove functionality from the `TensorBoard` callback -- the callback just sets up the correct `tf.summary.record_if` condition in https://github.com/keras-team/keras/blob/46121eed08d0feef743eacef3b66206df45cf656/keras/callbacks.py#L2362-L2373 but it relied on the `write_scalar_summaries` executed in `keras/engine/training.py`, the one you removed. So once this "generic summary emitter" was removed, it broke the batch-level logs of the `TensorBoard` callback.
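For reference, the gating pattern that code sets up looks roughly like the following minimal standalone sketch (not the actual Keras source; the path, the `update_freq` value, and the step variable are placeholders):

```python
import tensorflow as tf

update_freq = 100  # assumed value for illustration
writer = tf.summary.create_file_writer("/tmp/logs/train")
step = tf.Variable(0, dtype=tf.int64)

def should_record():
    # Record summaries only every `update_freq` steps.
    return tf.equal(step % update_freq, 0)

with writer.as_default(), tf.summary.record_if(should_record):
    # Any tf.summary.scalar(...) executed here is written only when
    # should_record() evaluates to True; otherwise it is a no-op.
    tf.summary.scalar("loss", 0.5, step=step)
```

The condition alone writes nothing; something inside the training loop still has to call `tf.summary.scalar`, which is exactly what `write_scalar_summaries` used to do.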
But I agree that it was quite weird to collect the logs this way, and reverting is not a good idea; instead, the logs should be collected only in `on_train_batch_end` of the `TensorBoard` callback.
One thing to note is that these logs were not actual batch-level summaries, which was a reason to remove them. They are accumulated from the beginning of the epoch to the current batch, since that is how `Metric`s work: they return the current cumulative average. So the documentation was misleading, and I don't think this type of summary is actually very helpful.
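As a small standalone illustration of the cumulative behaviour (not code from the issue):

```python
import tensorflow as tf

m = tf.keras.metrics.Mean()
m.update_state(1.0)       # value from batch 1
print(float(m.result()))  # 1.0
m.update_state(3.0)       # value from batch 2
print(float(m.result()))  # 2.0 -- running average over both batches, not 3.0
m.reset_state()           # Keras resets metrics only at the end of each epoch
```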
A more effective way to add true batch-level summaries for now is just to wrap everything in a summary writer context. Then you can also create batch-level summaries anywhere in the `Model` code for debugging.
For example:
```python
import tensorflow as tf

# Any tf.summary.* calls executed inside fit() now default to this writer.
debug_writer = tf.summary.create_file_writer(work_path + "/debug")
with debug_writer.as_default(step=model.step):
    model.fit(dataset)
debug_writer.flush()
```
It looks like the code you linked to could act as a replacement for the example I gave above, since it takes care of entering a writer context when training begins. So to get true batch summaries, you'd just need to set `update_freq` to the desired number of batches and add `tf.summary.scalar` calls on whatever tensors you want to log.
We could consider making the default `Model.train_step` implementation call `tf.summary.scalar` on every value that is added to a metric. Then the batch summaries would be correct.
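A minimal sketch of what such a `train_step` could look like (an illustration only, not the actual Keras implementation; `SummaryModel` is a hypothetical name, and `self._train_counter` is a private Keras batch counter that the `TensorBoard` callback also relies on):

```python
import tensorflow as tf

class SummaryModel(tf.keras.Model):
    """Hypothetical model whose train_step also emits per-batch summaries."""

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(y, y_pred)

        # A true per-batch value (not a cumulative average). It is only written
        # when a summary writer is active and its record_if condition holds,
        # e.g. when the TensorBoard callback is used with update_freq=N.
        tf.summary.scalar("batch_loss", loss, step=self._train_counter)

        return {m.name: m.result() for m in self.metrics}
```

With such a model, fitting with `callbacks=[tf.keras.callbacks.TensorBoard(log_dir, update_freq=100)]` should write `batch_loss` every 100 batches.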
> One thing to note is that these logs were not actual batch-level summaries, which was a reason to remove them. They are accumulated from the beginning of the epoch to the current batch, since that is how `Metric`s work: they return the current cumulative average. So the documentation was misleading, and I don't think this type of summary is actually very helpful.
I am not sure the documentation explicitly stated that the recorded numbers were per-batch. Instead, I would expect the metrics to be aggregated, and I would expect the recorded numbers to be the same as the ones shown by the `Progbar` on the console during training (which are also averaged from the beginning of the epoch, even the loss). If you have `freq=100`, the individual batch accuracies etc. will fluctuate a lot, while an average from the beginning of the epoch might be more informative.
> We could consider making the default `Model.train_step` implementation call `tf.summary.scalar` on every value that is added to a metric. Then the batch summaries would be correct.
I do not think that is a good solution -- if a user modifies `Model.train_step` and uses the `TensorBoard` callback, they would lose the logs, which seems counter-intuitive. If we want to collect raw batch logs, could it maybe happen in `make_train_function`?
But personally I would prefer to just call `tf.summary.scalar` in `on_train_batch_end` of the `TensorBoard` callback. Notably, I am currently using the following code in `on_train_batch_end` (to get batch summaries, so that I can track progress for runs which train a single epoch for a long time):
```python
if logs:
    for name, value in logs.items():
        tf.summary.scalar("batch_" + name, value, step=self._train_step)
```
Cheers!
TensorBoard does not log the batch metrics even when `update_freq=N` is set.
It seems one question to answer first is: do we want to log the aggregated metrics of only the last `N` batches, or the aggregate of all batches from the start of the epoch?
The fix is not clear yet. A related commit: https://github.com/keras-team/keras/commit/7d06227f3ef653a451fffe65d1d7dace7f7945df
Ok, I agree, it isn't a great idea to scatter this functionality across `Model` and the callback. The implementation is opaque and not easy to understand, and that was what prompted me to request removing it in the first place.
I disagree on the cumulative summaries front – I find it useful to have instantaneous batch summaries to get a sense of how much the loss function fluctuates. You can always reduce the noise by using the smoothing functionality in TensorBoard. This is what that feature was designed for: It gives you running averages rather than arbitrarily resetting them each epoch, and you can even control the amount of smoothing in the UI after they are logged.
At the very least, "batch summary" to me implies "summary over the batch", so if the code produces cumulative summaries, that should be more clearly documented.
> I disagree on the cumulative summaries front – I find it useful to have instantaneous batch summaries to get a sense of how much the loss function fluctuates. You can always reduce the noise by using the smoothing functionality in TensorBoard. This is what that feature was designed for: It gives you running averages rather than arbitrarily resetting them each epoch, and you can even control the amount of smoothing in the UI after they are logged.
On second thought, at least for loss, I think you are right -- maybe I am too accustomed to the previous implementation :-)
But for stateful metrics, this quantity is never even computed (i.e., even inside `train_step`, only the aggregated values are available), and for some metrics, averaging batch-level values can be quite different from computing the metric on the whole data. For example, if you compute, say, F1 in NER, sentences without entities usually give you an F1 of 100% (or 0%, because the quantity is not defined) but have no influence on the final precision & recall, so averaging them (e.g., via TensorBoard smoothing) is not a good idea.
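A toy numeric example (made-up numbers) of that difference, with one sentence containing no entities and one sentence with a single gold entity and two predictions, one of them correct:

```python
# Sentence A: no gold entities and no predictions -> per-sentence F1 reported as 1.0,
#             yet it contributes nothing to corpus-level precision/recall.
# Sentence B: 1 gold entity, 2 predictions, 1 correct -> P = 0.5, R = 1.0.
f1_b = 2 * 0.5 * 1.0 / (0.5 + 1.0)                      # ~0.67
macro_f1 = (1.0 + f1_b) / 2                             # ~0.83, average of per-sentence F1
micro_p, micro_r = 1 / 2, 1 / 1                         # totals over both sentences
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)  # ~0.67, corpus-level F1
```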
So maybe log the current value of the loss, and whatever the stateful metrics compute?
But the loss returned from `train_step` is already averaged (tracked by the `_loss_metric` in the `LossesContainer` and returned by `Model.compute_metrics`). If we wanted to avoid emitting the summaries directly from `train_step`, maybe the `LossesContainer` could be augmented to include the latest loss values, and we could try obtaining them in the `TensorBoard` callback and replace the loss means when successful?
From the Keras side, we should ultimately continue to support per-batch logging of metrics to TensorBoard, given the interest and for backward compatibility (ideally, without the need for an overridden `train_step()` or `call()`).
I'm investigating a solution and will post back as soon as I find something.
Hi, I've just seen this issue. According to the thread, should I subclass the callback to implement batch-wise collection like the following?
```python
from typing import Dict, Optional

import tensorflow as tf


class CustomTensorBoard(tf.keras.callbacks.TensorBoard):  # type: ignore
    """TensorBoard callback with update_freq=N functionality."""

    def _implements_train_batch_hooks(self) -> bool:
        # Make sure the batch-level hooks are called when update_freq is an int.
        return super()._implements_train_batch_hooks() or isinstance(
            self.update_freq, int
        )

    def on_train_batch_end(
        self,
        batch: int,
        logs: Optional[Dict[str, float]] = None,
    ) -> None:
        super().on_train_batch_end(batch, logs)
        # Assumes update_freq was passed as an integer number of batches.
        if batch % self.update_freq == 0 and logs is not None:
            with tf.summary.record_if(True), self._train_writer.as_default():
                for name, value in logs.items():
                    tf.summary.scalar("batch_" + name, value, step=self._train_step)
```
Hello all,
Just like @kyamagu, I ran into this problem after migrating from TensorFlow 1 to the latest 2.9.1. In my case, the information is properly reported per batch in TensorBoard, but only after the first epoch has elapsed. As a consequence, no information is reported for several minutes or even hours.
Depending on how long an epoch takes, this may be quite critical (although you can use the Keras progress bar to identify early divergence or numerical problems).
More importantly, as a temporary fix, @kyamagu's solution does the job and per-batch metrics are reported just after training starts. Thanks so much!
Hello everyone,
I also ran into the problem described above (`update_freq='batch'` in the `TensorBoard` callback not being taken into account). Thanks to @kyamagu's solution I was able to solve this issue (is it going into a release at some point?).
However, as mentioned by @jonycgn:
> I disagree on the cumulative summaries front – I find it useful to have instantaneous batch summaries to get a sense of how much the loss function fluctuates. You can always reduce the noise by using the smoothing functionality in TensorBoard. This is what that feature was designed for: It gives you running averages rather than arbitrarily resetting them each epoch, and you can even control the amount of smoothing in the UI after they are logged.
The cumulative aspect of `Metric`s is not really well-suited to plotting scalar values: first, because of what @jonycgn explained (i.e., the smoothing feature of TensorBoard should be solely responsible for how the plot looks), but also because it creates large gaps in the graph between epochs (as you can see in the image below).
So, I'm wondering, is it possible to obtain the actual loss value at each batch (not the accumulated one) so I can make my own `tf.summary.scalar(...)` call in `train_step`? Here I'm not sure, since @jonycgn also said:
> They are accumulated from the beginning of the epoch to the current batch, since that is how `Metric`s work. They return the current cumulative average.
Also, is the accumulated-values behavior in TensorBoard a deliberate choice, or is it destined to be fixed?
I think all options are important: 1) per batch, 2) every 1000 (user-configured) batches, 3) every epoch.
Keep in mind that sometimes there is no concept of an epoch, and per-batch logging is too expensive; then 2) becomes quite important.
Just adding some more user feedback: I also agree that having the immediate batch metrics is a key part of the user experience when training. Not only can it help you catch very quickly whether you might have done something wrong, it is also nice to see that something is happening. For example: if it takes 40 minutes to process an epoch of data, I don't want to wait 40 minutes to see how my model is performing.
I have added a section called "Batch-level logging" to a tutorial in the TensorBoard GitHub repository: https://github.com/tensorflow/tensorboard/blob/master/docs/scalars_and_keras.ipynb
I think the more common use case is addressed in the subsection titled "Instantaneous batch-level logging". I hope this guide is helpful to users who previously tried using `update_freq`.
Is it possible to add the same custom metric update frequency for the validation set as well? @kyamagu
@hellbago Probably you can override `_implements_test_batch_hooks()` and `on_test_batch_end()`.
I tried something like this but it didn't work:
```python
    # Additional methods for the CustomTensorBoard subclass above.
    def _implements_test_batch_hooks(self) -> bool:
        return super()._implements_test_batch_hooks() or isinstance(
            self.update_freq, int
        )

    def on_test_batch_end(
        self,
        batch: int,
        logs: Optional[Dict[str, float]] = None,
    ) -> None:
        super().on_test_batch_end(batch, logs)
        if batch % self.update_freq == 0 and logs is not None:
            with tf.summary.record_if(True), self._val_writer.as_default():
                for name, value in logs.items():
                    tf.summary.scalar("batch_" + name, value, step=self._test_step)
```
@kyamagu @nkovela1 Is it also possible to extend your solution to update every `update_freq` batches?
**System information.**
**Describe the problem.**
According to https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard, the value `update_freq=N` should collect logs every `N` batches. However, no logs are generated after batches, only after an epoch.
**Describe the current behavior.**
No logs are generated after `N` training batches.
**Describe the expected behavior.**
Logs should be generated after `N` training batches.
**Details**
The feature for collecting `batch_*` summaries was removed in https://github.com/keras-team/keras/commit/7d06227f3ef653a451fffe65d1d7dace7f7945df -- see how `write_scalar_summaries` was removed from `keras/engine/training.py`. It should be enough to just revert the given commit.