keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.

TensorBoard callback does not collect data per batch (accidentally removed) #641

Open foxik opened 2 years ago

foxik commented 2 years ago

Describe the problem.

According to https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard, the value update_freq=N should collect logs every N batches. However, no logs are generated after batches, only after an epoch.

Describe the current behavior.

No logs are generated after N training batches.

Describe the expected behavior.

Logs should be generated after N training batches.

Details

The feature for collecting batch_* summaries was removed in https://github.com/keras-team/keras/commit/7d06227f3ef653a451fffe65d1d7dace7f7945df -- see how write_scalar_summaries was removed from keras/engine/training.py.

It should be enough to just revert the given commit.

foxik commented 2 years ago

Adding the author of the commit that removed this documented functionality: @jonycgn.

foxik commented 2 years ago

Alternatively, some variant of write_scalar_summaries could be moved to the on_train_batch_end method of the TensorBoard callback.

jonarchist commented 2 years ago

The mentioned commit didn't remove functionality from the TensorBoard callback, but from Model. The problem with these summaries was that they would be written unconditionally after each training step, regardless of whether the TensorBoard callback was even used. So reverting wouldn't make the code behave correctly.

The functionality should be corrected in the callback. It is unclear to me why it doesn't exist. It seems the update_freq parameter is simply ignored unless it is set to 'epoch'.

Adding @rchao who might be more knowledgeable on this.

foxik commented 2 years ago

Hi,

thanks for the answer! The commit actually did remove functionality from the TensorBoard callback -- the callback just sets the appropriate tf.summary.record_if condition in https://github.com/keras-team/keras/blob/46121eed08d0feef743eacef3b66206df45cf656/keras/callbacks.py#L2362-L2373 but it relied on the write_scalar_summaries call in keras/engine/training.py, the one you removed. So once this "generic summary emitter" was removed, the batch-level logs from the TensorBoard callback broke.
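
To illustrate (a minimal standalone sketch; the path, frequency, and dummy loss value below are made up): the callback essentially only opens a writer and installs a record_if condition, while the actual tf.summary.scalar calls came from write_scalar_summaries:

import tensorflow as tf

writer = tf.summary.create_file_writer("/tmp/tb_logs")  # illustrative path
update_freq = 100
step = tf.Variable(0, dtype=tf.int64)

# The TensorBoard callback effectively sets up this context; the scalar call
# inside the loop is what write_scalar_summaries used to provide from
# keras/engine/training.py before it was removed.
with writer.as_default(), tf.summary.record_if(lambda: step % update_freq == 0):
    for _ in range(1000):
        step.assign_add(1)
        tf.summary.scalar("batch_loss", 0.1, step=step)  # dummy loss value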

But I agree that collecting logs this way was quite odd, and reverting is not a good idea; instead, the logs should be collected only in on_train_batch_end of the TensorBoard callback.

jonarchist commented 2 years ago

One thing to note is also that these logs were not actual batch-level summaries, which was a reason to remove them. They are accumulated from the beginning of the epoch to the current batch, since that is how Metrics work. They return the current cumulative average. So the documentation was misleading, and I don't think this type of summary is actually very helpful.

A more effective way to add true batch-level summaries for now is just to wrap everything in a summary writer context. Then you can also create batch-level summaries anywhere in the Model code for debugging.

For example:

debug_writer = tf.summary.create_file_writer(work_path + "/debug")
# Use the optimizer's iteration counter as the default summary step, since
# Model itself has no generic `step` attribute.
with debug_writer.as_default(step=model.optimizer.iterations):
    model.fit(dataset)
debug_writer.flush()

jonarchist commented 2 years ago

It looks like the code you linked to could act as a replacement for the example I gave above, since it takes care of entering a writer context when training begins.

https://github.com/keras-team/keras/blob/46121eed08d0feef743eacef3b66206df45cf656/keras/callbacks.py#L2362-L2373

So to get true batch summaries, you'd just need to set update_freq to the desired number of batches and add tf.summary.scalar calls on whatever tensors you want to log.
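
For instance, a rough sketch of that idea (assuming a subclassed model and `data` being an (x, y) tuple; the logic mirrors the documented custom train_step pattern, plus one extra scalar call):

import tensorflow as tf

class SummaryModel(tf.keras.Model):
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(y, y_pred)
        # Instantaneous per-batch loss; whether it is actually written is
        # decided by the writer / record_if context set up by the callback.
        tf.summary.scalar("batch_loss", loss, step=self.optimizer.iterations)
        return {m.name: m.result() for m in self.metrics}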

jonarchist commented 2 years ago

We could consider making the default Model.train_step implementation call tf.summary.scalar on every value that is added to a metric. Then the batch summaries would be correct.

foxik commented 2 years ago

One thing to note is also that these logs were not actual batch-level summaries, which was a reason to remove them. They are accumulated from the beginning of the epoch to the current batch, since that is how Metrics work. They return the current cumulative average. So the documentation was misleading, and I don't think this type of summary is actually very helpful.

I am not sure the documentation explicitly stated that the recorded numbers were per-batch. Instead, I would expect the metrics to be aggregated, and I would expect the recorded numbers to be the same as the ones shown on the Progbar on the console during training (which are also averaged from the beginning of the epoch, even the loss). If you have freq=100, the individual batch accuracies etc. will fluctuate a lot, while an average from the beginning of the epoch might be more informative.

We could consider making the default Model.train_step implementation call tf.summary.scalar on every value that is added to a metric. Then the batch summaries would be correct.

I do not think that is a good solution -- if a user overrides Model.train_step and uses the TensorBoard callback, they would lose the logs, which seems counter-intuitive. If we want to collect raw batch logs, could it maybe happen in make_train_function?

But personally I would prefer to just call tf.summary.scalar in on_train_batch_end of the TensorBoard callback. Notably, I am currently using the following code in on_train_batch_end (to get batch summaries, so that I can track progress for runs which train a single epoch for a long time):

    if logs:
        for name, value in logs.items():
            tf.summary.scalar("batch_" + name, value, step=self._train_step)

Cheers!

haifeng-jin commented 2 years ago

TensorBoard does not log the batch metrics even when update_freq=N is set. It seems one question to answer first is: do we want to log the metrics aggregated over only the last N batches, or aggregated over all batches from the start of the epoch? The fix is not clear yet. A related commit: https://github.com/keras-team/keras/commit/7d06227f3ef653a451fffe65d1d7dace7f7945df

jonarchist commented 2 years ago

Ok, I agree, it isn't a great idea to scatter this functionality across Model and the callback. The implementation is opaque and not easy to understand, and that was what prompted me to request removing it in the first place.

I disagree on the cumulative summaries front – I find it useful to have instantaneous batch summaries to get a sense of how much the loss function fluctuates. You can always reduce the noise by using the smoothing functionality in TensorBoard. This is what that feature was designed for: It gives you running averages rather than arbitrarily resetting them each epoch, and you can even control the amount of smoothing in the UI after they are logged.

At the very least, "batch summary" to me implies "summary over the batch", so if the code produces cumulative summaries, that should be more clearly documented.

foxik commented 2 years ago

I disagree on the cumulative summaries front – I find it useful to have instantaneous batch summaries to get a sense of how much the loss function fluctuates. You can always reduce the noise by using the smoothing functionality in TensorBoard. This is what that feature was designed for: It gives you running averages rather than arbitrarily resetting them each epoch, and you can even control the amount of smoothing in the UI after they are logged.

On second thought, at least for loss, I think you are right -- maybe I am too accustomed to the previous implementation :-)

But for stateful metrics, this quantity is never even computed (i.e., even inside train_step, only the aggregated values are available), and for some metrics, averaging batch-level values can be quite different from computing the metric on the whole data. For example, if you compute, say, F1 in NER, sentences without entities usually give you an F1 of 100% (or 0%, because the quantity is not defined), but they have no influence on the final precision & recall, so averaging them (e.g., via TensorBoard smoothing) is not a good idea.
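
A toy illustration of that effect (all numbers are made up):

# Two "batches" of NER predictions: the first contains no entities at all and
# is conventionally scored F1 = 1.0, the second has real entities.
def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

batches = [{"tp": 0, "fp": 0, "fn": 0}, {"tp": 5, "fp": 5, "fn": 5}]
per_batch = [1.0 if sum(b.values()) == 0 else f1(**b) for b in batches]
print(sum(per_batch) / len(per_batch))                        # 0.75 -- average of per-batch F1
pooled = {k: sum(b[k] for b in batches) for k in batches[0]}
print(f1(**pooled))                                           # 0.5  -- corpus-level F1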

So maybe log the current value of the loss, plus whatever the stateful metrics compute?

But the loss returned from train_step is already averaged (tracked by the _loss_metric in the LossesContainer and returned by Model.compute_metrics). If we want to avoid emitting the summaries directly from train_step, maybe the LossesContainer could be augmented to expose the latest loss values, and the TensorBoard callback could try to obtain them and replace the loss means when available?
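
Purely as a hypothetical sketch of that hand-off, meant as a method of the TensorBoard callback (the latest_loss_values attribute below does not exist today; it is only there to illustrate the idea):

# Hypothetical: prefer raw per-batch loss values if the (not yet existing)
# attribute were available, and fall back to the running means from `logs`.
def on_train_batch_end(self, batch, logs=None):
    logs = dict(logs or {})
    latest = getattr(self.model.compiled_loss, "latest_loss_values", None)  # hypothetical attribute
    if latest:
        logs.update(latest)  # replace running means with instantaneous values
    for name, value in logs.items():
        tf.summary.scalar("batch_" + name, value, step=self._train_step)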

rchao commented 2 years ago

From the Keras side, we should ultimately continue to support per-batch logging of metrics to TensorBoard, given the interest and for backward compatibility (ideally without the need for an overridden train_step() or call()).

I'm investigating a solution and will post back as soon as I find something.

kyamagu commented 2 years ago

Hi, I've just seen this issue. Based on this thread, should I subclass the callback to implement batch-wise collection like the following?

from typing import Dict, Optional

import tensorflow as tf


class CustomTensorBoard(tf.keras.callbacks.TensorBoard):  # type: ignore
    """TensorBoard callback with update_freq=N functionality."""

    def _implements_train_batch_hooks(self) -> bool:
        return super()._implements_train_batch_hooks() or isinstance(
            self.update_freq, int
        )

    def on_train_batch_end(
        self,
        batch: int,
        logs: Optional[Dict[str, float]] = None,
    ) -> None:
        super().on_train_batch_end(batch, logs)
        # update_freq may also be the string 'epoch'; only handle the integer case here.
        if isinstance(self.update_freq, int) and batch % self.update_freq == 0 and logs is not None:
            with tf.summary.record_if(True), self._train_writer.as_default():
                for name, value in logs.items():
                    tf.summary.scalar("batch_" + name, value, step=self._train_step)
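
A usage sketch, assuming an already-compiled `model` and a tf.data dataset `train_dataset` (log directory and frequency are illustrative):

model.fit(
    train_dataset,
    epochs=10,
    callbacks=[CustomTensorBoard(log_dir="logs/run1", update_freq=50)],
)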

fperezgamonal commented 2 years ago

Hello all,

Just like @kyamagu, I ran into this problem after migrating from TensorFlow 1 to the latest 2.9.1. In my case, the information is properly reported per batch in TensorBoard, but only after the first epoch has elapsed. As a consequence, no information is reported for several minutes or even hours.

Depending on how long an epoch takes, this may be quite critical (although you can use the Keras progress bar to identify early divergence or numerical problems).

More importantly, as a temporary fix, @kyamagu's solution does the job and per-batch metrics are reported right after training starts. Thanks so much!

BenoitLeguay commented 2 years ago

Hello everyone, I also ran into the problem described above (update_freq='batch' in the TensorBoard callback not being taken into account). Thanks to @kyamagu's solution I was able to work around this issue (is it going into a release at some point?).

However, as mentioned by @jonycgn:

I disagree on the cumulative summaries front – I find it useful to have instantaneous batch summaries to get a sense of how much the loss function fluctuates. You can always reduce the noise by using the smoothing functionality in TensorBoard. This is what that feature was designed for: It gives you running averages rather than arbitrarily resetting them each epoch, and you can even control the amount of smoothing in the UI after they are logged.

The cumulative aspect of Metrics is not really well-suited to plotting scalar values. First, because of what @jonycgn explained (i.e., the smoothing feature of TensorBoard should be the only thing responsible for how the plot looks). But also because it creates large gaps in the graph between epochs (as you can see in the image below).

[image: TensorBoard scalar plot showing large gaps between epochs]

So I'm wondering: is it possible to obtain the actual loss value at each batch (not the accumulated one), so I can make my own tf.summary.scalar(...) call in train_step? Here I'm not sure, since @jonycgn also said:

They are accumulated from the beginning of the epoch to the current batch, since that is how Metrics work. They return the current cumulative average.

Also, is the accumulated-values behavior in TensorBoard a deliberate choice, or is it going to be fixed?

tanzhenyu commented 2 years ago

I think all options are important: 1) per batch, 2) every 1000 (user-configured) batches, 3) every epoch.

Keep in mind that sometimes there is no concept of an epoch and per-batch logging is too expensive; in those cases, 2) becomes quite important.

znorman-harris commented 2 years ago

Just adding some more user feedback: I also agree that having immediate batch metrics is a key part of the user experience when training. Not only can it help you catch very quickly whether you have done something wrong, it is also nice to see that something is happening. For example: if it takes 40 minutes to process an epoch of data, I don't want to wait 40 minutes to see how my model is performing.

nkovela1 commented 2 years ago

I have added a section called "Batch-level logging" to a tutorial in the TensorBoard GitHub repo: https://github.com/tensorflow/tensorboard/blob/master/docs/scalars_and_keras.ipynb

I think the more common use case is addressed in the section titled "Instantaneous batch-level logging." I hope this guide is helpful to users who previously tried using update_freq.

hellbago commented 2 years ago

Is it possible to add the same custom metric update frequency for the validation set as well? @kyamagu

kyamagu commented 2 years ago

@hellbago Probably you can override _implements_test_batch_hooks() and on_test_batch_end().

hellbago commented 2 years ago

I tried something like this but it didn't work:

    # Added as additional methods of the CustomTensorBoard subclass above:

    def _implements_test_batch_hooks(self) -> bool:
        return super()._implements_test_batch_hooks() or isinstance(
            self.update_freq, int
        )

    def on_test_batch_end(
        self,
        batch: int,
        logs: Optional[Dict[str, float]] = None,
    ) -> None:
        super().on_test_batch_end(batch, logs)
        if batch % self.update_freq == 0 and logs is not None:
            with tf.summary.record_if(True), self._val_writer.as_default():
                for name, value in logs.items():
                    tf.summary.scalar("batch_" + name, value, step=self._test_step)

@kyamagu

staubda commented 1 year ago

@nkovela1 is it also possible to extend your solution to update every update_freq batches?