tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Error using TensorBoard callback in graph mode in TF 2.4 #44563

Closed drasmuss closed 1 week ago

drasmuss commented 3 years ago

System information

Describe the current behavior

Attempting to use tf.keras.callbacks.TensorBoard with eager execution disabled results in an error in TF 2.4.

Describe the expected behavior

There should be no error, callback should work as normal.

Standalone code to reproduce the issue

import tensorflow as tf
import numpy as np

tf.compat.v1.disable_eager_execution()

inp = tf.keras.Input((1,))
out = tf.keras.layers.Dense(1)(inp)

model = tf.keras.Model(inp, out)

model.predict(
    np.zeros((32, 1)),
    callbacks=[tf.keras.callbacks.TensorBoard(log_dir="test")],
)

Other info / logs

Traceback (most recent call last):
  File ".../tmp.py", line 11, in <module>
    model.predict(
  File "...\site-packages\tensorflow\python\keras\engine\training_v1.py", line 982, in predict
    return func.predict(
  File "...\site-packages\tensorflow\python\keras\engine\training_arrays_v1.py", line 706, in predict
    return predict_loop(
  File "...\site-packages\tensorflow\python\keras\engine\training_arrays_v1.py", line 217, in model_iteration
    callbacks = cbks.configure_callbacks(
  File "...\site-packages\tensorflow\python\keras\callbacks.py", line 115, in configure_callbacks
    callback_list = CallbackList(callbacks)
  File "...\site-packages\tensorflow\python\keras\callbacks.py", line 237, in __init__
    self._should_call_train_batch_hooks = any(
  File "...\site-packages\tensorflow\python\keras\callbacks.py", line 238, in <genexpr>
    cb._implements_train_batch_hooks() for cb in self.callbacks)
  File "...\site-packages\tensorflow\python\keras\callbacks.py", line 2310, in _implements_train_batch_hooks
    return self._should_trace  # Only call batch hooks when tracing is enabled
AttributeError: 'TensorBoard' object has no attribute '_should_trace'
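
The last two frames of the traceback show `_implements_train_batch_hooks` reading `self._should_trace` on an instance where that attribute was never assigned. The sketch below reproduces that failure pattern without TensorFlow. `BaseCallback`, `GraphModeCallback`, and `check_hooks` are hypothetical stand-ins, and the idea that the graph-mode `TensorBoard` skips the `__init__` that sets the flag is an assumption; the traceback alone does not confirm it.

```python
# Stand-in classes only; none of this is TensorFlow source code.


class BaseCallback:
    """Hypothetical parent whose __init__ sets the flag that its method reads."""

    def __init__(self, profile_batch=2):
        # Assumption: the flag is derived from a constructor argument.
        self._should_trace = profile_batch != 0

    def _implements_train_batch_hooks(self):
        return self._should_trace


class GraphModeCallback(BaseCallback):
    """Hypothetical subclass that defines its own __init__ and skips the parent's."""

    def __init__(self, log_dir):
        # super().__init__() is deliberately not called, so _should_trace
        # is never assigned on this instance.
        self.log_dir = log_dir


def check_hooks(callback):
    """Return the hook flag, or the AttributeError message if the flag is missing."""
    try:
        return callback._implements_train_batch_hooks()
    except AttributeError as exc:
        return str(exc)


print(check_hooks(BaseCallback()))
print(check_hooks(GraphModeCallback(log_dir="test")))
```

The inherited method works on the parent and fails on the subclass, which is the same shape as the error reported here: the method exists, but the state it depends on does not.
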
amahendrakar commented 3 years ago

Was able to reproduce the issue with TF v2.4.0rc0 and TF-nightly. However, code works fine with TF v2.3. Please find the attached gist. Thanks!

goldiegadde commented 3 years ago

@drasmuss Thanks for filing the issue. Is there a reason that you disable eager execution?

drasmuss commented 3 years ago

I have found that using the old training_v1 Keras implementation (which is what you get when disabling eager execution) is still faster for some models.

tomerk commented 3 years ago

Hi @drasmuss can you share what the models are that you've found it's faster for? We want to make sure to fix up any performance regressions like that.

drasmuss commented 3 years ago

You can see an example here https://github.com/nengo/nengo-dl/blob/master/nengo_dl/tests/test_benchmarks.py#L238, specifically comparing the two test cases

(benchmarks.lmu(1000, 1, native_nengo=True), True, 100, True, 1.3, 1.5),
(benchmarks.lmu(1000, 1, native_nengo=True), True, 100, False, 1.05, 1.25),

The second case runs the same benchmark but with

tf.compat.v1.disable_eager_execution()
tf.compat.v1.disable_control_flow_v2()

and it runs about 0.25s (20%) faster.

drasmuss commented 3 years ago

Here is a Colab gist demonstrating the same thing if that helps https://colab.research.google.com/gist/drasmuss/45826df4e27dc6a21be961690d2a043f/performance-demo.ipynb

jvishnuvardhan commented 3 years ago

@drasmuss Is this still an issue? I ran your code (Colab without GPU) with tf-nightly and I see nearly identical execution times, as shown below. Please check the gist here. Thanks!

Please note that these results are without GPU

Execution time: 14.798911161999968
Eager 14.798911161999968

Execution time: 14.530144375999953
Non-eager 14.530144375999953
drasmuss commented 3 years ago

I think we need to test it on the GPU to see whether this is still an issue; it's hard to know whether the CPU results are indicative of GPU performance.

drasmuss commented 3 years ago

Had a chance to test this, and it looks like the problem is the same (possibly worse) in tf-nightly. Here are my results (note I couldn't get tf-nightly to run with GPU support on Colab, so this is on a local RTX 3090):

tf-nightly
Eager: 0.8343444939237088
Non-eager: 0.6216731490567327
(~33% slowdown in eager mode)

Also note that tf-nightly seems to be significantly slower than older versions of tensorflow, e.g.

tensorflow 2.2.2
Eager: 0.7077119939494878
Non-eager: 0.5124576301313937
(~20% slowdown in tf-nightly vs tf 2.2; possibly related to https://github.com/tensorflow/tensorflow/issues/46515)
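
For reference, a small sketch of how the quoted percentages relate to the raw timings reported in this comment. The helper `slowdown_percent` is a hypothetical name introduced here, and the definition used (how much longer the slower run takes, relative to the faster one) is one of several reasonable conventions, so the figures it prints may differ slightly from the rounded ones quoted.

```python
def slowdown_percent(slow, fast):
    """Percentage by which the `slow` timing exceeds the `fast` timing."""
    return (slow / fast - 1.0) * 100.0


# Timings in seconds, copied from the results reported in this comment.
nightly_eager = 0.8343444939237088
nightly_graph = 0.6216731490567327
tf222_eager = 0.7077119939494878
tf222_graph = 0.5124576301313937

comparisons = {
    "tf-nightly: eager vs non-eager": (nightly_eager, nightly_graph),
    "tf 2.2.2: eager vs non-eager": (tf222_eager, tf222_graph),
    "eager: tf-nightly vs tf 2.2.2": (nightly_eager, tf222_eager),
    "non-eager: tf-nightly vs tf 2.2.2": (nightly_graph, tf222_graph),
}

for label, (slow, fast) in comparisons.items():
    print(f"{label}: {slowdown_percent(slow, fast):.1f}% slower")
```
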

j-beaver commented 3 years ago

Same issue for me with Keras 2.3.1.

from gym import Env
from gym.spaces import Discrete, Box

class FooEnv(Env):
    metadata = {'render.modes': ['human']}

    def __init__(self, training=True):
        self.training = training
        self.action_space = Discrete(5)
        self.size = 11
        self.observation_space = Box(low=np.array([0, 0]), high=np.array([self.size - 1, 1]))
        self.position = -1
        self.brightness = 0.0

        self.state = [self.position, self.brightness]
        self.action_steps = 30

        self.action = -1

        self.done = False
        self.reward = 0
        self.info = {}
        self.seed(seed=45)

    def get_position(self):
        noise_factor = 0
        position = 1 - (self.position / 5) * (self.brightness) + noise_factor
        return position

    def step(self, action, ext_position=-10):

        action -= 2
        self.action = action

        self.brightness += int(action) / 10
        if self.brightness > 1:
            self.brightness = 1
        if self.brightness < 0:
            self.brightness = 0

        if ext_position == -10:
            self.position += self.get_position()
        else:
            self.position = ext_position
        if self.position < 0:
            self.position = 0
        if self.position > self.size - 1:
            self.position = self.size - 1
        self.state = [self.position, self.brightness]

        if 10 > self.position >= 0:
            self.reward = 1 - abs(self.position / 10 - self.brightness)

        elif self.position >= 10:
            self.reward = -100

        self.action_steps -= 1

        self.done = self.check()

        self.info = {}
        return self.state, self.reward, self.done, self.info

    def check(self):
        done = self.done
        if self.action_steps == 0:
            done = True
        return done

    def reset(self):
        self.position = -1
        self.brightness = 0.0
        self.state = [self.position, self.brightness]
        self.action_steps = 30
        self.reward = 0
        self.action = -0
        self.done = False
        self.info = {}
        return self.state

    def render(self, mode='human', close=False):
        for i in range(self.size):
            if self.position >= i > self.position - 1:
                print("+", end='')
            else:
                print("-", end='')
        if self.done:
            print("X| Pos:" + str(self.position) + " Brightness:" + str(self.brightness) + " Done:" + str(self.done) +
                  " Reward:" + str(self.reward) + " Steps:" + str(self.action_steps) + " Action:" + str(self.action))
        else:
            print("O| Pos:" + str(self.position) + " Brightness:" + str(self.brightness) + " Done:" + str(self.done) +
                  " Reward:" + str(self.reward) + " Steps:" + str(self.action_steps) + " Action:" + str(self.action))

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

env = FooEnv()
env.seed(0)

states = env.observation_space.shape
actions = env.action_space.n

def build_model(states, actions):
    model = Sequential()
    model.add(Flatten(input_shape=(1,) + states))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

from rl.agents import DQNAgent

from keras.callbacks import TensorBoard
from rl.callbacks import ModelIntervalCheckpoint, FileLogger
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory

model = build_model(states, actions)
model.summary()

def build_agent(model, actions):

    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1, value_min=0.1, value_test=0.05,
                                  nb_steps=500)
    memory = SequentialMemory(limit=10000, window_length=1)

    dqn = DQNAgent(model=model, memory=memory, policy=policy, enable_double_dqn=True,
                   nb_actions=actions, gamma=.98, nb_steps_warmup=100, target_model_update=1e-2)
    return dqn

callbacks = [TensorBoard(log_dir='./weights_test')]

dqn = build_agent(model, actions)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=10000, log_interval=1000, nb_max_episode_steps=50, visualize=False, verbose=1,
        callbacks=callbacks)
sushreebarsa commented 3 years ago

Was able to reproduce the issue in TF v2.5. Please find the gist here. Thanks!

oawxkw commented 11 months ago

I also observed that the following alternative name of the API shows the same behavior: an `object has no attribute` exception is raised when eager execution is disabled.

This behavior still exists in tensorflow nightly (2.15.0-dev20230907), and users should be cautious when using them on both CPU and GPU.

Code to reproduce the issue in tf.compat.v1.keras.callbacks.TensorBoard

```python
import numpy as np
import tensorflow as tf

print(tf.version.GIT_VERSION, tf.version.VERSION, flush=True)
print(tf.config.list_physical_devices(), flush=True)

tf.compat.v1.disable_eager_execution()

inp = tf.keras.Input((1,))
out = tf.keras.layers.Dense(1)(inp)
model = tf.keras.Model(inp, out)

try:
    model.predict(
        np.zeros((32, 1)),
        callbacks=[tf.compat.v1.keras.callbacks.TensorBoard(log_dir="test")],
    )
except Exception as e:
    print("Failed! Error:", str(e), flush=True)
else:
    print("Success!", flush=True)
```

On my GPU machine, the above code produces the following output, and the "no attribute" error is raised.

```text
v2.14.0-rc0-34-gdd01672d9a9 2.14.0-rc1
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Failed! Error: 'TensorBoard' object has no attribute '_should_trace'
```

This behavior is also reproducible on my CPU machine:

```text
v2.14.0-rc0-34-gdd01672d9a9 2.14.0-rc1
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
Failed! Error: 'TensorBoard' object has no attribute '_should_trace'
```
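
A possible stopgap, sketched under the assumption that the missing `_should_trace` attribute is the only thing that breaks: set the attribute on the callback instance before handing it to Keras. The helper `ensure_attr` and the class `StandInCallback` are hypothetical names used for illustration, and this has not been verified here against any TensorFlow release.

```python
def ensure_attr(obj, name, default):
    """Set attribute `name` on `obj` to `default` only if it is currently missing."""
    if not hasattr(obj, name):
        setattr(obj, name, default)
    return obj


class StandInCallback:
    """Hypothetical stand-in for a callback whose flag was never initialised."""

    def _implements_train_batch_hooks(self):
        return self._should_trace


# Without the guard this call would raise AttributeError.
cb = ensure_attr(StandInCallback(), "_should_trace", False)
print(cb._implements_train_batch_hooks())
```

With a real callback the call would look like `ensure_attr(tf.compat.v1.keras.callbacks.TensorBoard(log_dir="test"), "_should_trace", False)`. Whether the rest of the graph-mode code path then runs cleanly is a separate question that this sketch does not answer.
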
tilakrayal commented 4 weeks ago

Hi,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may no longer be relevant to the current state of the code base.

The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate.

Please follow the release notes to stay up to date with the latest developments in the TensorFlow space.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

google-ml-butler[bot] commented 1 week ago

Are you satisfied with the resolution of your issue?