tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

Graph visualization failed: GraphDefs failed to reconcile. #1929

Open stephanwlee opened 5 years ago

stephanwlee commented 5 years ago

In TensorFlow v2, the code below can cause a GraphDef reconciliation error.

import tensorflow as tf

writer = tf.summary.create_file_writer('logs')

@tf.function
def foo(x):
  return x ** 2

with writer.as_default():
  tf.summary.trace_on()
  foo(1)
  foo(2)
  tf.summary.trace_export("foo", step=0)  # TensorBoard's graph plugin later fails to reconcile the two traced GraphDefs

Depending on the argument, tf.function (really, AutoGraph) creates ops whose names are unique within a GraphDef but not globally unique. In the example above, two GraphDefs (one from foo(1) and another from foo(2)) will be written out, and they can collide badly in names and content.

In such cases, instead of showing incorrect graph content, TensorBoard throws an error.
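
For illustration, here is a minimal sketch (assuming TF 2.x) of why two GraphDefs get written: distinct Python scalar arguments retrace the function, producing one ConcreteFunction (and one graph) per value.

import tensorflow as tf

@tf.function
def foo(x):
  return x ** 2

# Python ints are part of the tracing cache key, so foo(1) and foo(2)
# each get their own graph; TensorBoard later tries to merge both.
g1 = foo.get_concrete_function(1).graph
g2 = foo.get_concrete_function(2).graph
print(g1 is g2)  # False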

n-chennakeshava commented 5 years ago

I get the same error when I have multiple @tf.functions. I am working on a distributed learning project across multiple GPUs: one @tf.function for the train loop and another for the test loop.

with strategy.scope():
    @tf.function
    def distributed_train_step(dataset_inputs):
          (...)
    @tf.function
    def distributed_test_step(dataset_inputs):
          (...)

    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    logdir = 'logs/func/%s' % stamp
    writer = tf.summary.create_file_writer(logdir)
    tf.summary.trace_on(graph=True)

.
.
.
.

with writer.as_default():
    tf.summary.trace_export(name="my_func_trace",step=0)

stephanwlee commented 5 years ago

How are you invoking your tf.functions? Are they writing to the same writer? If so, this is working as intended: two tf.functions can have GraphDefs whose nodes share a name but differ in type/metadata.
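
A sketch of one way around this, assuming TF 2.x (the logdir names below are illustrative): give each tf.function its own writer, so the two GraphDefs end up in separate runs and are never combined.

import tensorflow as tf

train_writer = tf.summary.create_file_writer('logs/train_fn')
test_writer = tf.summary.create_file_writer('logs/test_fn')

@tf.function
def train_step(x):
  return x * 2

@tf.function
def test_step(x):
  return x + 1

with train_writer.as_default():
  tf.summary.trace_on(graph=True)
  train_step(tf.constant(1.0))
  tf.summary.trace_export(name='train_graph', step=0)

with test_writer.as_default():
  tf.summary.trace_on(graph=True)
  test_step(tf.constant(1.0))
  tf.summary.trace_export(name='test_graph', step=0)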

luvwinnie commented 4 years ago

@stephanwlee If we use multiple GPUs with train_step and test_step tf.functions, how should we resolve this problem? I am facing the same problem, which produces the errors below.

Traceback (most recent call last):
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/tensorboard/plugins/graph/graph_util.py", line 118, in combine_graph_defs
    lambda n: n.name)
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/tensorboard/plugins/graph/graph_util.py", line 85, in _safe_copy_proto_list_values
    raise _SameKeyDiffContentError(key)
tensorboard.plugins.graph.graph_util._SameKeyDiffContentError: sparse_categorical_crossentropy/Shape

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/tensorboard/plugins/graph/graphs_plugin.py", line 225, in graph_route
    result = self.graph_impl(run, tag, is_conceptual, limit_attr_size, large_attrs_key)
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/tensorboard/plugins/graph/graphs_plugin.py", line 169, in graph_impl
    graph_util.combine_graph_defs(graph, func_graph.pre_optimization_graph)
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/tensorboard/plugins/graph/graph_util.py", line 124, in combine_graph_defs
    'but contents are different: %s') % exc)
ValueError: Cannot combine GraphDefs because nodes share a name but contents are different: sparse_categorical_crossentropy/Shape

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/werkzeug/serving.py", line 304, in run_wsgi
    execute(self.server.app)
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/werkzeug/serving.py", line 292, in execute
    application_iter = app(environ, start_response)
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/tensorboard/backend/application.py", line 164, in wrapper
    return wsgi_app(*args)
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/tensorboard/backend/application.py", line 419, in __call__
    return self.exact_routes[clean_path](environ, start_response)
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/werkzeug/wrappers/base_request.py", line 237, in application
    resp = f(*args[:-2] + (request,))
  File "/home/usr1/.virtualenvs/tensorflow-2.1/lib/python3.7/site-packages/tensorboard/plugins/graph/graphs_plugin.py", line 227, in graph_route
    return http_util.Respond(request, e.message, 'text/plain', code=400)
AttributeError: 'ValueError' object has no attribute 'message'
E1204 12:24:01.568600 140642150078208 directory_watcher.py:242] File model_dir/logs-new/20191204-122321/events.out.tfevents.1575429801.pc-01.36116.63.v2 updated even though the current file is model_dir/logs-new/20191204-122321/events.out.tfevents.1575429824.pc-01.profile-empty

naturomics commented 4 years ago

Any update?

jmerizia commented 4 years ago

The error message AttributeError: 'ValueError' object has no attribute 'message' isn't very helpful. There's a bug in the error handling: https://github.com/tensorflow/tensorboard/blob/1780833b30d953509200bf9560be2ba42fabe9ff/tensorboard/plugins/graph/graphs_plugin.py#L323 should be:

return http_util.Respond(request, str(e), 'text/plain', code=400)

However, that only gets us a step closer. Running the original code, the actual error message (which TensorBoard should, but doesn't, propagate to the UI) is: Cannot combine GraphDefs because nodes share a name but contents are different: Const

As @stephanwlee mentioned, this is a GraphDef naming collision.

I think the simplest workaround is to call trace_on/trace_export separately around each graph call, like this:

import tensorflow as tf

writer = tf.summary.create_file_writer('ex_logs')

@tf.function
def foo(x):
    return x ** 2

with writer.as_default():
    tf.summary.trace_on()
    foo(1)
    tf.summary.trace_export("foo1", step=0)

with writer.as_default():
    tf.summary.trace_on()
    foo(2)
    tf.summary.trace_export("foo2", step=0)

Note that trace_export will also stop tracing (https://www.tensorflow.org/api_docs/python/tf/summary/trace_on?version=stable)

This ensures that each trace is separately tagged. This is a debugging tool for visualizing the network graph, and it makes sense that you'd want to profile just a single call of the graph. Tracing is something I'd imagine you wouldn't want to leave on while training, as profiling is expensive anyway.

qo4on commented 4 years ago

This official tutorial in Colab returns an error when I choose the keras or batch_2 tag [screenshot]. The Download PNG button doesn't work either [screenshot].

zhang12300 commented 4 years ago

same problem

markelsanz14 commented 4 years ago

I would suggest exporting them as different traces with different names. That seems to work for me.

Instead of this:

with writer.as_default():
  tf.summary.trace_on()
  foo(1)
  foo(2)
  tf.summary.trace_export("foo", step=0)

Do this:

with writer.as_default():
  tf.summary.trace_on()
  foo(1)
  tf.summary.trace_export("foo1", step=0)  # trace_export also stops tracing
  tf.summary.trace_on()
  foo(2)
  tf.summary.trace_export("foo2", step=0)

Mr-tooth commented 4 years ago

I can't tell where in my code the error originates.

Harrypotterrrr commented 4 years ago

Any update on this issue? It has been more than a year since it was opened 😢

GuillaumeMougeot commented 4 years ago

I had the same issue. TensorBoard needs unique names for the graph variables (I don't know why, and I hope this issue will be fixed). In your case this piece of code should fix it:

import tensorflow as tf

@tf.function
def foo(x):
  return x ** 2

writer = tf.summary.create_file_writer('logs\\')
with writer.as_default():
  tf.summary.trace_on()
  foo(tf.Variable(1, name='foo1'))  # define a unique name for the variable
  foo(tf.Variable(2, name='foo2'))
  tf.summary.trace_export("foo", step=0)

This issue also exists when subclassing tf.Module. There, self.name_scope (or tf.name_scope) can be used when defining the module variables (whether or not it also wraps the other operations). Here is an example of a custom Dense layer:

import tensorflow as tf
import numpy as np

class Dense(tf.Module):
  # Fully-connected layer.
  def __init__(self, out_fmaps, name=None):
    super().__init__(name=name)
    self.is_built = False
    self.out_fmaps = out_fmaps

  def __call__(self, x):
    if not self.is_built:
      with self.name_scope:  # creates the variable under the module's name scope
        he_init = np.sqrt(2 / x.shape[-1])
        init_val = tf.random.normal([x.shape[-1], self.out_fmaps]) * he_init
        self.w = tf.Variable(init_val, name='dense')
      self.is_built = True
    return tf.matmul(x, self.w)
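
Hypothetical usage of the module above (the names dense1/dense2 are mine): giving each instance a distinct name puts its variables under a distinct name scope, so traced graphs don't collide.

d1 = Dense(4, name='dense1')
d2 = Dense(2, name='dense2')
y = d2(d1(tf.ones([1, 3])))  # variables are created under dense1/ and dense2/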

lixiaojun2914 commented 3 years ago

In TensorFlow v2, the code below can cause a GraphDef reconciliation error.

import tensorflow as tf

writer = tf.summary.create_file_writer('logs')

@tf.function
def foo(x):
  return x ** 2

with writer.as_default():
  tf.summary.trace_on()
  foo(1)
  foo(2)
  tf.summary.trace_export("foo", step=0)  # TensorBoard's graph plugin later fails to reconcile the two traced GraphDefs

Depending on the argument, tf.function (really, AutoGraph) creates ops whose names are unique within a GraphDef but not globally unique. In the example above, two GraphDefs (one from foo(1) and another from foo(2)) will be written out, and they can collide badly in names and content.

In such cases, instead of showing incorrect graph content, TensorBoard throws an error.

You can resolve this by dropping @tf.function from the individual steps and running all of them inside a single tf.function, so only one graph is traced:

def foo(x):
  return x ** 2

@tf.function
def foooo(x1, x2):
  foo(x1)
  foo(x2)

with writer.as_default():
  tf.summary.trace_on()
  foooo(1, 2)
  tf.summary.trace_export("foo", step=0)

akewarmayur commented 3 years ago

Getting this error while visualizing the .pb model [screenshot]. I created the "events.out.tfevents.1621934261.6e85c43ac415" file with the following code.

# Note: this is TF1-style code (tf.Session, tf.GraphDef, tf.summary.FileWriter).
import tensorflow as tf
from tensorflow.python.platform import gfile

model_filename = 'model.pb'
with tf.Session() as sess:
    with gfile.FastGFile(model_filename, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        g_in = tf.import_graph_def(graph_def)

LOGDIR = 'op'
train_writer = tf.summary.FileWriter(LOGDIR)
train_writer.add_graph(sess.graph)
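
For anyone on TF 2.x, a sketch of the same import using the tf.compat.v1 shims (an assumption about the setup on my part, since tf.Session and tf.GraphDef are gone from the TF2 namespace):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # the graph APIs below need graph mode

with tf.compat.v1.Session() as sess:
    with tf.io.gfile.GFile('model.pb', 'rb') as f:
        graph_def = tf.compat.v1.GraphDef()
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def)
    writer = tf.compat.v1.summary.FileWriter('op')
    writer.add_graph(sess.graph)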

ConstantinVasilev commented 2 years ago

Could someone please advise on solving the malformed Op graph error I am getting?

I'm running a custom Keras model with the TensorBoard callback. The Conceptual graph is generated; the Op graph, however, returns: Error: Malformed GraphDef. I tried some existing suggestions related to potential naming conflicts and using name_scope, but to no avail.

import tensorflow as tf
from datetime import datetime

logdir = 'logs/func/' + datetime.now().strftime("%Y%m%d-%H%M%S")

class MCLayer(tf.keras.layers.Layer):
  def __init__(self, name=None):
    super(MCLayer, self).__init__(name=name)
    #with tf.name_scope('test1'): 
    self.nT = tf.constant(400)
    self.n = tf.constant(100000)
    self.dt = tf.constant(1/365)
    self.drift = tf.constant(0.08)
    self.sigma = tf.constant(0.1)

  #@tf.function
  def call(self, inputs):
    #with tf.name_scope('test2'):    
    dWt = tf.random.normal(mean=0, stddev=tf.math.sqrt(self.dt), shape=[self.nT, self.n])
    dYt = self.drift*self.dt + self.sigma*dWt
    C = tf.cumsum(dYt, axis=0)
    S = tf.exp(C)
    A = tf.reduce_mean(S, axis=0)
    P = tf.reduce_mean(tf.maximum(A - inputs, 0))
    return P

input_layer = tf.keras.layers.Input(shape=(1,), name='input_layer')
output_layer = MCLayer(name='output_layer')(input_layer)

model = tf.keras.models.Model(input_layer, output_layer, name='SomeModel')
model.compile()
result = model.predict(tf.constant(1.0, shape=(1,)), callbacks=[tf.keras.callbacks.TensorBoard(log_dir=logdir)])

AndreyOrb commented 2 years ago

I got a Keras SavedModel from another data scientist and want to see the graph in TensorBoard. I'm facing the same Malformed GraphDef issue, using TF 2.8.

import tensorflow as tf
import numpy as np

model = tf.keras.models.load_model(model_path)

tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='C:/log_tf2',
    update_freq=1,
    histogram_freq=1,
    write_graph=True,
    write_images=True
)
tensorboard_callback.set_model(model)

result = model.predict(
    {
        'a': tf.constant(np.random.rand(1))
    },
    callbacks=[tensorboard_callback],
    verbose=1)

A new "train" folder was created in log_dir, containing only one tiny events.out.tfevents.XXXXXXXX.v2 file (14KB)

TF1 usually produced a log directory with a large model log (its size was comparable to that of a frozen graph).

nfelt commented 2 years ago

Sorry to hear you're having trouble with this. However, we won't really be able to debug these without more detail about the actual GraphDef that caused the issue, preferably as a GraphDef pbtxt or an events.out.tfevents file. If the graph is sensitive and can't be shared, we unfortunately won't be able to get much further, but you can try looking at the JavaScript console to see if there is any more detail about the error message there.
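
For reference, a minimal sketch (assuming TF 2.x; the function f is a stand-in) of dumping a traced graph as a pbtxt that could be attached to a report like this:

import tensorflow as tf

@tf.function
def f(x):
  return x * 2

# Trace into a concrete graph and write it out as a text protobuf.
graph_def = f.get_concrete_function(
    tf.TensorSpec([None], tf.float32)).graph.as_graph_def()
tf.io.write_graph(graph_def, 'debug', 'graph.pbtxt', as_text=True)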

As for the differences between TF1 and TF2 file sizes, that might be expected depending on the graph contents - again, it would be hard to say anything more without knowing the specific graph and the specific files in question.

raphi1790 commented 2 years ago

I was facing this issue as well. I found that switching verbose from 2 to 1 in the model.fit() call solved the problem; this might help somebody here, too. Since I'm unsure whether this behaviour is intended, I created an issue for it (see here: https://github.com/tensorflow/tensorboard/issues/5745).
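
A minimal, self-contained sketch of that workaround (the model and data here are placeholders of mine):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')

# verbose=1 (progress bar) rather than verbose=2, alongside the TensorBoard callback.
model.fit(np.random.rand(32, 4), np.random.rand(32, 1), epochs=2, verbose=1,
          callbacks=[tf.keras.callbacks.TensorBoard(log_dir='logs/fit')])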

AndreyOrb commented 2 years ago

I was facing this issue as well. I found that switching verbose from 2 to 1 in the model.fit() call solved the problem; this might help somebody here, too. Since I'm unsure whether this behaviour is intended, I created an issue for it (see here: #5745).

Switching verbose from 2 to 1 resolved the problem. Thanks!

monokle commented 1 year ago

I was facing this issue as well. I found that switching verbose from 2 to 1 in the model.fit() call solved the problem; this might help somebody here, too. Since I'm unsure whether this behaviour is intended, I created an issue for it (see here: #5745).

Thank you!! Based on your comment, I switched from verbose=0 to verbose=1, which resolved the issue as well.