google / automl

Google Brain AutoML
Apache License 2.0

weights in ckpt != weights in pb #314

Closed vicwer closed 4 years ago

vicwer commented 4 years ago

When I use freeze_model to save a frozen.pb file, the weights in frozen.pb are not equal to the weights in the ckpt file. How can I get a correct frozen.pb, one that does not contain the preprocess and postprocess ops?

mingxingtan commented 4 years ago

If you want the frozen graph to include preprocessing and postprocessing, please use saved_model:

python model_inspect.py --runmode=saved_model --model_name=efficientdet-d0 \
  --ckpt_path=efficientdet-d0 --saved_model_dir=/tmp/benchmark/

You will get both a saved_model and a frozen pb file, "efficientdet-d0_frozen.pb", under /tmp/benchmark/.
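
For reference, here is a minimal sketch of loading that frozen graph for inference (TF 1.x). The tensor names 'image_arrays:0' and 'detections:0' are assumptions; list the graph operations to confirm the actual input/output names in your frozen pb:

import numpy as np
import tensorflow as tf  # TF 1.x

# Load the frozen graph produced by --runmode=saved_model.
graph_def = tf.GraphDef()
with tf.gfile.GFile('/tmp/benchmark/efficientdet-d0_frozen.pb', 'rb') as f:
  graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
  tf.import_graph_def(graph_def, name='')

# NOTE: assumed tensor names; print [op.name for op in graph.get_operations()]
# to find the real input/output names.
images = graph.get_tensor_by_name('image_arrays:0')
detections = graph.get_tensor_by_name('detections:0')

with tf.Session(graph=graph) as sess:
  out = sess.run(detections,
                 feed_dict={images: np.zeros((1, 512, 512, 3), np.uint8)})
  print(out.shape)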

mingxingtan commented 4 years ago

What do you mean by "the weights in frozen.pb are not equal to the weights in the ckpt file"?

vicwer commented 4 years ago

@mingxingtan Hello, I just want to save a frozen model that contains no preprocessing or postprocessing and has the training nodes removed. At first I added the following to the freeze_model() function:

constant_graph = tf.graph_util.remove_training_nodes(graphdef)
with tf.gfile.GFile('./test2.pb', mode='wb') as f:
  f.write(graphdef.SerializeToString())

I saved the graph with this code, but the inference results were wrong, so I printed the weights of the first conv layer in the ckpt and in the pb I saved, and found they did not match. I then generated the pb several times and found the weight values changed each time. How should I generate a pb file that contains only the network model? Please advise!!

mingxingtan commented 4 years ago

Oh, you mean saved_model generates random weights for the frozen graph? That could be a bug; I will take a look tomorrow.

vicwer commented 4 years ago

@mingxingtan When using --runmode=freeze_model, the weights in output_frozen.pb are random.

d61h6k4 commented 4 years ago

This happened for me as well, and I fixed it by adding restore_model to the function build_and_save_model.

When we use --runmode=freeze_model, we first call build_and_save_model, which builds and saves the model (but does not restore it), and then freeze_model, which restores from the checkpoint created in the previous step.

vicwer commented 4 years ago

This happened for me as well, and I fixed it by adding restore_model to the function build_and_save_model.

When we use --runmode=freeze_model, we first call build_and_save_model, which builds and saves the model (but does not restore it), and then freeze_model, which restores from the checkpoint created in the previous step.

  def build_and_save_model(self):
    """build and save the model into self.logdir."""
    with tf.Graph().as_default(), tf.Session() as sess:
      # Build model with inputs and labels.
      inputs = tf.placeholder(tf.float32, name='input', shape=self.inputs_shape)
      outputs = self.build_model(inputs, is_training=False)

      self.restore_model(
          sess, self.ckpt_path, self.enable_ema, self.export_ckpt)
      # Run the model
      inputs_val = np.random.rand(*self.inputs_shape).astype(float)
      labels_val = np.zeros(self.labels_shape).astype(np.int64)
      labels_val[:, 0] = 1
      sess.run(tf.global_variables_initializer())
      # Run a single train step.
      sess.run(outputs, feed_dict={inputs: inputs_val})
      all_saver = tf.train.Saver(save_relative_paths=True)
      all_saver.save(sess, os.path.join(self.logdir, self.model_name))

      tf_graph = os.path.join(self.logdir, self.model_name + '_train.pb')
      print(tf_graph)
      with tf.io.gfile.GFile(tf_graph, 'wb') as f:
        f.write(sess.graph_def.SerializeToString())

  def restore_model(self, sess, ckpt_path, enable_ema=True, export_ckpt=None):
    """Restore variables from a given checkpoint."""
    sess.run(tf.global_variables_initializer())
    checkpoint = tf.train.latest_checkpoint(ckpt_path)
    if enable_ema:
      ema = tf.train.ExponentialMovingAverage(decay=0.0)
      ema_vars = utils.get_ema_vars()
      var_dict = ema.variables_to_restore(ema_vars)
      ema_assign_op = ema.apply(ema_vars)
    else:
      var_dict = utils.get_ema_vars()
      ema_assign_op = None

    tf.train.get_or_create_global_step()
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver(var_dict, max_to_keep=1)
    saver.restore(sess, checkpoint)

    if export_ckpt:
      print('export model to {}'.format(export_ckpt))
      if ema_assign_op is not None:
        sess.run(ema_assign_op)
      saver = tf.train.Saver(max_to_keep=1, save_relative_paths=True)
      saver.save(sess, export_ckpt)

  def freeze_model(self) -> Tuple[Text, Text]:
    """Freeze model and convert them into tflite and tf graph."""
    with tf.Graph().as_default(), tf.Session() as sess:
      inputs = tf.placeholder(tf.float32, name='input', shape=self.inputs_shape)
      outputs = self.build_model(inputs, is_training=False)

      checkpoint = tf.train.latest_checkpoint(self.logdir)
      logging.info('Loading checkpoint: %s', checkpoint)
      sess.run(tf.global_variables_initializer())
      saver = tf.train.Saver()

      # Restore the Variables from the checkpoint and freeze the Graph.
      saver.restore(sess, checkpoint)

      output_node_names = [node.name.split(':')[0] for node in outputs]
      graphdef = tf.graph_util.convert_variables_to_constants(
          sess, sess.graph_def, output_node_names)
      constant_graph = tf.graph_util.remove_training_nodes(graphdef)
      with tf.gfile.GFile('./test2.pb', mode='wb') as f:
        f.write(graphdef.SerializeToString())

    return graphdef

Like this? I got a bad pb file again. How can I fix it...

d61h6k4 commented 4 years ago

Like this? I got a bad pb file again. How can I fix it...

Yes, like this. Can you show me how you check that the pb is bad? I will try it on my machine.

vicwer commented 4 years ago

Like this? I got a bad pb file again. How can I fix it...

Yes, like this. Can you show me how you check that the pb is bad? I will try it on my machine.

The first conv weights in the ckpt are:

[-4.87630069e-03  1.91047490e-01 -1.95671692e-01  1.71537027e-01
     1.34478956e-01  6.36250526e-03 -9.22628492e-02  3.57320942e-02
    -4.80589449e-01 -2.97820047e-02  2.64668837e-02 -5.60815260e-03
     1.07843883e-01 -1.27487212e-01 -4.54808846e-02 -1.90424174e-01
    -5.76257929e-02 -1.19322196e-01  1.36956245e-01 -8.02704543e-02
     4.42219973e-02  4.19331610e-01  6.55569062e-02  5.64435497e-03
    -1.54584832e-02 -4.09075379e-01 -3.01529542e-02  1.83883786e-01
     0.00000000e+00  8.66569579e-04  2.26040184e-03 -2.51607209e-01]
   [-7.85537064e-03  2.65666872e-01 -3.17041487e-01  2.15058267e-01
    -1.01113804e-02  6.40092045e-02 -1.23095065e-01 -9.75322127e-02
     3.03800255e-01 -1.64337233e-02 -4.44775224e-02  3.06249857e-02
    -3.98499578e-01  4.63367030e-02  2.40166113e-02 -2.93613136e-01
    -7.68478364e-02  3.95949394e-01  2.20182478e-01 -9.00086984e-02
     1.49498731e-02 -3.33532870e-01  3.93319838e-02  8.59814063e-02
    -3.74482721e-02 -5.93542755e-01 -2.42474973e-02  3.13305110e-01
     5.58793545e-09 -2.71810889e-02  5.64225018e-04 -3.97311747e-01]
   [-3.93877923e-03  9.04522687e-02 -1.20885849e-01  1.38370708e-01
    -4.04852629e-02 -2.61897966e-02  5.66635281e-03  5.76920621e-02
     2.13139594e-01 -6.35958761e-02 -1.51988193e-01 -4.09245640e-02
     2.97569394e-01  6.15806207e-02  1.48216933e-02 -1.18639633e-01
    -4.63466793e-02 -2.48116553e-01  3.81280184e-02 -1.56261504e-01
     3.79605144e-02 -8.86044875e-02  1.43827759e-02  2.19335556e-02
    -4.62291390e-03 -2.35655993e-01 -4.72586416e-03  5.23514487e-02
     7.45058060e-09  1.19470134e-02 -2.49268599e-02 -1.17885768e-01]]...

The first conv weights in the frozen.pb produced by freeze_model are:

[ 1.08125947e-01  3.19452435e-02  1.06105343e-01  1.18609518e-01
    -6.08795695e-03 -3.26514803e-02 -2.14004964e-02 -1.39149874e-01
     8.97422731e-02 -9.36664641e-02  1.20425388e-01 -2.73718610e-02
    -7.33160526e-02  6.32562935e-02  1.62048489e-01  4.39646989e-02
    -7.42008612e-02 -2.33811028e-02 -1.02975719e-01  2.02117667e-01
    -1.42768174e-01  4.59212735e-02  8.83340389e-02 -6.30375147e-02
     1.35231346e-01  1.93540957e-02 -2.39386499e-01 -8.62637758e-02
    -6.93833902e-02  2.16089543e-02  1.94270432e-01 -3.26626189e-02]
   [ 1.95291221e-01 -5.66987544e-02  1.03582345e-01  9.95950997e-02
     1.95503212e-03 -1.46431267e-01 -2.22153887e-02 -2.48681288e-02
     1.04380846e-01 -1.05247252e-01  7.66960010e-02 -4.96273339e-02
     8.13698396e-02 -8.98341357e-04 -2.44052848e-03 -3.53747234e-02
    -6.96010608e-03  2.13769823e-02  4.10276465e-02 -5.72910421e-02
    -4.68641222e-02 -3.37347649e-02  2.18234062e-01 -5.70717007e-02
     8.68638754e-02  1.43651869e-02  2.72786058e-03  7.66072422e-03
     5.13708703e-02  6.32534623e-02  3.28366086e-02  2.56236419e-02]
   [ 2.42580883e-02  9.87463444e-02  8.40714388e-03  1.82739556e-01
    -3.20910737e-02  8.69490132e-02 -3.43996137e-02  1.31102338e-01
    -9.05711800e-02  4.10678908e-02 -1.27581343e-01 -1.34813607e-01
     5.65398932e-02 -7.53058959e-03  1.07006781e-01  4.35651504e-02
     7.01373965e-02  1.64547861e-01  1.88628182e-01 -1.40261296e-02
     9.09864232e-02  8.23299661e-02  2.43235044e-02  6.13936447e-02
    -1.27118155e-01 -2.77505815e-02 -2.21706461e-02  7.59808570e-02
     1.87907904e-01 -5.63655198e-02 -2.97567975e-02  2.00874433e-02]]...

Can you show me your fixed code?

d61h6k4 commented 4 years ago

I don't have access to my work machine right now, which is why I can't show the code, but I can do it tomorrow.

vicwer commented 4 years ago

You can check the weights in your pb file like this:

from tensorflow.python.platform import gfile
from tensorflow.python.framework import tensor_util
from tensorflow.core.framework import graph_pb2
graph_path = '../test.pb'

def values_from_const(node_def):
    if node_def.op != "Const":
        raise ValueError("Node named '%s' should be a Const op for values_from_const." % node_def.name)
    input_tensor = node_def.attr["value"].tensor
    tensor_value = tensor_util.MakeNdarray(input_tensor)
    return tensor_value

def read_pb():
    input_graph_def = graph_pb2.GraphDef()
    with gfile.Open(graph_path, "rb") as f:
        data = f.read()
        input_graph_def.ParseFromString(data)

    for node in input_graph_def.node:
        print(node.name)
        print(node.op)
        if node.op == "Const":
            if 'efficientnet-b0/stem/conv2d/kernel' in node.name:
                weight = values_from_const(node)
                print(weight.shape)
                print(weight)

if __name__ == "__main__":
    read_pb()

If it is the same as my ckpt weights log, your conversion is correct; please tell me how you did it, thanks.
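
For comparison, here is a minimal sketch (checkpoint directory and variable name assumed) that reads the same kernel straight from the checkpoint; with EMA enabled the value may live under the .../ExponentialMovingAverage name instead:

import tensorflow as tf  # TF 1.x

ckpt_dir = './efficientdet-d0'  # assumed checkpoint directory
reader = tf.train.load_checkpoint(ckpt_dir)

name = 'efficientnet-b0/stem/conv2d/kernel'
ema_name = name + '/ExponentialMovingAverage'

# Prefer the EMA shadow variable if it exists, since inference uses EMA weights.
key = ema_name if reader.has_tensor(ema_name) else name
weight = reader.get_tensor(key)
print(key, weight.shape)
print(weight)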

vicwer commented 4 years ago

@d61h6k4 did you save a pb model that does not include preprocessing and postprocessing?

vicwer commented 4 years ago

I have no idea, and I don't know the reason... I added the freezing code to inference.py and froze the model with --runmode=infer, and it worked.

If your code works, please show me, thank you~ @d61h6k4 @mingxingtan

d61h6k4 commented 4 years ago

Now it works for me. I restore the model not in build_and_save_model but in freeze_model. The patch is here:

diff --git a/efficientdet/model_inspect.py b/efficientdet/model_inspect.py
index 390e8ba..a05fbbe 100644
--- a/efficientdet/model_inspect.py
+++ b/efficientdet/model_inspect.py
@@ -325,14 +325,8 @@ class ModelInspector(object):
     with tf.Graph().as_default(), tf.Session() as sess:
       inputs = tf.placeholder(tf.float32, name='input', shape=self.inputs_shape)
       outputs = self.build_model(inputs, is_training=False)
-
-      checkpoint = tf.train.latest_checkpoint(self.logdir)
-      logging.info('Loading checkpoint: %s', checkpoint)
-      saver = tf.train.Saver()
-
-      # Restore the Variables from the checkpoint and frozen the Graph.
-      saver.restore(sess, checkpoint)
-
+      self.restore_model(sess, self.ckpt_path, self.enable_ema)
+      
       output_node_names = [node.op.name for node in outputs]
       graphdef = tf.graph_util.convert_variables_to_constants(
           sess, sess.graph_def, output_node_names)
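
Applied to freeze_model, the function body looks roughly like this (reconstructed from the diff above; the rest of the function is unchanged):

  def freeze_model(self) -> Tuple[Text, Text]:
    """Freeze model and convert them into tflite and tf graph."""
    with tf.Graph().as_default(), tf.Session() as sess:
      inputs = tf.placeholder(tf.float32, name='input', shape=self.inputs_shape)
      outputs = self.build_model(inputs, is_training=False)

      # Restore directly from the original checkpoint (self.ckpt_path)
      # instead of the unrestored copy saved earlier into self.logdir.
      self.restore_model(sess, self.ckpt_path, self.enable_ema)

      output_node_names = [node.op.name for node in outputs]
      graphdef = tf.graph_util.convert_variables_to_constants(
          sess, sess.graph_def, output_node_names)
      # ... the remainder of the function (writing out the frozen graph) is unchanged.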

@d61h6k4 did you save a pb model that does not include preprocessing and postprocessing?

yes

mingxingtan commented 4 years ago

I updated model_inspect; the weights should now be loaded from the checkpoint if you specify --ckpt_path=xxx.
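
For example, with the same flags as used earlier in this thread:

python model_inspect.py --runmode=freeze_model --model_name=efficientdet-d0 \
  --ckpt_path=efficientdet-d0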

Thanks for the idea, @d61h6k4 !