Sync "https://github.com/jakeret/tf_unet/pull/202" with master and resolve conflicts

ashahba commented 5 years ago

When using the image gcr.io/deeplearning-platform-release/tf-cpu.1-14 and following these steps: https://github.com/IntelAI/models/blob/v1.4.0/benchmarks/image_segmentation/tensorflow/unet/README.md, I get the following error:

2019-07-09 17:03:43.718942: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2019-07-09 17:03:43.771372: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
W0709 17:03:43.853285 139741926975296 deprecation_wrapper.py:119] From /workspace/models/tf_unet/unet.py:301: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

W0709 17:03:43.874177 139741926975296 deprecation.py:323] From /root/miniconda3/lib/python3.5/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
  File "/workspace/benchmarks/image_segmentation/tensorflow/unet/inference/fp32/unet_infer.py", line 78, in <module>
    prediction = net.predict(arg_parser.parse_args().ckpt_path, x_test)
  File "/workspace/models/tf_unet/unet.py", line 274, in predict
    self.restore(sess, model_path)
  File "/workspace/models/tf_unet/unet.py", line 302, in restore
    saver.restore(sess, model_path)
  File "/root/miniconda3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1278, in restore
    compat.as_text(save_path))
ValueError: The passed save_path is not a valid checkpoint: /checkpoints/model.cpkt
Ran inference with batch size 1
Log location outside container: /jenkins/workspace/Intel-Models-Benchmark-fp32-Trigger/intel-models/benchmarks/common/tensorflow/logs/benchmark_unet_inference_fp32_20190709_170331.log
nrvalgo_jenkinsadm@aipg-fm-skx-48:/jenkins/workspace/Intel-Models-Benchmark-fp32-Trigger/intel-models/benchmarks$ ls $CHECKPOINT_DIR/
checkpoint  events.out.tfevents.1548972182.4e4b03cdde24  model.ckpt.data-00000-of-00001  model.ckpt.index  model.ckpt.meta

ashahba commented 5 years ago

@jakeret this is basically just bringing #202 up to date with master. I also realized that the issue with https://github.com/IntelAI/models/blob/v1.4.0/benchmarks/image_segmentation/tensorflow/unet/README.md was that I was using checkpoint_name=model.cpkt, not realizing that it's now checkpoint_name=model.ckpt. I have fixed our docs.
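
For anyone hitting the same ValueError, here is a minimal sketch of a pre-flight check that catches a mistyped checkpoint prefix before calling saver.restore. It assumes the checkpoint was written in the TensorFlow V2 checkpoint format, where a prefix such as model.ckpt is accompanied by a model.ckpt.index file (as in the directory listing above). The helper names checkpoint_prefix_exists and suggest_prefixes are hypothetical and are not part of tf_unet or TensorFlow.

```python
import os


def checkpoint_prefix_exists(prefix):
    """Return True if an index file exists for this checkpoint prefix.

    Assumption: V2-format checkpoints, which store '<prefix>.index'
    next to the '<prefix>.data-*' shards.
    """
    return os.path.isfile(prefix + ".index")


def suggest_prefixes(checkpoint_dir):
    """List candidate prefixes in a directory by stripping '.index'."""
    suffix = ".index"
    return sorted(
        os.path.join(checkpoint_dir, name[: -len(suffix)])
        for name in os.listdir(checkpoint_dir)
        if name.endswith(suffix)
    )
```

Calling checkpoint_prefix_exists("/checkpoints/model.cpkt") before restoring would have returned False here, and suggest_prefixes("/checkpoints") would have pointed at the model.ckpt prefix instead.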

Thanks.

ashahba commented 5 years ago

@mpjlu would you also please review and provide feedback if needed.

Thanks.

jakeret commented 5 years ago

Hi @ashahba, thank you for your contribution. I wasn't aware that this repo is being used in IntelAI benchmarks, nice.

I hadn't merged #202 for two reasons:

1. The thread handling should not be part of the PR, as it has nothing to do with the dropout.
2. In my understanding, if we set keep_prob to a value other than 1 (e.g. 0.5), it can't be changed for validation or prediction (where we don't want any regularization), as it is a fixed part of the graph. Or am I missing something?
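
To illustrate the second point, here is a minimal pure-Python sketch of inverted dropout in which keep_prob is a call-time argument rather than a constant baked into the model. This is not the code from #202 or from tf_unet; the function name dropout and its signature are assumptions made for illustration only. In a TF 1.x graph the analogous approach would be to supply keep_prob as a fed input, so that training can use e.g. 0.5 while validation and prediction use 1.0.

```python
import random


def dropout(values, keep_prob, rng=None):
    """Inverted dropout over a list of floats (illustrative only).

    Each value is kept with probability keep_prob and scaled by
    1 / keep_prob; otherwise it is replaced by 0.0.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if keep_prob == 1.0:
        # No regularization: what we want for validation and prediction.
        return list(values)
    rng = rng or random.Random()
    scale = 1.0 / keep_prob
    return [v * scale if rng.random() < keep_prob else 0.0 for v in values]
```

Because keep_prob is chosen per call, the same function can be used with 0.5 during training and 1.0 at prediction time. If the value were fixed when the model was built, that switch would not be possible.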

ashahba commented 5 years ago

Thanks @jakeret, that sounds great. In the meantime I'm unblocked, but I'll keep an eye out for any activity on #202.

Sync "https://github.com/jakeret/tf_unet/pull/202" with master and resolve conflicts #276