tensorflow / models

Models and examples built with TensorFlow

[deeplab] Eval.py not showing any result #6567

Open RubenGarciaPaez opened 5 years ago

RubenGarciaPaez commented 5 years ago

System information

Describe the problem

Hello, my problem is that when trying to run eval.py it never shows any kind of result. The log says it starts evaluating, but there is no feedback about the mIOU it reaches. I've tried running local_test.sh and the mobilenet_v2 version, and it happens with both scripts.

Also, when I visualize the images using vis.py, it seems the model hasn't trained that well, because all the images are black with only small traces of color.

Source code / logs

INFO:tensorflow:Waiting for new checkpoint at /root/workspace/TFG_Code/TFG_Semantic_segmentation/src/CNN/DeepLab/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train
INFO:tensorflow:Found new checkpoint at /root/workspace/TFG_Code/TFG_Semantic_segmentation/src/CNN/DeepLab/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10
INFO:tensorflow:Graph was finalized.
2019-04-12 12:39:07.213731: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-04-12 12:39:09.598712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:1c:00.0
totalMemory: 11.91GiB freeMemory: 11.76GiB
2019-04-12 12:39:09.741244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3f:00.0
totalMemory: 11.91GiB freeMemory: 11.76GiB
2019-04-12 12:39:09.884623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:40:00.0
totalMemory: 10.92GiB freeMemory: 10.77GiB
2019-04-12 12:39:10.046895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:1d:00.0
totalMemory: 15.90GiB freeMemory: 15.79GiB
2019-04-12 12:39:10.051374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-04-12 12:39:12.112522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-12 12:39:12.112567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 2 3
2019-04-12 12:39:12.112577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y Y N
2019-04-12 12:39:12.112600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N Y N
2019-04-12 12:39:12.112608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   Y Y N N
2019-04-12 12:39:12.112615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   N N N N
2019-04-12 12:39:12.114313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11378 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:1c:00.0, compute capability: 6.1)
2019-04-12 12:39:12.269795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11378 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:3f:00.0, compute capability: 6.1)
2019-04-12 12:39:12.422934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10421 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:40:00.0, compute capability: 6.1)
2019-04-12 12:39:12.582819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 15296 MB memory) -> physical GPU (device: 3, name: Quadro P5000, pci bus id: 0000:1d:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /root/workspace/TFG_Code/TFG_Semantic_segmentation/src/CNN/DeepLab/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2019-04-12-12:39:15
INFO:tensorflow:Visualizing on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Waiting for new checkpoint at /root/workspace/TFG_Code/TFG_Semantic_segmentation/src/CNN/DeepLab/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train
INFO:tensorflow:Found new checkpoint at /root/workspace/TFG_Code/TFG_Semantic_segmentation/src/CNN/DeepLab/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10
INFO:tensorflow:Starting visualization at 2019-04-12-12:40:31
INFO:tensorflow:Visualizing with model /root/workspace/TFG_Code/TFG_Semantic_segmentation/src/CNN/DeepLab/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10
INFO:tensorflow:Graph was finalized.
2019-04-12 12:40:33.275125: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

There is also a long message at the beginning regarding incomplete shapes on ops:

4 ops no flops stats due to incomplete shapes. Parsing Inputs...

=========================Options=============================
-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              1
-min_occurrence             0
-step                       -1
-order_by                   float_ops
-account_type_regexes       .
-start_name_regexes         .
-trim_name_regexes
-show_name_regexes          .*
-hide_name_regexes
-account_displayed_op_only  true
-select                     float_ops
-output                     stdout:

I've looked everywhere but I can't find anything about why this is happening. Thank you!

tensorflowbutler commented 5 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Bazel version
CUDA/cuDNN version
GPU model and memory

RubenGarciaPaez commented 5 years ago

@tensorflowbutler Ok, it's done. This problem has happened to me both on Windows and on Ubuntu.

Ekko1992 commented 5 years ago

@RubenGarciaPaez Have you fixed it? I ran into the same problem. I'm using Ubuntu, and no results show up when running eval.py.

RubenGarciaPaez commented 5 years ago

I haven't been able to check it yet, but I have been told to try switching to the r1.12 branch, according to #4523.

Ekko1992 commented 5 years ago

I'm using 1.12, so it seems that the TF version is not the problem.

RubenGarciaPaez commented 5 years ago

I am talking about the git branch, not the TF version, if that's what you are referring to.

Ekko1992 commented 5 years ago

I have switched to the r1.12.0 branch but still hit the same problem. I can successfully get some mask results when running vis.py, though.

ShalamovRoman commented 5 years ago

I have a similar problem; the last output line is "INFO:tensorflow:Starting evaluation at ...". I'm trying to train on my own dataset via Google Colab. vis.py works correctly, though.

RubenGarciaPaez commented 5 years ago

I was finally able to check it, and as @Ekko1992 said, the problem is still there. I don't know why this happens. I'm not sure if anyone else sees the same thing, but all the event files generated by eval.py have the same size (6861 KB).

supersai007 commented 5 years ago

I have a similar problem, as @ShalamovRoman described. After running eval.py, the last output line is "INFO:tensorflow:Starting evaluation at ...". After a long time it starts vis.py without ever showing whether the evaluation ended or which batch it was evaluating in between. Even the mIOU values are not shown.

RubenGarciaPaez commented 5 years ago

Okay, so in r1.1, even though it does not show any kind of feedback while performing the evaluation, if you visualize the events file that eval.py generates with TensorBoard, it does show the mIOU.

SorourMo commented 5 years ago

I have the same problem: eval.py does not show the mIOU. The last line of output is INFO:tensorflow:Starting evaluation at 2019-05-10-23:44:40 and then it exits with no further output. vis.py works properly, though. I am using TensorFlow 1.13.1 and the master branch.

supersai007 commented 5 years ago

As @RubenGarciaPaez said, the mIoU values are logged into the TensorBoard event file, and we can find those values by parsing or visualizing the tags in that file. However, eval.py doesn't show whether the evaluation ended or not. Not sure why.
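
For anyone who wants to pull those numbers out without opening the TensorBoard UI, below is a minimal sketch that reads the scalar tags from an event directory using TensorBoard's EventAccumulator. The path is hypothetical and the exact tag names depend on your eval.py run, so print the available tags first:

    # Minimal sketch: read scalar summaries (e.g. the mIOU) out of the event files
    # that eval.py writes. Requires the tensorboard package; the path below is
    # hypothetical, and the tag names depend on your run.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    eval_logdir = '/path/to/pascal_voc_seg/exp/train_on_trainval_set/eval'
    acc = EventAccumulator(eval_logdir)
    acc.Reload()  # scan the directory and load all events

    print('scalar tags:', acc.Tags()['scalars'])
    for tag in acc.Tags()['scalars']:
        for event in acc.Scalars(tag):
            print(tag, 'step', event.step, 'value', event.value)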

AmeetR commented 5 years ago

I'm also having this issue now, any thoughts?

FrancisDacian commented 5 years ago

I still have this kind of problem; is there any solution now?

AmeetR commented 5 years ago

I don't know. How would I go about visualizing the tags in the log file? The event files aren't human-readable, so I'm not sure how to look at them.

EG101 commented 5 years ago

I just put

tensorboard --logdir=${EVAL_LOGDIR}

at the end of my script without any indentation.

As per the official DeepLab tutorial, TensorBoard is supposed to be given the paths to the train, eval, and vis directories; however, that didn't work for me. If anyone comes up with a better solution, please share.

https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/cityscapes.md
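
For reference, a minimal sketch of that workaround (the path is hypothetical; point --logdir at whatever directory your eval.py run writes its event files to):

    # Run after eval.py has produced its event files; adjust the path to your setup.
    EVAL_LOGDIR="/path/to/pascal_voc_seg/exp/train_on_trainval_set/eval"
    tensorboard --logdir="${EVAL_LOGDIR}" --port 6006

Then open http://localhost:6006 and the mIOU scalar should show up under the Scalars tab.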

behnamnkp commented 5 years ago

Same problem. I checked on several datasets including the samples available on the website.

mei123hao commented 5 years ago

I have the same problem; it always waits at INFO:tensorflow:Starting evaluation at 2019-11-04-02:31:03. How can I resolve it? Help!

shanyucha commented 5 years ago

@mei123hao Try commenting out the line "eval_ops=list(metrics_to_updates.values())," in eval.py. That would fix the problem:

    checkpoint_dir=FLAGS.checkpoint_dir,
    master=FLAGS.master,
    eval_ops=list(metrics_to_updates.values()),
    max_number_of_evaluations=num_eval_iters,
    hooks=hooks,
    eval_interval_secs=FLAGS.eval_interval_secs)
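
For context, those arguments look like a call to tf.contrib.training.evaluate_repeatedly (inferred from the argument names; the exact surrounding code may differ between eval.py revisions). The toy, self-contained TF 1.x sketch below shows what eval_ops does in that loop: the ops listed there run once per evaluation step and are what update the streaming metric, so commenting the line out may avoid the hang but also means the metric accumulator is never updated for that run:

    # Toy TF 1.x sketch of the evaluation loop eval.py appears to be built on.
    # A streaming mean stands in for the mIOU metric; nothing here is DeepLab code.
    import tempfile
    import tensorflow as tf

    ckpt_dir = tempfile.mkdtemp()

    # Write a dummy checkpoint so the evaluation loop has something to restore.
    with tf.Graph().as_default():
        dummy = tf.get_variable('dummy', initializer=0.0)
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            tf.train.Saver().save(sess, ckpt_dir + '/model.ckpt', global_step=0)

    # Evaluation graph: eval_ops is what updates the metric on every step.
    with tf.Graph().as_default():
        dummy = tf.get_variable('dummy', initializer=0.0)  # matches the checkpoint
        values = tf.constant([1.0, 2.0, 3.0])
        mean_value, update_op = tf.metrics.mean(values)

        final = tf.contrib.training.evaluate_repeatedly(
            checkpoint_dir=ckpt_dir,
            eval_ops=[update_op],             # run once per evaluation step
            final_ops={'mean': mean_value},   # fetched after the last step
            hooks=[tf.contrib.training.StopAfterNEvalsHook(1)],
            max_number_of_evaluations=1)
        print('final metric:', final)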