kwotsin / TensorFlow-ENet

TensorFlow implementation of ENet
MIT License

How to train it on another dataset? How can I handle the checkpoint? #18

Open changlinzhang opened 6 years ago

changlinzhang commented 6 years ago

Hi, kwotsin! Thanks for your work. I want to train it on another dataset (30 classes instead of 12). I believe I changed all the relevant code, but I hit this error:
2018-01-11 17:23:22.187077: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Input to reshape is a tensor with 172800 values, but the requested shape has 4320000
Could it be caused by the checkpoint? How can I deal with this problem?

The complete output is as follows:
========= Median Frequency Balancing Class Weights =========
[6.397542327061094e-05, 6.7097626201794152e-05, 0.024400273767542283, 0.041269401614453756, 5.5506352412896832e-05, 0.076635711324892844, 0.069381256179271614, 3.472654196521944e-05, 0.00042760164428717635, 0.00012440287198120186, 0.090233329139976615, 0.12489918060211183, 0.0013708685331902757, 6.0827765291491662e-05, 0.073240128809290553, 0.35775514055273316, 0.64257341685305103, 0.90968868010977944, 0.37688909228806228, 0.44248634385452756, 0.00042529101230680852, 0.30566376891079095, 0.28941152643298945, 3.9464190165066867e-05, 0.26421036878629223, 0.42250536299160169, 0.5089356784417215, 0.00024742224929701886, 0.47265314480960613, 0.0]
2018-01-11 17:22:23.528595: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-01-11 17:22:23.528689: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-01-11 17:22:23.528720: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-01-11 17:22:29.254935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla K40c major: 3 minor: 5 memoryClockRate (GHz) 0.745 pciBusID 0000:02:00.0 Total memory: 11.17GiB Free memory: 11.10GiB
2018-01-11 17:22:29.503633: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x1e106f80 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-01-11 17:22:29.504523: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-01-11 17:22:29.505315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: Tesla K40c major: 3 minor: 5 memoryClockRate (GHz) 0.745 pciBusID 0000:84:00.0 Total memory: 11.17GiB Free memory: 11.10GiB
2018-01-11 17:22:29.505448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 1
2018-01-11 17:22:29.505491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 0
2018-01-11 17:22:29.505540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1
2018-01-11 17:22:29.505685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y N
2018-01-11 17:22:29.505705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: N Y
2018-01-11 17:22:29.505740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:02:00.0)
2018-01-11 17:22:29.505779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40c, pci bus id: 0000:84:00.0)
2018-01-11 17:22:34.391659: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1368 get requests, put_count=1100 evicted_count=1000 eviction_rate=0.909091 and unsatisfied allocation rate=1
2018-01-11 17:22:34.391731: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:Saving checkpoint to path ./log/original/model.ckpt
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Epoch 1/300
INFO:tensorflow:Current Learning Rate: [0.00050000002]
INFO:tensorflow:global step 1: loss: 0.3121 (4.79 sec/step) Current Streaming Accuracy: 0.0000 Current Mean IOU: 0.0000
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0000 Validation Mean IOU: 0.0000 (2.24 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0209 Validation Mean IOU: 0.0030 (1.10 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0207 Validation Mean IOU: 0.0028 (1.26 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0227 Validation Mean IOU: 0.0033 (1.23 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0220 Validation Mean IOU: 0.0035 (1.24 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0208 Validation Mean IOU: 0.0033 (1.28 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0201 Validation Mean IOU: 0.0033 (1.22 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0198 Validation Mean IOU: 0.0032 (1.25 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0197 Validation Mean IOU: 0.0032 (1.24 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0031 (1.21 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0197 Validation Mean IOU: 0.0031 (1.18 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0031 (1.21 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0031 (1.39 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0032 (1.23 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0032 (1.18 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0193 Validation Mean IOU: 0.0032 (1.16 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0191 Validation Mean IOU: 0.0031 (1.41 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0193 Validation Mean IOU: 0.0031 (1.26 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0195 Validation Mean IOU: 0.0032 (1.43 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0197 Validation Mean IOU: 0.0032 (1.32 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0202 Validation Mean IOU: 0.0033 (1.34 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0204 Validation Mean IOU: 0.0034 (1.33 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0203 Validation Mean IOU: 0.0034 (1.21 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0206 Validation Mean IOU: 0.0034 (1.36 sec/step)
2018-01-11 17:23:21.808311: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: assertion failed: [all dims of \'image.shape\' must be > 0.]
[[Node: assert_positive_11/assert_less/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:localhost/replica:0/task:0/cpu:0"](assert_positive_11/assert_less/All/_5795, assert_positive_11/assert_less/Assert/Assert/data_0)]]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, assertion failed: [all dims of \'image.shape\' must be > 0.]
[[Node: assert_positive_11/assert_less/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:localhost/replica:0/task:0/cpu:0"](assert_positive_11/assert_less/All/_5795, assert_positive_11/assert_less/Assert/Assert/data_0)]]
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0207 Validation Mean IOU: 0.0035 (1.19 sec/step)
2018-01-11 17:23:22.187077: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Input to reshape is a tensor with 172800 values, but the requested shape has 4320000
[[Node: Reshape_5 = Reshape[T=DT_UINT8, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](batch_1/_5971, Reshape_5/shape)]]
2018-01-11 17:23:22.197319: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Input to reshape is a tensor with 172800 values, but the requested shape has 4320000
[[Node: Reshape_5 = Reshape[T=DT_UINT8, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](batch_1/_5971, Reshape_5/shape)]]
Traceback (most recent call last):
  File "train_enet.py", line 340, in <module>
    run()
  File "train_enet.py", line 337, in run
    plt.savefig(photodir+"/image" + str(i))
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1063, in _single_operation_run
    target_list_as_strings, status, None)
  File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [all dims of \'image.shape\' must be > 0.]
[[Node: assert_positive_11/assert_less/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:localhost/replica:0/task:0/cpu:0"](assert_positive_11/assert_less/All/_5795, assert_positive_11/assert_less/Assert/Assert/data_0)]]
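
A note on the numbers in the reshape error, offered as a reading aid rather than a diagnosis: 172800 equals 360 x 480, and 4320000 is exactly 25 times that. Assuming the inputs are 360x480 images (the CamVid size, which is an assumption and not stated in the post), the failing Reshape_5 node seems to have asked for a batch of 25 single-channel images but received only one. That would be consistent with the input queue shutting down after the earlier `image.shape` assertion failed, so the image files may deserve a look before the checkpoint does. The helper name below is hypothetical.

```python
def images_in_tensor(num_values, height, width, channels=1):
    """Return how many whole images of the given size fit in num_values, or None."""
    per_image = height * width * channels
    if num_values % per_image != 0:
        return None  # not a whole number of images of this size
    return num_values // per_image


# Values copied from the error message; 360x480 is an assumed image size.
received = images_in_tensor(172800, 360, 480)
requested = images_in_tensor(4320000, 360, 480)
print("received:", received, "requested:", requested)
```

If your own images have a different size, substitute it for 360 and 480; a `None` result means the assumed size does not fit the numbers.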

ghost commented 6 years ago

Have you figured out how it works? I trained on my own dataset as well, but the accuracy is very low.

kangyang94 commented 5 years ago

Hello @changlinzhang @kwotsin, could you tell me how to use the files in the checkpoint folder as a pretrained model to train on my own dataset?
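
One common approach to this in TensorFlow 1.x, sketched here under assumptions: restore every variable from the checkpoint except those whose shape depends on the number of classes, and let those train from scratch. The filter below is plain Python so it can be read without TensorFlow. The variable names and the `ENet/fullconv` scope are hypothetical placeholders, not names confirmed from this repo.

```python
def variables_to_restore(variable_names, exclude_scopes):
    """Keep only the names that do not fall under any excluded scope prefix."""
    def excluded(name):
        return any(name == s or name.startswith(s + "/") for s in exclude_scopes)
    return [n for n in variable_names if not excluded(n)]


# Hypothetical variable names, for illustration only.
all_names = [
    "ENet/initial_block/conv/weights",
    "ENet/bottleneck1_0/conv1/weights",
    "ENet/fullconv/weights",
    "ENet/fullconv/biases",
]
restorable = variables_to_restore(all_names, exclude_scopes=["ENet/fullconv"])
print(restorable)
```

In TensorFlow 1.x, the variables matching the kept names could then be passed as `var_list` to `tf.train.Saver` and loaded with its `restore` method, using your own checkpoint path.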

RobinHan24 commented 5 years ago

Hello everyone, how do we prepare our own dataset for training? Thank you.

RobinHan24 commented 5 years ago

Have you figured out how it works? I trained on my own dataset as well, but the accuracy is very low.

I made my own dataset, but I got the error below:
InvalidArgumentError (see above for traceback): assertion failed: [labels out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (mean_iou/ToInt64_1:0) = ] [2] [[Node: mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch/_5481, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1/_5483, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_2/_5485)]]

Traceback (most recent call last):
  File "train_enet.py", line 337, in <module>
    run()
  File "train_enet.py", line 293, in run
    loss, training_accuracy, training_mean_IOU = train_step(sess, train_op, sv.global_step, metrics_op=metrics_op)
  File "train_enet.py", line 202, in train_step
    total_loss, global_step_count, accuracy_val, mean_IOU_val, _ = sess.run([train_op, global_step, accuracy, mean_IOU, metrics_op])
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [labels out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (mean_iou/ToInt64_1:0) = ] [2]
[[Node: mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch/_5481, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1/_5483, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_2/_5485)]]

Caused by op u'mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert', defined at:
  File "train_enet.py", line 337, in <module>
    run()
  File "train_enet.py", line 192, in run
    mean_IOU, mean_IOU_update = tf.contrib.metrics.streaming_mean_iou(predictions=predictions, labels=annotations, num_classes=num_classes)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/contrib/metrics/python/ops/metric_ops.py", line 3528, in streaming_mean_iou
    name=name)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/metrics_impl.py", line 1128, in mean_iou
    num_classes, weights)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/metrics_impl.py", line 298, in _streaming_confusion_matrix
    labels, predictions, num_classes, weights=weights, dtype=dtypes.float64)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/confusion_matrix.py", line 171, in confusion_matrix
    labels, num_classes_int64, message='labels out of bound')],
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/check_ops.py", line 559, in assert_less
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/util/tf_should_use.py", line 118, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 157, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2057, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1895, in BuildCondBranch
    original_result = fn()
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 155, in true_assert
    condition, data, summarize, name="Assert")
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 51, in _assert
    name=name)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): assertion failed: [labels out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (mean_iou/ToInt64_1:0) = ] [2] [[Node: mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch/_5481, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1/_5483, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_2/_5485)]]

Could you help me, please?

c13proto commented 5 years ago

I faced the same problem. In my case, I remade the annotation images so that they no longer contain the value 255, and now it works. https://github.com/DrSleep/tensorflow-deeplab-resnet/issues/107#issuecomment-325857231

x7hkvip commented 5 years ago

@RobinHan24 I met the same problem. I have 10 classes, so I set the pixel values of my label images to 0 through 9, and that fixed the problem. I don't know whether this will help you.
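
Both replies above point at the same cause: the `labels out of bound` assertion fires when an annotation pixel is not strictly less than `num_classes`. A minimal sketch of a check to run before training follows. The function name is hypothetical, and `pixels` stands for any flat sequence of integer label values read from one annotation image.

```python
def out_of_range_labels(pixels, num_classes):
    """Return the sorted distinct label values outside [0, num_classes)."""
    return sorted({p for p in pixels if not 0 <= p < num_classes})


# Toy annotation: 10 classes expected, but the value 255 is present.
toy_pixels = [0, 3, 9, 255, 255, 4]
bad = out_of_range_labels(toy_pixels, num_classes=10)
print(bad)
```

An empty result means every label fits. In the error quoted earlier, `y = [2]` is the `num_classes` the metric was built with, so in that run only the labels 0 and 1 would pass.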

jayashreek3 commented 5 years ago

Thanks for this useful repo. Hi everyone, I would appreciate any help with this issue.

1) The current code works for the CamVid dataset.
2) I am having difficulty training this ENet model on the Cityscapes dataset. I used https://github.com/mcordts/cityscapesScripts to prepare the data, and now I would like to import it into this code, but it reports a dimension mismatch. Could you please help me fix the import of these grayscale images? After preparing the data I have 4 types (color.png, instance.png, labeld.png, json.png, trainid.png) of labels. How do I choose one of them and import it into this model? I tried a single type of image and got this error:

InvalidArgumentError (see above for traceback): assertion failed: [labels out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (mean_iou/ToInt64_1:0) = ] [12] [[node mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert (defined at /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/metrics/python/ops/metric_ops.py:3561) = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_2)]]

As I am a beginner in this field, I am hoping for suggestions to resolve this error.
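
A hedged note on the Cityscapes case: the `[12]` in the assertion is the `num_classes` the metric was built with, so every label pixel would have to be in 0..11. As far as I know, the train-ID label images produced by cityscapesScripts use IDs 0..18 for the evaluated classes and 255 for ignored pixels, so that condition cannot hold. One possible workaround, sketched below, is to fold 255 into an extra class index and set `num_classes` to 20. The function name and the choice of 19 as the extra index are assumptions for illustration, not something this repo prescribes.

```python
IGNORE_ID = 255    # value used for ignored pixels in train-ID label images
EXTRA_CLASS = 19   # assumed index for an added "ignore" class
NUM_CLASSES = 20   # 19 evaluated classes plus the added one


def remap_ignore(pixels, ignore_id=IGNORE_ID, extra_class=EXTRA_CLASS):
    """Replace the ignore value with a regular class index; keep other labels."""
    return [extra_class if p == ignore_id else p for p in pixels]


toy_row = [0, 7, 18, 255, 11]
remapped = remap_ignore(toy_row)
print(remapped, max(remapped) < NUM_CLASSES)
```

With this mapping the extra class still contributes to the loss unless its class weight is set to zero, so that is a choice to make deliberately.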