jakeret / tf_unet

Generic U-Net Tensorflow implementation for image segmentation
GNU General Public License v3.0

Error when training with Dice Coefficient #28

Open bslin opened 7 years ago

bslin commented 7 years ago

Hi,

I get an error when I try training with the dice coefficient as the cost function. I noticed there was a new commit on this a couple of days ago, so I suspect it's some bug in the code. Would you know roughly where this might be?


InvalidArgumentError                      Traceback (most recent call last)

in ()
----> 1 path = trainer.train(generator, "./unet_trained", training_iters=100, epochs=100, display_step=5)

/home/proj/tf_unet/tf_unet/unet.pyc in train(self, data_provider, output_path, training_iters, epochs, dropout, display_step, restore)
    424
    425                     if step % display_step == 0:
--> 426                         self.output_minibatch_stats(sess, summary_writer, step, batch_x, util.crop_to_shape(batch_y, pred_shape))
    427
    428                     total_loss += loss

/home/proj/tf_unet/tf_unet/unet.pyc in output_minibatch_stats(self, sess, summary_writer, step, batch_x, batch_y)
    467                                               feed_dict={self.net.x: batch_x,
    468                                                          self.net.y: batch_y,
--> 469                                                          self.net.keep_prob: 1.})
    470         summary_writer.add_summary(summary_str, step)
    471         summary_writer.flush()

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    764     try:
    765       result = self._run(None, fetches, feed_dict, options_ptr,
--> 766                          run_metadata_ptr)
    767       if run_metadata:
    768         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
    962     if final_fetches or final_targets:
    963       results = self._do_run(handle, final_targets, final_fetches,
--> 964                              feed_dict_string, options, run_metadata)
    965     else:
    966       results = []

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1012     if handle is None:
   1013       return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
-> 1014                            target_list, options, run_metadata)
   1015     else:
   1016       return self._do_call(_prun_fn, self._session, handle, feed_dict,

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
   1032       except KeyError:
   1033         pass
-> 1034       raise type(e)(node_def, op, message)
   1035
   1036   def _extend_graph(self):

InvalidArgumentError: Nan in summary histogram for: norm_grads
     [[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_37/read)]]

Caused by op u'norm_grads', defined at:
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py", line 3, in
    app.launch_new_instance()
  File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelapp.py", line 474, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 887, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 276, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 228, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 390, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/zmqshell.py", line 501, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2821, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    path = trainer.train(generator, "./unet_trained", training_iters=100, epochs=100, display_step=5)
  File "/home/proj/tf_unet/tf_unet/unet.py", line 389, in train
    init = self._initialize(training_iters, output_path, restore)
  File "/home/proj/tf_unet/tf_unet/unet.py", line 342, in _initialize
    tf.summary.histogram('norm_grads', self.norm_gradients_node)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 205, in histogram
    tag=scope.rstrip('/'), values=values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 139, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: norm_grads
     [[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_37/read)]]
bslin commented 7 years ago

The error gets hit only after some number of iterations. It seems to get hit after fewer iterations when I use the adam optimizer rather than momentum, but that might just be specific to my case. After enough iterations, I get this error regardless of the optimizer I use. The same training/testing data works fine if I use cross entropy as the cost function.
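For context, roughly how the two setups are selected (a sketch only; the constructor and Trainer arguments are assumed from the tf_unet usage docs of that time, and the network parameters are illustrative):

```python
from tf_unet import unet

# Dice-coefficient cost: the setup that eventually produces the NaN error.
net_dice = unet.Unet(channels=1, n_class=2, layers=3, features_root=16,
                     cost="dice_coefficient")
trainer_dice = unet.Trainer(net_dice, optimizer="adam")

# Cross-entropy cost: the setup that trains fine on the same data.
net_ce = unet.Unet(channels=1, n_class=2, layers=3, features_root=16,
                   cost="cross_entropy")
trainer_ce = unet.Trainer(net_ce, optimizer="momentum",
                          opt_kwargs=dict(momentum=0.2))
```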

bslin commented 7 years ago

Quick update: found the issue. There is a bug in layers.py, in both pixel_wise_softmax_2 and pixel_wise_softmax.

If the output_map is too large, the exponential_map goes to infinity, which produces NaN when the cost function is calculated.

The following change fixes it, although we might want to find a better value for the clipping:

replace
exponential_map = tf.exp(output_map)
with
exponential_map = tf.exp(tf.clip_by_value(output_map, -np.inf, 50))
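A condensed sketch of the patched function (not the exact repository code; the normalisation lines are paraphrased, and the clipping bound of 50 is the provisional value mentioned above):

```python
import numpy as np
import tensorflow as tf

def pixel_wise_softmax_2(output_map):
    # Clip the logits before exponentiating so tf.exp cannot overflow to inf,
    # which would otherwise make the dice cost (and its gradients) NaN.
    # exp(50) ~ 5e21, which still fits comfortably in float32.
    clipped = tf.clip_by_value(output_map, -np.inf, 50.0)
    exponential_map = tf.exp(clipped)
    # Normalise over the class channels (last axis of [batch, ny, nx, n_class]).
    normalize = tf.reduce_sum(exponential_map, 3, keep_dims=True)
    return exponential_map / normalize
```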

BTW thanks for providing the tf_unet code. It has been very helpful! :)

jakeret commented 7 years ago

Thanks for reporting this. I'm just wondering why the output_map gets so large

bslin commented 7 years ago

Yeah, I'm wondering the same thing. I just noticed that I still get garbage results when training on my data (with cross entropy I was getting something more reasonable).

I have no idea why the output_map gets so large; I plan on looking into it some more a little later. Would you happen to have any ideas or theories to look into?

mateuszbuda commented 7 years ago

I have also encountered this issue. Using a smaller learning rate helped, so maybe it's just an exploding gradient.
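For example, a smaller learning rate can be passed through opt_kwargs (a sketch; the key names are assumed from the Trainer interface and the values are only illustrative):

```python
from tf_unet import unet

net = unet.Unet(channels=1, n_class=2, cost="dice_coefficient")
# Noticeably smaller learning rate than the momentum-optimizer default,
# to check whether the NaN comes from an exploding gradient.
trainer = unet.Trainer(net, optimizer="momentum",
                       opt_kwargs=dict(learning_rate=0.02, momentum=0.2))
```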

bslin commented 7 years ago

Maybe. Another thing I noticed is that, to calculate the dice coefficient, the original code uses both channels together. When I use only one of the channels, the values I end up with turn out to be better.
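A sketch of what "one channel only" means here, computing the dice coefficient on the foreground channel alone (function and variable names are illustrative, not the repository's):

```python
import tensorflow as tf

def foreground_dice(prediction, ground_truth, eps=1e-5):
    # prediction, ground_truth: [batch, ny, nx, n_class] tensors with n_class=2.
    # Use only the foreground channel instead of summing over both channels.
    pred_fg = prediction[:, :, :, 1]
    true_fg = ground_truth[:, :, :, 1]
    intersection = tf.reduce_sum(pred_fg * true_fg)
    union = tf.reduce_sum(pred_fg) + tf.reduce_sum(true_fg)
    return (2.0 * intersection + eps) / (union + eps)
```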

weiliu620 commented 7 years ago

This is a typical overflow/underflow issue when computing sum(exp(x)). Searching for 'log sum exp' on the web will give some explanation. The trick is to divide/multiply by the same constant before applying the exp function.

Or you can use tf.reduce_logsumexp, or refer to the source code of that function.
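For illustration, the identity behind the trick and its tf.reduce_logsumexp form (a toy example, not repository code):

```python
import tensorflow as tf

# log(sum_i exp(x_i)) = m + log(sum_i exp(x_i - m)), with m = max_i x_i.
# Shifting by the max keeps every exp() argument <= 0, so nothing overflows.
logits = tf.constant([[1000.0, 999.0, 998.0]])  # naive exp() overflows to inf

naive = tf.log(tf.reduce_sum(tf.exp(logits), axis=-1))  # -> inf
stable = tf.reduce_logsumexp(logits, axis=-1)           # -> ~1000.41

with tf.Session() as sess:
    print(sess.run([naive, stable]))
```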

jakeret commented 7 years ago

@weiliu620 thanks for the hint. I'm going to look into this

jakeret commented 7 years ago

@weiliu620 following the lines from here referred to in your SO question, we would just have to subtract the result of tf.reduce_max in the tf.exp call, right?
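That is the standard stabilisation; applied to the pixel-wise softmax it would look something like this (a sketch, not the repository code; the reduction is over the class axis):

```python
import tensorflow as tf

def stable_pixel_wise_softmax(output_map):
    # Subtract the per-pixel maximum over the class channels before exp().
    # Softmax is invariant to this shift, and the exp() arguments become <= 0,
    # so they can no longer overflow to inf.
    max_axis = tf.reduce_max(output_map, axis=3, keep_dims=True)
    exponential_map = tf.exp(output_map - max_axis)
    normalize = tf.reduce_sum(exponential_map, axis=3, keep_dims=True)
    return exponential_map / normalize
```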