jakeret / tf_unet

Generic U-Net Tensorflow implementation for image segmentation
GNU General Public License v3.0

Training loss doesn't converge for custom dataset #37

Open panovr opened 7 years ago

panovr commented 7 years ago

My custom training dataset has 5000 color images and 5000 corresponding mask images.

2017-04-21 21:21:12,678 Start optimization
2017-04-21 21:21:14,525 Iter 0, Minibatch Loss= 0.6760, Training Accuracy= 0.6090, Minibatch error= 39.1%
2017-04-21 21:21:15,047 Iter 2, Minibatch Loss= 0.7317, Training Accuracy= 0.4318, Minibatch error= 56.8%
2017-04-21 21:21:15,608 Iter 4, Minibatch Loss= 0.5663, Training Accuracy= 0.7503, Minibatch error= 25.0%
2017-04-21 21:21:16,575 Iter 6, Minibatch Loss= 0.5671, Training Accuracy= 0.7178, Minibatch error= 28.2%
2017-04-21 21:21:17,157 Iter 8, Minibatch Loss= 0.3097, Training Accuracy= 0.8937, Minibatch error= 10.6%
2017-04-21 21:21:17,694 Iter 10, Minibatch Loss= 0.5114, Training Accuracy= 0.7774, Minibatch error= 22.3%
2017-04-21 21:21:18,233 Iter 12, Minibatch Loss= 0.6332, Training Accuracy= 0.6583, Minibatch error= 34.2%
2017-04-21 21:21:18,710 Iter 14, Minibatch Loss= 0.5695, Training Accuracy= 0.7293, Minibatch error= 27.1%
2017-04-21 21:21:19,249 Iter 16, Minibatch Loss= 0.4922, Training Accuracy= 0.8320, Minibatch error= 16.8%
2017-04-21 21:21:19,754 Iter 18, Minibatch Loss= 0.5962, Training Accuracy= 0.7211, Minibatch error= 27.9%
2017-04-21 21:21:19,977 Epoch 0, Average loss: 0.5595, learning rate: 0.2000
2017-04-21 21:21:20,200 Verification error= 15.3%, loss= 0.4723

2017-04-21 21:27:59,767 Epoch 48, Average loss: 0.4393, learning rate: 0.0171
2017-04-21 21:27:59,993 Verification error= 15.3%, loss= 0.4511
2017-04-21 21:28:01,688 Iter 980, Minibatch Loss= 0.6371, Training Accuracy= 0.7971, Minibatch error= 20.3%
2017-04-21 21:28:02,623 Iter 982, Minibatch Loss= 0.3423, Training Accuracy= 0.9056, Minibatch error= 9.4%
2017-04-21 21:28:03,648 Iter 984, Minibatch Loss= 0.6891, Training Accuracy= 0.6723, Minibatch error= 32.8%
2017-04-21 21:28:04,706 Iter 986, Minibatch Loss= 0.4940, Training Accuracy= 0.7985, Minibatch error= 20.1%
2017-04-21 21:28:05,809 Iter 988, Minibatch Loss= 0.3383, Training Accuracy= 0.9188, Minibatch error= 8.1%
2017-04-21 21:28:06,813 Iter 990, Minibatch Loss= 0.4692, Training Accuracy= 0.7797, Minibatch error= 22.0%
2017-04-21 21:28:07,792 Iter 992, Minibatch Loss= 0.7902, Training Accuracy= 0.5315, Minibatch error= 46.8%
2017-04-21 21:28:08,937 Iter 994, Minibatch Loss= 0.6003, Training Accuracy= 0.7040, Minibatch error= 29.6%
2017-04-21 21:28:09,864 Iter 996, Minibatch Loss= 0.4520, Training Accuracy= 0.7768, Minibatch error= 22.3%
2017-04-21 21:28:10,906 Iter 998, Minibatch Loss= 0.6925, Training Accuracy= 0.7355, Minibatch error= 26.5%
2017-04-21 21:28:11,301 Epoch 49, Average loss: 0.5205, learning rate: 0.0162
2017-04-21 21:28:11,525 Verification error= 15.3%, loss= 0.4540
2017-04-21 21:28:12,691 Optimization Finished!

I use this code for custom dataset training:

from tf_unet import image_util
from tf_unet import unet
from tf_unet import util

search_path = 'data/train/*.jpg'
data_provider = image_util.ImageDataProvider(search_path, data_suffix='.jpg', mask_suffix='.png')

net = unet.Unet(channels=data_provider.channels, n_class=data_provider.n_class, layers=3, features_root=32)

trainer = unet.Trainer(net, optimizer="momentum", opt_kwargs=dict(momentum=0.2))

path = trainer.train(data_provider, "./unet_trained", training_iters=20, epochs=50, display_step=2)
jakeret commented 7 years ago

The Average loss seems to get smaller (at least in the two epochs listed). Have you checked what the curves look like in Tensorboard? Do they remain constant?

layers=3, features_root=32 is a rather small network. I would experiment with a feature size of 64 and more layers (maybe also more epochs).

You could also change the optimizer to adam, which tends to do a better job on hard problems.

Finally, you could try to normalize your input.

Hope that helps a bit

panovr commented 7 years ago

@jakeret May I ask how I can check the curves in Tensorboard? The tf_unet documentation says:

Keep track of the learning progress using Tensorboard. tf_unet automatically outputs relevant summaries.

But I still don't know how.

AlibekJ commented 7 years ago

run tensorboard --logdir=<FULL PATH TO YOUR Train FOLDER>

agrafix commented 7 years ago

I am observing a similar issue. At first I assumed that my mask/input data was invalid, but I checked this by adding print statements in the ImageDataProvider and it seems good. After the first epoch, the prediction is completely filled with one label (=> whole image is black). With tensorboard I can also observe that the deconv_concat layers remain mostly "empty".

"my" code:

from tf_unet import unet, util, image_util

from PIL import Image
import glob
import numpy as np

output_path = "out"
epochs=12
iters=32

class ImageDataProvider(image_util.BaseDataProvider):
    n_class = 2

    def __init__(self, search_path, a_min=None, a_max=None, data_suffix=".tif", mask_suffix='_mask.tif'):
        super(ImageDataProvider, self).__init__(a_min, a_max)
        self.data_suffix = data_suffix
        self.mask_suffix = mask_suffix
        self.file_idx = -1

        self.data_files = self._find_data_files(search_path)

        assert len(self.data_files) > 0, "No training files"
        print("Number of files used: %s" % len(self.data_files))

        img = self._load_file(self.data_files[0])
        self.channels = 1 if len(img.shape) == 2 else img.shape[-1]

    def _find_data_files(self, search_path):
        all_files = glob.glob(search_path)
        return [name for name in all_files if not self.mask_suffix in name]

    def _load_file(self, path, dtype=np.float32):
        return np.array(Image.open(path), dtype)

    def _cylce_file(self):
        self.file_idx += 1
        if self.file_idx >= len(self.data_files):
            self.file_idx = 0

    def _next_data(self):
        self._cylce_file()
        image_name = self.data_files[self.file_idx]
        label_name = image_name.replace(self.data_suffix, self.mask_suffix)

        img = self._load_file(image_name, np.float32)
        label = self._load_file(label_name, np.float32)

        booler = np.vectorize(lambda t: t > 0)
        label = booler(label)

        return img,label

#preparing data loading
data_provider = ImageDataProvider("data/*.jpg", a_min=0, a_max=255, data_suffix=".jpg", mask_suffix="_seg.jpg")

#setup & training
net = unet.Unet(channels=data_provider.channels, n_class=data_provider.n_class, layers=5, features_root=64)
trainer = unet.Trainer(net, optimizer="adam")
path = trainer.train(data_provider, output_path, training_iters=iters, epochs=epochs)

# validation
x_test, y_test = data_provider(1)
prediction = net.predict(path, x_test)

unet.error_rate(prediction, util.crop_to_shape(y_test, prediction.shape))

img = util.combine_img_prediction(x_test, y_test, prediction)
util.save_image(img, "prediction.jpg")

I also (obviously) get this warning:

/usr/local/lib/python3.5/dist-packages/tf_unet-0.1.0-py3.5.egg/tf_unet/util.py:74: RuntimeWarning: invalid value encountered in true_divide
  img /= np.amax(img)
agrafix commented 7 years ago

This has something to do with layers=5 - if I set layers=3 the issues go away. I will have to wait on training to see if it converges.

AlibekJ commented 7 years ago

Depends on your data, obviously. My set converges after 30-50 epochs.

panovr commented 7 years ago

@jakeret @AlibekJ My custom training dataset has 5000 color images and 5000 corresponding mask images. Can you recommend settings for these parameters:

jakeret commented 7 years ago

@agrafix from a first glance at your code I guess it's ok. I'm wondering if this has something to do with issue #28

jakeret commented 7 years ago

@panovr with deep learning there is typically not one correct answer. Tuning the hyperparameters is the "art" of this approach.

To get started I would use 3-4 layers, 64 features, 32 training iterations and 100 epochs.

agrafix commented 7 years ago

My set seems to converge now (with three layers), but the results are not really good. I cannot increase the number of layers though, because then I run into the bug where everything gets the same label very quickly.

panovr commented 7 years ago

@jakeret Following your suggestion, below is my training code:

search_path = 'data/train/*.jpg'
data_provider = image_util.ImageDataProvider(search_path, a_min=0, a_max=255, data_suffix='.jpg', mask_suffix='.png')

net = unet.Unet(channels=data_provider.channels, n_class=data_provider.n_class, layers=3, features_root=64)

trainer = unet.Trainer(net, optimizer='adam')

path = trainer.train(data_provider, './unet_trained', training_iters=32, epochs=100, display_step=2)

Some of the prediction images generated during training are:

From the accuracy and loss figures, it seems that the network didn't converge.

AlibekJ commented 7 years ago

I work with grayscale images, and the patterns I am searching for tend to appear at more or less the same locations. The network crops out 20 pixels from each side, and those pixels contain information important for segmentation, so adding a border of 20 extra pixels to the original images prior to training does help. Since my images are grayscale, I reused the two remaining channels and encoded the X and Y coordinates into them. This helped a lot.
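A minimal sketch of that trick with NumPy, assuming a 2D grayscale array is handed to the data provider; the function name, the reflect padding, and the coordinate normalisation are illustrative choices, only the 20-pixel border comes from the description above:

import numpy as np

def add_border_and_coords(gray_img, border=20):
    """Pad a grayscale image and encode normalised X/Y coordinates in the two spare channels."""
    padded = np.pad(gray_img, border, mode="reflect")
    ny, nx = padded.shape
    # coordinate grids scaled to [0, 1]
    ys, xs = np.meshgrid(np.linspace(0.0, 1.0, ny), np.linspace(0.0, 1.0, nx), indexing="ij")
    # channel 0: intensity, channel 1: x coordinate, channel 2: y coordinate
    return np.stack([padded, xs, ys], axis=-1)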

jakeret commented 7 years ago

Most of the time I'm also working with grayscale images. Another thing you could experiment with is the dice_coefficient loss function instead of the default cross entropy.

btw: cool data set! 👍
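For reference, a minimal sketch of that suggestion, assuming the loss can be selected through a cost argument on unet.Unet; the rest mirrors the snippets earlier in this thread:

from tf_unet import unet, image_util

data_provider = image_util.ImageDataProvider('data/train/*.jpg', data_suffix='.jpg', mask_suffix='.png')

# same setup as before, but with the dice coefficient loss instead of the default cross entropy
net = unet.Unet(channels=data_provider.channels, n_class=data_provider.n_class,
                layers=3, features_root=64, cost="dice_coefficient")
trainer = unet.Trainer(net, optimizer="adam")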

agrafix commented 7 years ago

I think in the original U-Net paper the network is also only applied to grayscale images. My results with color images are also not very good.

ameliajimenez commented 7 years ago

Hello, I am having the same problem as @panovr with a smaller medical data set of grayscale images: around 800 images with their corresponding masks.

I have tried to use 3 layers instead of 5, 16 and 64 features, and different learning rates, but accuracy and loss didn't seem to converge. Predictions were all black.

My segmentation problem is highly unbalanced, so I have since tried weighted cross entropy with weights [0.01, 0.99], but there is no improvement so far. Predictions now follow the illumination of the image but are not related to the masks that I'm giving; it seems like there is no learning happening.

@jakeret any idea of what's happening or any tips? Thanks in advance!
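For context, a sketch of how such a weighted loss can be wired up in tf_unet (not my exact code; it assumes the cross-entropy cost accepts per-class weights via cost_kwargs, and the [0.01, 0.99] weights are the ones mentioned above):

from tf_unet import unet

# cross entropy weighted against the dominant background class
net = unet.Unet(channels=1, n_class=2, layers=3, features_root=64,
                cost="cross_entropy", cost_kwargs=dict(class_weights=[0.01, 0.99]))
trainer = unet.Trainer(net, optimizer="adam")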

agrafix commented 7 years ago

@amejs see my comment above, this is likely due to a bug with variable layer/feature count.

ameliajimenez commented 7 years ago

Hi @agrafix, which comment do you mean? Because I have already tried reducing the number of layers to 3 and also changing the number of features to 16 and 64. Thanks for the reply!

jakeret commented 7 years ago

@amejs training a network with such a high class imbalance is very hard. A weighted loss or the dice coefficient loss function is probably only going to help marginally. You should try to resample your dataset or create synthetic images.
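As a rough illustration of the resampling idea, a sketch with a hypothetical helper; the oversampling factor and the has_foreground flags are placeholders you would derive from your masks:

import random

def oversample(data_files, has_foreground, factor=5):
    """Repeat images whose masks contain foreground pixels so that minibatches see the rare class more often."""
    resampled = []
    for name, positive in zip(data_files, has_foreground):
        resampled.extend([name] * (factor if positive else 1))
    random.shuffle(resampled)
    return resampled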

jakeret commented 7 years ago

@amejs in #28 it was reported:

Quick update. Found the issue. There is a bug in layers.py: in pixel_wise_softmax_2 and pixel_wise_softmax, if the output_map is too large, then exponential_map goes to infinity, which causes nan when calculating the cost function. The following code fixes it, although we might want to find a better value to do the clipping:

replace: exponential_map = tf.exp(output_map)
with: exponential_map = tf.exp(tf.clip_by_value(output_map, -np.inf, 50))

Could you try if this helps?
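In code form, the quoted change reads roughly like this; the lines around the clipped tf.exp are paraphrased, so treat them as an approximation of the real layers.py:

import numpy as np
import tensorflow as tf

def pixel_wise_softmax_2(output_map):
    # clip the logits before exponentiating so tf.exp cannot overflow to inf,
    # which is what ends up producing NaN in the cost function
    exponential_map = tf.exp(tf.clip_by_value(output_map, -np.inf, 50))
    sum_exp = tf.reduce_sum(exponential_map, axis=3, keep_dims=True)
    return exponential_map / sum_exp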

panovr commented 7 years ago

@jakeret I have tried the dice_coefficient loss function, but the outputs are always black.

ameliajimenez commented 7 years ago

@jakeret thanks for the suggestions, unfortunately fixing the bug in layers.py didn't help. Is it possible to resize the input given to the network, and consequently the following layers, to get a similar (smaller) output?

akashmaity commented 7 years ago

I have the same problem: the network doesn't converge and the prediction maps are all black. Any working solution so far? The predictions are not affected by the ground-truth masks.

jakeret commented 7 years ago

Hard to tell what is going on. Have you experimented with some preprocessing, e.g. data/batch normalization?

akashmaity commented 7 years ago

I thought the data was already being normalized before training. Isn't that the case?

jakeret commented 7 years ago

The data is automatically normalized to [0, 1). In this particular case I'm talking about this kind of normalization (zero mean and unit variance).
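One way to try that is to override _process_data in the data provider; a minimal sketch, with the epsilon being an illustrative choice:

import numpy as np
from tf_unet import image_util

class NormalizedDataProvider(image_util.ImageDataProvider):
    def _process_data(self, data):
        # standardise to zero mean and unit variance instead of the default [0, 1) scaling
        data = data.astype(np.float32)
        data -= np.mean(data)
        data /= (np.std(data) + 1e-8)
        return data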

akashmaity commented 7 years ago

I tried normalization; it still did not help much. I have also observed that many people are having this problem where the output prediction consists of all zeros. Can this be caused by class imbalance in the training data?

jakeret commented 7 years ago

This class imbalance is certainly not helping. Someone reported here that this might also be caused by an overflow/underflow problem in the softmax layer (#28). Unfortunately, I haven't had the time to look into it more.

carsnwd commented 7 years ago

I'm getting a similar error to what @agrafix is getting: a NaN in the summary histogram for norm_grads.

/home/ubuntu/tf_unet/tf_unet/image_util.py:76: RuntimeWarning: invalid value encountered in true_divide
  data /= np.amax(data)
2017-07-11 22:07:05.958169: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Nan in summary histogram for: norm_grads
         [[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_27/read/_97)]]
[... the same "Nan in summary histogram for: norm_grads" warning repeated many more times ...]
Traceback (most recent call last):
  File "main.py", line 31, in <module>
    model = trainer.train(training_data_provider, output_path="prediction", training_iters=100, epochs=225)
  File "/home/ubuntu/tf_unet/tf_unet/unet.py", line 430, in train
    self.output_minibatch_stats(sess, summary_writer, step, batch_x, util.crop_to_shape(batch_y, pred_shape))
  File "/home/ubuntu/tf_unet/tf_unet/unet.py", line 473, in output_minibatch_stats
    self.net.keep_prob: 1.})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: norm_grads
         [[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_27/read/_97)]]

Caused by op u'norm_grads', defined at:
  File "main.py", line 31, in <module>
    model = trainer.train(training_data_provider, output_path="prediction", training_iters=100, epochs=225)
  File "/home/ubuntu/tf_unet/tf_unet/unet.py", line 390, in train
    init = self._initialize(training_iters, output_path, restore)
  File "/home/ubuntu/tf_unet/tf_unet/unet.py", line 342, in _initialize
    tf.summary.histogram('norm_grads', self.norm_gradients_node)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 221, in histogram
    tag=scope.rstrip('/'), values=values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 131, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: norm_grads
         [[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_27/read/_97)]]

It started happening when I added more data (from around 1.5k to around 7k images). I'm going to spend time tweaking these numbers, lowering them, and seeing what happens. It happens at around 32 epochs. I found some similar errors with this and other TensorFlow nets, like here. My images are also 438 x 406 pixels each; I'm not sure if it may be a size issue of them being too large.

If anyone knows what's up or what to try, let me know. I'll come back if I figure it out.

carsnwd commented 7 years ago

I still haven't gotten that issue sorted out. We thought it might be a divide-by-zero issue, so we tried adding this where the error occurs in image_util.py...

    def _process_data(self, data):
        # normalization
        data = np.clip(np.fabs(data), self.a_min, self.a_max)
        data -= np.amin(data)
        # Fix
        try:
            data /= np.amax(data)
        except:
            print("##### ERROR DIVIDE BY 0##### Trying /1 on: ",data)
            data /= 1

It still crashes with the same thing, though. It seems to always be at training iteration 3745 when this happens. Any help would be appreciated.
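For what it's worth, the except branch above can never fire, because NumPy only emits a RuntimeWarning for 0/0 instead of raising; a direct guard on the maximum avoids the NaN for constant images (a sketch written as a subclass rather than an in-place edit of image_util.py):

import numpy as np
from tf_unet import image_util

class SafeDataProvider(image_util.ImageDataProvider):
    def _process_data(self, data):
        data = np.clip(np.fabs(data), self.a_min, self.a_max)
        data -= np.amin(data)
        # skip the division when the (cropped) image is constant, i.e. amax == 0
        max_val = np.amax(data)
        if max_val > 0:
            data /= max_val
        return data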

jakeret commented 7 years ago

I suspect that the initial problem is in the computation of the pixel_wise_softmax. I think it might be worth investigating whether changing the implementation solves the issue. Instead of using the current approach, I would try to reshape the input to [N, classes], run it through tf.nn.softmax, and then convert it back, where N = batch_size * nx * ny.
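Something along these lines, if I read the suggestion correctly (a sketch; the function name is illustrative):

import tensorflow as tf

def pixel_wise_softmax_via_reshape(output_map):
    # flatten [batch, nx, ny, n_class] to [N, n_class] with N = batch*nx*ny,
    # apply the numerically stable built-in softmax, then restore the shape
    shape = tf.shape(output_map)
    flat = tf.reshape(output_map, [-1, shape[3]])
    flat_softmax = tf.nn.softmax(flat)
    return tf.reshape(flat_softmax, shape)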

If someone has the time to try this I would greatly appreciate ;-)

carsnwd commented 7 years ago

Hmm, ok, I'll try to do that and look into fixing pixel_wise_softmax. I'll let you know if I get anywhere.

Also wanted to add that, it turns out, my fix up there did get it to actually run; the results were just crap when using that fix, so not a solution haha.

abbyDC commented 7 years ago

Hello! Has anyone solved the problem with the network not converging/not learning? I'm not sure if it's the way I load images, because the base code I used is the same as demo_toy_problem.py, but the predicted images I get during training are always the same, as if the mask images do not affect the prediction. My input images are grayscale with dimensions 316 x 298. I hope someone can help. Thanks! Here is the code I use:

from __future__ import division, print_function
import numpy as np

from tf_unet import image_gen
from tf_unet import unet
from tf_unet import util
from tf_unet import image_util

generator = image_util.ImageDataProvider('dataset/*.png', data_suffix=".png", mask_suffix='_mask.png')
x_test, y_test = generator(1)
net = unet.Unet(channels=generator.channels, n_class=generator.n_class, layers=3, features_root=64)
trainer = unet.Trainer(net, optimizer="adam")
trainer.verification_batch_size=16
path = trainer.train(generator, "./unet_trained", training_iters=50, epochs=30, display_step=2)
x_test, y_test = generator(1)
prediction = net.predict(path, x_test)

unet.error_rate(prediction, util.crop_to_shape(y_test, prediction.shape))
img = util.combine_img_prediction(x_test,y_test,prediction)
util.save_image(img, "isles_problem.png")
jakeret commented 7 years ago

This is a bit out of scope of the initial issue. Anyway, have you checked the tensorboard? Is the loss decreasing? Do the other plots look halfway sensible? Are generator.channels and generator.n_class what you expect? Is the shape of x_test, y_test = generator(1) what you expect?
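A few print statements against the script above cover the last two questions (a quick sketch):

# sanity-check the data provider before training
x_test, y_test = generator(1)
print("channels:", generator.channels, "n_class:", generator.n_class)
# both should be 4D: (1, height, width, channels) and (1, height, width, n_class)
print("x shape:", x_test.shape, "y shape:", y_test.shape)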

abbyDC commented 6 years ago

Hello! Thank you so much for all the help regarding my concern with the training, @jakeret! I think I have better results now. I just adjusted the batch size to 12 and applied batch norm using "tf.contrib.layers.batch_norm()", and now I've been getting the results I expected. :D
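For anyone looking for a starting point, that call is typically dropped in between a convolution and its activation, roughly like this (a hypothetical sketch, not the actual modification; conv_out and is_training are placeholders):

import tensorflow as tf

def bn_relu(conv_out, is_training):
    # apply batch norm to a convolution output before the ReLU
    normalized = tf.contrib.layers.batch_norm(conv_out, is_training=is_training)
    return tf.nn.relu(normalized)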

myway0101 commented 6 years ago

@abbyDC Could you share your modified code for changing batch_size to 12 and applying batch_norm? I believe it will be very helpful for others! Many thanks

JS-phine commented 6 years ago

Hi, did anyone figure this problem out? I'm encountering the same issue as @panovr ...

I-CANT-CODE commented 5 years ago

run tensorboard --logdir=<FULL PATH TO YOUR Train FOLDER>

By Train Folder do you mean the checkpoints folder or the folder of the repository??

AlibekJ commented 5 years ago

run tensorboard --logdir=<FULL PATH TO YOUR Train FOLDER>

By Train Folder do you mean the checkpoints folder or the folder of the repository??

It has to be the full path to a folder with tensorflow logs.

stiv-yakovenko commented 5 years ago

I have same problem. Any ideas?

AlibekJ commented 5 years ago

I have same problem. Any ideas?

What exactly is the problem you are facing? What have you tried?