panovr opened this issue 7 years ago
The Average loss seems to get smaller (at least in the two epochs listed). Have you checked how the curves look in Tensorboard? Do they remain constant?
layers=3, features_root=32 is a rather small network. I would experiment with a feature size of 64 and more layers (maybe also more epochs).
You could also change the optimizer to adam, which tends to do a better job with hard problems.
Finally, you could try to normalize your input.
Hope that helps a bit
@jakeret May I ask how I can check the curves in Tensorboard? The tf_unet documentation says: "Keep track of the learning progress using Tensorboard. tf_unet automatically outputs relevant summaries." But I still don't know how.
run
tensorboard --logdir=<FULL PATH TO YOUR Train FOLDER>
I am observing a similar issue. At first I assumed that my mask/input data was invalid, but I checked this by adding print statements in the ImageDataProvider and it seems good. After the first epoch, the prediction is completely filled with one label (=> whole image is black). With tensorboard I can also observe that the deconv_concat layers remain mostly "empty".
"my" code:
```python
from tf_unet import unet, util, image_util
from PIL import Image
import glob
import numpy as np

output_path = "out"
epochs = 12
iters = 32

class ImageDataProvider(image_util.BaseDataProvider):
    n_class = 2

    def __init__(self, search_path, a_min=None, a_max=None, data_suffix=".tif", mask_suffix='_mask.tif'):
        super(ImageDataProvider, self).__init__(a_min, a_max)
        self.data_suffix = data_suffix
        self.mask_suffix = mask_suffix
        self.file_idx = -1
        self.data_files = self._find_data_files(search_path)
        assert len(self.data_files) > 0, "No training files"
        print("Number of files used: %s" % len(self.data_files))
        img = self._load_file(self.data_files[0])
        self.channels = 1 if len(img.shape) == 2 else img.shape[-1]

    def _find_data_files(self, search_path):
        all_files = glob.glob(search_path)
        return [name for name in all_files if not self.mask_suffix in name]

    def _load_file(self, path, dtype=np.float32):
        return np.array(Image.open(path), dtype)

    def _cylce_file(self):
        self.file_idx += 1
        if self.file_idx >= len(self.data_files):
            self.file_idx = 0

    def _next_data(self):
        self._cylce_file()
        image_name = self.data_files[self.file_idx]
        label_name = image_name.replace(self.data_suffix, self.mask_suffix)
        img = self._load_file(image_name, np.float32)
        label = self._load_file(label_name, np.float32)
        booler = np.vectorize(lambda t: t > 0)
        label = booler(label)
        return img, label

# preparing data loading
data_provider = ImageDataProvider("data/*.jpg", a_min=0, a_max=255, data_suffix=".jpg", mask_suffix="_seg.jpg")

# setup & training
net = unet.Unet(channels=data_provider.channels, n_class=data_provider.n_class, layers=5, features_root=64)
trainer = unet.Trainer(net, optimizer="adam")
path = trainer.train(data_provider, output_path, training_iters=iters, epochs=epochs)

# validation
x_test, y_test = data_provider(1)
prediction = net.predict(path, x_test)
unet.error_rate(prediction, util.crop_to_shape(y_test, prediction.shape))
img = util.combine_img_prediction(x_test, y_test, prediction)
util.save_image(img, "prediction.jpg")
```
I also (obviously) get this warning:
/usr/local/lib/python3.5/dist-packages/tf_unet-0.1.0-py3.5.egg/tf_unet/util.py:74: RuntimeWarning: invalid value encountered in true_divide
img /= np.amax(img)
This has something to do with layers=5; if I set layers=3, the issues go away. I will have to wait for the training to see if it converges.
Depends on your data, obviously. My set converges after 30-50 epochs
@jakeret @AlibekJ My custom training dataset has 5000 color images and 5000 corresponding mask images. Can you recommend settings for these parameters:
@agrafix from a first glance at your code I guess it's ok. Wondering if this has something to do with issue #28
@panovr with deep learning there is typically not one correct answer; tuning the hyperparameters is the "art" of this approach.
To get started I would use 3-4 layers, 64 features, 32 training iterations, and 100 epochs.
My set seems to converge now (with three layers), but the results are not really good. I cannot increase the number of layers though, because then I run into the bug where everything gets the same label very quickly.
@jakeret Following your suggestion, below is my training code:
```python
search_path = 'data/train/*.jpg'
data_provider = image_util.ImageDataProvider(search_path, a_min=0, a_max=255, data_suffix='.jpg', mask_suffix='.png')
net = unet.Unet(channels=data_provider.channels, n_class=data_provider.n_class, layers=3, features_root=64)
trainer = unet.Trainer(net, optimizer='adam')
path = trainer.train(data_provider, './unet_trained', training_iters=32, epochs=100, display_step=2)
```
Some of the prediction images generated during the training procedure are:
From the accuracy and loss figures, it seems that the network didn't converge.
I work with grayscale images, and the patterns I am searching for tend to appear at more or less the same locations. The network crops out 20 pixels from each side, and those pixels contain information important for segmentation, so adding a border of 20 extra pixels to the original images prior to training does help. Since my images are grayscale, I reused the two remaining channels and encoded the X and Y coordinates into them. This helped a lot.
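For illustration only, a minimal sketch of that idea; the 20-pixel border, the reflect padding, and the [0, 1] coordinate scaling are assumptions, not AlibekJ's exact recipe. Such a function could be applied in a data provider's `_next_data` before returning the image.

```python
import numpy as np

def add_border_and_coords(img, border=20):
    """Pad a 2-D grayscale image and stack normalized X/Y coordinate channels onto it."""
    padded = np.pad(img, border, mode="reflect")  # extra pixels so the network's cropping does not eat real content
    ny, nx = padded.shape
    ys, xs = np.meshgrid(np.linspace(0, 1, ny), np.linspace(0, 1, nx), indexing="ij")
    return np.stack([padded, xs, ys], axis=-1)    # shape (ny, nx, 3): image, x-coordinate, y-coordinate
```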
Most of the time I'm also working with grayscale images.
Another thing you could experiment with is the dice_coefficient loss function instead of the default cross entropy.
btw: cool data set! 👍
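For completeness, switching the cost in tf_unet looks roughly like the sketch below; it assumes the Unet constructor in the version discussed here accepts cost="dice_coefficient".

```python
from tf_unet import unet

# Assumption: this tf_unet version supports the dice coefficient cost
net = unet.Unet(channels=1, n_class=2, layers=3, features_root=64,
                cost="dice_coefficient")
trainer = unet.Trainer(net, optimizer="adam")
```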
I think in the original U-Net paper the network is also only applied to grayscale images. My results with color images are also not very good.
Hello, I am having the same problem as @panovr, with a smaller medical dataset of grayscale images: around 800 images with their corresponding masks.
I have tried using 3 layers instead of 5, 16 and 64 features, and different learning rates, but accuracy and loss didn't seem to converge. Predictions were all black.
My segmentation problem is highly imbalanced, so I have tried weighted cross entropy with weights [0.01, 0.99], but there is no improvement so far. The predictions now follow the illumination of the image but are not related to the masks I'm providing; it seems like no learning is happening.
@jakeret any idea of what's happening or any tips? Thanks in advance!
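As a reference point, a weighted cross-entropy setup like the one described above would look roughly like this sketch, assuming the cost_kwargs/class_weights interface of the tf_unet version used in this thread:

```python
from tf_unet import unet

# Assumption: cost_kwargs with class_weights is supported by this tf_unet version
net = unet.Unet(channels=1, n_class=2, layers=3, features_root=16,
                cost="cross_entropy",
                cost_kwargs=dict(class_weights=[0.01, 0.99]))
trainer = unet.Trainer(net, optimizer="adam")
```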
@amejs see my comment above, this is likely due to a bug with variable layer/feature count.
hi @agrafix, which comment do you mean? I have already tried reducing the number of layers to 3 and also changing the number of features to 16 and 64. Thanks for the reply!
@amejs training a network with such a high class imbalance is very hard. A weighted loss or the dice coefficient loss function is probably only going to help marginally. You should try to resample your dataset or create synthetic images.
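As a purely illustrative sketch of the resampling idea (the file naming convention and the oversampling factor are assumptions):

```python
import glob
import numpy as np
from PIL import Image

def oversample_files(search_path, data_suffix=".tif", mask_suffix="_mask.tif", factor=5):
    """Repeat file names whose masks contain foreground so rare examples are drawn more often."""
    files = [f for f in glob.glob(search_path) if mask_suffix not in f]
    resampled = []
    for name in files:
        mask = np.array(Image.open(name.replace(data_suffix, mask_suffix)))
        repeats = factor if (mask > 0).any() else 1
        resampled.extend([name] * repeats)
    return resampled
```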
@amejs in #28 it was reported:

> Quick update. Found the issue. There is a bug in layers.py: in pixel_wise_softmax_2 and pixel_wise_softmax, if the output_map is too large, then exponential_map goes to infinity, which causes nan when calculating the cost function. The following code fixes it, although we might want to find a better value to do the clipping:
> replace: exponential_map = tf.exp(output_map)
> with: exponential_map = tf.exp(tf.clip_by_value(output_map, -np.inf, 50))

Could you try if this helps?
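For orientation, the patched helper could look roughly like the sketch below (TF 1.x API); this is not necessarily the exact body of layers.py, and the broadcasting division stands in for whatever normalization the original code uses:

```python
import numpy as np
import tensorflow as tf

def pixel_wise_softmax_2(output_map):
    # Clip the logits so tf.exp cannot overflow to inf and poison the cost with NaNs
    exponential_map = tf.exp(tf.clip_by_value(output_map, -np.inf, 50))
    # Normalize over the class axis (last dimension of [batch, nx, ny, n_class])
    sum_exp = tf.reduce_sum(exponential_map, axis=3, keep_dims=True)
    return exponential_map / sum_exp
```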
@jakeret I have tried the dice_coefficient loss function, but the outputs are always black.
@jakeret thanks for the suggestions; unfortunately, fixing the bug in layers.py didn't help. Is it possible to resize the input given to the network, and consequently the following layers, to get a similar (smaller) output?
I have the same problem: the network doesn't converge and the prediction maps are all black. Any working solution so far? The prediction is not affected by the ground truth masks at all.
Hard to tell what is going on. Have you experimented with some preprocessing e.g. data/batch normalization?
I thought the data is already being normalized before training. Isn't it so?
The data is automatically normalized to [0, 1). In this particular case I'm talking about this kind of normalization (zero-mean and unit variance)
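A minimal sketch of that kind of preprocessing, for instance applied in a custom data provider before feeding the network (the epsilon and per-image statistics are assumptions):

```python
import numpy as np

def standardize(data, eps=1e-8):
    """Zero-mean, unit-variance normalization computed per image."""
    data = data.astype(np.float32)
    return (data - data.mean()) / (data.std() + eps)
```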
I tried normalization; it still did not help much. I also observed that many people are having this problem where the output prediction consists of all zeros. Can this be caused by class imbalance in the training data?
The class imbalance is certainly not helping. Someone reported here that this might also be caused by an overflow/underflow problem in the softmax layer (#28). Unfortunately, I haven't had the time to look into it further.
I'm getting a similar error to what @agrafix is getting: a NaN in the summary histogram for norm_grads.
/home/ubuntu/tf_unet/tf_unet/image_util.py:76: RuntimeWarning: invalid value encountered in true_divide
data /= np.amax(data)
2017-07-11 22:07:05.958169: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Nan in summary histogram for: norm_grads
[[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_27/read/_97)]]
(the last two lines repeat many times with different timestamps)
Traceback (most recent call last):
File "main.py", line 31, in <module>
model = trainer.train(training_data_provider, output_path="prediction", training_iters=100, epochs=225)
File "/home/ubuntu/tf_unet/tf_unet/unet.py", line 430, in train
self.output_minibatch_stats(sess, summary_writer, step, batch_x, util.crop_to_shape(batch_y, pred_shape))
File "/home/ubuntu/tf_unet/tf_unet/unet.py", line 473, in output_minibatch_stats
self.net.keep_prob: 1.})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: norm_grads
[[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_27/read/_97)]]
Caused by op u'norm_grads', defined at:
File "main.py", line 31, in <module>
model = trainer.train(training_data_provider, output_path="prediction", training_iters=100, epochs=225)
File "/home/ubuntu/tf_unet/tf_unet/unet.py", line 390, in train
init = self._initialize(training_iters, output_path, restore)
File "/home/ubuntu/tf_unet/tf_unet/unet.py", line 342, in _initialize
tf.summary.histogram('norm_grads', self.norm_gradients_node)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 221, in histogram
tag=scope.rstrip('/'), values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 131, in _histogram_summary
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Nan in summary histogram for: norm_grads
[[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_27/read/_97)]]
It started happening when I added more data (from around 1.5k to around 7k images). I'm going to spend time tweaking these numbers, lowering them, and seeing what happens. It happens at around 32 epochs. I found some similar errors with this and other TensorFlow nets, like here. My data images are also 438 x 406 pixels each; not sure if it may be a size issue from them being too large?
If anyone knows what's up or what to try, let me know. I'll come back if I figure it out.
I still haven't gotten that issue sorted out. We thought it might be a divide-by-zero issue, so we tried adding this where the error occurs in image_util.py:
```python
def _process_data(self, data):
    # normalization
    data = np.clip(np.fabs(data), self.a_min, self.a_max)
    data -= np.amin(data)
    # Fix
    try:
        data /= np.amax(data)
    except:
        print("##### ERROR DIVIDE BY 0 ##### Trying /1 on: ", data)
        data /= 1
```
It still crashes with the same error, though. It always seems to be the 3745th training iteration when this happens. Any help would be appreciated.
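One note on that workaround: dividing a NumPy float array by zero emits a RuntimeWarning and produces NaN/inf rather than raising an exception, so the try/except above is never triggered. A guard like the following sketch (names follow the snippet above) would avoid the NaN instead:

```python
# Inside _process_data, replacing the bare division (a sketch, not the upstream fix)
max_val = np.amax(data)
if max_val > 0:
    data /= max_val
else:
    print("##### WARNING: constant image, skipping normalization #####")
```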
I suspect that the initial problem is in the computation of pixel_wise_softmax.
I think it might be worth investigating whether changing the implementation solves the issue. Instead of the current approach, I would try to reshape the input to [N, classes] with N = batch_size * nx * ny, run it through tf.nn.softmax, and then convert it back.
If someone has the time to try this I would greatly appreciate it ;-)
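A rough sketch of that reshape-based variant, in TF 1.x style (the function name and the explicit n_class argument are choices made here, not the actual tf_unet code):

```python
import tensorflow as tf

def pixel_wise_softmax_reshaped(output_map, n_class):
    """Softmax over the class axis of a [batch, nx, ny, n_class] logit map."""
    shape = tf.shape(output_map)                  # dynamic shape, restored at the end
    flat = tf.reshape(output_map, [-1, n_class])  # [batch * nx * ny, n_class]
    flat_softmax = tf.nn.softmax(flat)            # numerically stable softmax per pixel
    return tf.reshape(flat_softmax, shape)
```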
Hmm, ok, I'll try to do that and look into fixing pixel_wise_softmax; I'll let you know if I get anywhere.
Also wanted to add that it turns out my fix above did get it to actually run, but the results were poor with that fix, so it's not a solution, haha.
Hello! Has anyone solved the problem with the network not converging / not learning? I'm not sure if it's the way I load images, because the base code I used is the same as demo_toy_problem.py, but the predicted images I get during training are always the same, as if the mask images do not affect the prediction. My input images are grayscale with dimensions 316 x 298. I hope someone can help. Thanks! Here is the code I use:
```python
from __future__ import division, print_function
import numpy as np
from tf_unet import image_gen
from tf_unet import unet
from tf_unet import util
from tf_unet import image_util

generator = image_util.ImageDataProvider('dataset/*.png', data_suffix=".png", mask_suffix='_mask.png')
x_test, y_test = generator(1)

net = unet.Unet(channels=generator.channels, n_class=generator.n_class, layers=3, features_root=64)
trainer = unet.Trainer(net, optimizer="adam")
trainer.verification_batch_size = 16
path = trainer.train(generator, "./unet_trained", training_iters=50, epochs=30, display_step=2)

x_test, y_test = generator(1)
prediction = net.predict("./unet_trained/model.ckpt", x_test)
unet.error_rate(prediction, util.crop_to_shape(y_test, prediction.shape))
img = util.combine_img_prediction(x_test, y_test, prediction)
util.save_image(img, "isles_problem.png")
```
This is a bit out of scope of the initial issue.
Anyway, have you checked Tensorboard? Is the loss decreasing? Do the other plots look halfway sensible? Are generator.channels and generator.n_class what you expect? Is the shape of x_test, y_test = generator(1) what you expect?
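For what it's worth, those checks boil down to something like this quick sketch (it reuses the generator from the code above; the expected values assume a grayscale, binary-mask setup):

```python
import numpy as np

x_test, y_test = generator(1)
print(generator.channels, generator.n_class)  # e.g. 1 and 2 for grayscale images with binary masks
print(x_test.shape, y_test.shape)             # e.g. (1, ny, nx, channels) and (1, ny, nx, n_class)
print(np.unique(y_test))                      # both classes should appear in the labels
```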
Hello! Thank you so much for all the help regarding my concern with the training, @jakeret! I think I have better results now. I just adjusted the batch size to 12 and applied batch norm using tf.contrib.layers.batch_norm(), and now I've been getting the results I expected. :D
@abbyDC Could you share your modified code for changing batch_size to 12 and applying batch_norm? I believe it will be very helpful for others! Many thanks
Hi, did anyone figure this problem out? I'm encountering the same issue as @panovr ...
> run tensorboard --logdir=<FULL PATH TO YOUR Train FOLDER>

By "Train folder" do you mean the checkpoints folder or the folder of the repository?
It has to be the full path to a folder with tensorflow logs.
I have the same problem. Any ideas?
What exactly is the problem you are facing? What have you tried?
My custom training dataset has 5000 color images and 5000 corresponding mask images.
I use this code for custom dataset training: