tensorflow / models

Models and examples built with TensorFlow

deeplab doesn't predict correctly the segmentation masks #3739

Open kirk86 opened 6 years ago

kirk86 commented 6 years ago

Everything seems to be working properly training/evaluation etc. except from the fact that deeplab doesn't predict the segmentation masks.

Example: 001278_image

001278_prediction

The original images in the dataset are either colored like the above one or black and white, but all the masks are black and white.

Raj-08 commented 6 years ago

Same here. Did you freeze your model?

kirk86 commented 6 years ago

@Raj-08 nope. Just trained normally for 25000 steps and then did validation and that's what I get.

Raj-08 commented 6 years ago

Check the data you are visualizing: it should be in [0, 1] and the type should be uint8. Divide it by np.max, i.e. data/np.max(data), convert to uint8, and then visualize; you should get it right then.
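
This suggestion can be sketched in numpy: a prediction whose class ids are 0 and 1 renders almost entirely black when saved directly, so stretch it to the full uint8 range before viewing. The array here is a hypothetical stand-in for a real predicted mask:

```python
import numpy as np

# Hypothetical predicted label map with class ids {0, 1}; saved as-is it
# looks black, because a pixel value of 1 out of 255 is barely visible.
pred = np.array([[0, 1],
                 [1, 0]], dtype=np.uint8)

# Normalize by the max value, stretch to 0-255, and cast back to uint8
# so the foreground class shows up as white when visualized.
vis = (pred / np.max(pred) * 255).astype(np.uint8)
```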

kirk86 commented 6 years ago

@Raj-08 thanks for the response. I'm a bit confused though.

Divide it by np.max i.e data/np.max(data)

That scales the data into the range [0, 1], i.e. float. Converting that to uint8 then truncates everything to integers. I don't see how that's going to help.

Question 1: Should the data be float in the range [0, 1], or uint8 in the range [0, 255]?
Question 2: Should we do the same for the masks?
Question 3: Does deeplab also require bounding boxes for the masks, or can it work without them?
Question 4: Does it matter if the images have different sizes, e.g. img1.size = 256x320x3, img2.size = 520x320x3, etc.? Or should we make them all a fixed size?

Thanks!

meteorshowers commented 6 years ago

@kirk86 I met the same problem! Did you change the last logits layer from 21 to 2? If so, I think maybe I can solve the problem.

meteorshowers commented 6 years ago

@Raj-08 how do I freeze part of my model? Thanks!

kirk86 commented 6 years ago

@meteorshowers Yes I did change the class number from 21 to 2.

khcy82dyc commented 6 years ago

@kirk86 Hi have you figured out the solution? I had exactly the same issue...

kirk86 commented 6 years ago

@khcy82dyc No, TBH I haven't. My two cents, after spending a week chasing bugs not only in the deeplab model but in faster-rcnn as well: get as far away as you can. Most of the models carry extra complexity that makes them hard to understand, since they use slim. In my experience most models break once you change the configuration settings a bit. For instance, with faster-rcnn, things break again once you switch to multi-GPU training.

khcy82dyc commented 6 years ago

@kirk86 @Raj-08 @meteorshowers I may have found a solution. It's to do with the number-of-classes value: for some reason it should not include the background as one class, otherwise it will produce a blank segmentation. So in this case kirk86 might need to change it to 1 instead of 2.

Other funny behaviours I have encountered. In my case I had 1500 images of 500x500 with 7 classes including background, and I'm using initialize_last_layer=False, last_layers_contain_logits_only=True, fine_tune_batch_norm=False, and Deeplabv3_xception as the initial checkpoint:

if I set my background to any value other than 0, training produces a constant loss value

if I set my background to 0 and the number of classes to 7, I get a blank prediction

if I set fine_tune_batch_norm=True, the loss grows to 6 digits by 50000 steps

if I set fine_tune_batch_norm=False, I get "Loss is inf or nan. : Tensor had NaN values" even with a learning rate of 0.00001

Do you guys mind sharing your training parameters here please?

(would be nice if @aquariusjay could provide some suggestions on the reasons for these strange issues)

kirk86 commented 6 years ago

@khcy82dyc TBH I think I've tried it even with 1 class and still had the same results. Plus, all the models under tensorflow/research are so convoluted with unnecessary code that it's hard for me to understand what they are actually doing. As I said, after spending, cough, wasting some time I decided to look elsewhere. It's been about a month since I last touched them, so I can't even remember the settings I had. But let me say one last thing: from what you're saying, you've spent some time debugging and tried multiple configurations; if none of those configurations is working, then maybe ............................. cough!

georgosgeorgos commented 6 years ago

@khcy82dyc I'm trying to train the model on a different dataset. After 3-4k iterations, the loss stops decreasing and starts to oscillate (up to 20k iterations). Did you experience something similar?

sid6641 commented 6 years ago

@kirk86 @khcy82dyc I just can't seem to make it work on my custom dataset. I'm working with the mobilenet_v2 version and getting the same completely black output. Running inference with the pre-trained model works fine for me. I tried most of the things @khcy82dyc mentioned; it's the same result.

Any suggestions?

georgosgeorgos commented 6 years ago

I'm having the same problem with the black output. Probably it is something related to the number or order of classes.

holyprince commented 6 years ago

@meteorshowers hi, I also want to use my own data, which has a different number of classes, but I hit the same problem. I only changed the dataset settings in train.py and segmentation_dataset.py following the manual. Where can I change the last logits layer from 21 to 2? Thank you!

shanyucha commented 6 years ago

Got the same issue. I used deeplab for lane-line segmentation, but unfortunately I got a constant loss between 0.2 and 0.3 and black predicted images.

Soulempty commented 6 years ago

@khcy82dyc Hello, have you found a good solution for training with 1 class? I'm also hitting these problems.

shanyucha commented 6 years ago

@kirk86 do you have an imbalanced class distribution, like 1:100? I encountered your problem before; after setting different loss weights per class, I got non-black masks.

kirk86 commented 6 years ago

@shanyucha it is imbalanced, but not at that level, more like 40:60.

Raj-08 commented 6 years ago

@meteorshowers Do you mean freeze the weights before training or generate frozen graph from trained model ?

GWwangshuo commented 6 years ago

@georgosgeorgos My loss also doesn't change. Any suggestions? Thanks.

XL2013 commented 6 years ago

@shanyucha I met the same imbalanced-data problem, but I don't know how to set the loss weights. Did you mean the weights parameter in tf.losses.softmax_cross_entropy? If not, how do I change the loss weights?

shanyucha commented 6 years ago

@XL2013 you can refer to this issue: https://github.com/tensorflow/models/issues/3730

  1. In your case, the data samples may be strongly biased to one of the classes. That is why the model only predicts one class in the end. To handle that, I would suggest using a larger loss_weight for the under-sampled class (i.e., the class that has fewer data samples). You could modify the weights in line 72 by doing something like weights = tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0, where you need to tune label0_weight and label1_weight (e.g., set label0_weight=1 and increase label1_weight).

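
The per-pixel weighting idea in that suggestion can be sketched in plain numpy (label0_weight and label1_weight are hypothetical values you have to tune; ignore_label pixels get weight 0 so they drop out of the loss):

```python
import numpy as np

def make_weight_mask(labels, label0_weight=1.0, label1_weight=10.0,
                     ignore_label=255):
    """Per-pixel loss weights for a binary task: boost the rare class,
    zero out ignored pixels. Weight values here are illustrative only."""
    weights = np.zeros(labels.shape, dtype=np.float32)
    weights[labels == 0] = label0_weight
    weights[labels == 1] = label1_weight
    weights[labels == ignore_label] = 0.0
    return weights

# Tiny hypothetical label map: background, foreground, ignored, foreground.
labels = np.array([[0, 1],
                   [255, 1]], dtype=np.uint8)
mask = make_weight_mask(labels)
```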
georgosgeorgos commented 6 years ago

@GWwangshuo did you solve? In my case, the problem was in the classes Ids

Blackpassat commented 6 years ago

@shanyucha Hi, I met the constant-loss problem as well. Have you solved this issue? I also assigned weights to different classes, but my loss stays constant at around 0.11.

bleedingfight commented 6 years ago

@shanyucha Thanks for your reply. I use deeplabv3+ to train on my data, and my result is black (object pixels are 1, the rest 0). I want to segment lanes in my data, but the number of lane pixels is too small. If I change the weight, should it be like this:

    scaled_labels = tf.reshape(scaled_labels, shape=[-1])
    # not_ignore_mask = tf.to_float(tf.not_equal(scaled_labels,
    #                                            ignore_label)) * loss_weight
    not_ignore_mask = tf.to_float(tf.equal(scaled_labels, 0)) * 1 + \
                      tf.to_float(tf.equal(scaled_labels, 1)) * 2 + \
                      tf.to_float(tf.equal(scaled_labels, ignore_label)) * 1

I changed the code and set the weight to 2, but the result is still bad. Should I change 2 to some other number, >2 or <2, and why? Thanks for your help!

Blackpassat commented 6 years ago

Hi @bleedingfight, I tried the same way as you did for adding weights to labels, but I ended up with a loss oscillating around 0.11. How is your loss behaving?

shanyucha commented 6 years ago

@bleedingfight the weight is decided by the ratio between class 0 and class 1 in your case, and the ignore_label class should get weight 0, I suppose.

Blackpassat commented 6 years ago

Hi @shanyucha, my weights are assigned according to the ratio of the different classes, and the weight for the ignore label is 0. However my loss does not seem to be decaying. Did you get a normal result after assigning the weights?

bleedingfight commented 6 years ago

@Blackpassat Sorry for the late reply. I trained my models for several days, but the result is still bad. I changed the weight to 10, 200, 500, and 1000, but the loss oscillates around 1. When I train for over 200000 steps, the loss sometimes reaches 0.5-0.7, but it keeps oscillating.

bleedingfight commented 6 years ago

@shanyucha Sorry for the late reply, I don't understand what you mean. Should I change the weight like this:

    not_ignore_mask = tf.to_float(tf.equal(scaled_labels, 0)) * 1 + \
                      tf.to_float(tf.equal(scaled_labels, 1)) * 200 + \
                      tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0

i.e., set the ignore_label term's weight to 0? Or should the 200 be a value in the range [0, 1]? Can you tell me what the three parts mean? I think it is that:

shanyucha commented 6 years ago

@bleedingfight your understanding is right if your ratio between label 0 and label 1 is 200:1

sunformoon commented 6 years ago

For the data imbalance problem, hard-example mining or augmenting the positive examples (augmenting the positive patches) may be another way, rather than just brutely setting a large loss_weight for the imbalanced class.

shanyucha commented 6 years ago

Exactly. The best way is to balance the samples from the beginning; setting different weights is a tradeoff.

zhaolewen commented 6 years ago

After quite some struggles, I finally got the deeplab model running on a single foreground-object-class segmentation task. I find the configuration of the label images very important:

  1. num_classes = number of foreground object classes + background, thus in my case it's 2
  2. ignore_label=255 means that in the single-channel png label image, the value 255 marks the regions that do not influence the calculation of the loss and gradients (thus, pay attention not to mark your object as 255)
  3. the value of the object's class is the exact pixel value in the label png image. That is to say, if you have 5 classes, they should be marked as 0, 1, 2, 3, 4 in the png. They will not be visible in the png, but that's ok. Do not give them values like 40, 80, 120, 180, 220 to make them visible, because the code in the repo reads the exact values in the png as the class labels.
  4. in the label png, 0 is the background. Do not set ignore_label=0 (to not complicate things)
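
Following point 3, a label png drawn with "visible" gray values has to be remapped to consecutive class ids before training. A minimal numpy sketch, assuming a hypothetical palette of raw values:

```python
import numpy as np

# Hypothetical mapping from "visible" gray values to consecutive class ids.
PALETTE = {0: 0, 40: 1, 80: 2, 120: 3, 180: 4}

def remap_labels(label_png, ignore_label=255):
    """Rewrite raw gray values as class ids 0..N-1; anything not in the
    palette becomes the ignore label so it does not poison training."""
    out = np.full(label_png.shape, ignore_label, dtype=np.uint8)
    for raw, cls in PALETTE.items():
        out[label_png == raw] = cls
    return out
```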

I have encountered 3 of the 4 problems mentioned by @khcy82dyc (constant loss, blank prediction, and NaN). All three happened because I had not given the correct value to the object in the label png.

In my case, I labelled the object pixels as 128, but since my num_classes=2, all values greater than or equal to 2 seem to be ignored. Thus the network only sees 0 in the label png, it predicts nothing, and it converges to 0, producing the inevitable NaN even though I added gradient clipping, added 1e-8 to the logits, decreased the learning rate, and disabled the momentum.

By the way, even when the network is running correctly, it will still produce blank predictions during the first steps (about 200 steps at 2 images per batch for me), but it soon starts predicting.
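
A quick way to catch this kind of mistake before training is to check that every value in a label image is either a valid class id or the ignore label. A hypothetical sketch:

```python
import numpy as np

def check_labels(label, num_classes=2, ignore_label=255):
    """Return the label values that are neither a valid class id nor the
    ignore label; an empty result means the mask is safe to train on."""
    vals = np.unique(label)
    return vals[(vals >= num_classes) & (vals != ignore_label)]

# Mask mislabelled the way described above: the object is drawn as 128.
label = np.array([[0, 128],
                  [0, 255]], dtype=np.uint8)
bad = check_labels(label)  # flags 128 as an invalid label value
```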

getsanjeev commented 6 years ago

@zhaolewen Thank you for being so descriptive. I am facing a similar issue, my model is predicting NaNs for all pixel values. Do you have any sample code for DataGenerator for segmentation in keras? Anything you would like to suggest?

zhaolewen commented 6 years ago

Hi @getsanjeev, I don't know about the DataGenerator class. I've modified segmentation_dataset.py; what is important is:

    _PASCAL_VOC_SEG_INFORMATION = DatasetDescriptor(
        splits_to_sizes={
            'train': 1464,
            'train_aug': 10582,
            'trainval': 2913,
            'val': 1449,
        },
        num_classes=21,    # background + number of object classes
        ignore_label=255,  # set object values to 1, 2, 3, 4, etc.;
                           # mark the places you want to ignore as 255
    )

If you're having NaNs all the time, I think it's more because of the configuration and the input instead of the code.

kritiyer commented 6 years ago

@zhaolewen Thank you for your detailed description! I also have a binary segmentation problem, but I set the background class to be 0 (ignore_label=0) and the object class as 1. Is it necessary to change the background to 255 and the object to 0? My predicted images are blank (all pixels are labeled as class 1).

edit: I misunderstood the 4th point. I think zhaolewen meant that you should not set anything for the ignore-label parameter, not that 0 was a bad choice for the ignore-label parameter

lillyro commented 6 years ago

In my case, I labelled the object pixels as 128, but since my num_classes=2, all values greater than or equal to 2 seem to be ignored. Thus the network only sees 0 in the label png, it predicts nothing, and it converges to 0, producing the inevitable NaN even though I added gradient clipping, added 1e-8 to the logits, decreased the learning rate, and disabled the momentum.

Hi @zhaolewen, I don't understand "since my num_classes=2, all values greater than or equal to 2 seem to be ignored" — where does this behaviour come from? And if I want to use 255 for the object, how should I change my code? Thanks!

zhaolewen commented 6 years ago

@lillyro you can divide the values in your image by 255; then your num_classes would also be 2, because you've got 0 as background and 1 as the object.
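
In other words, if the mask stores the object as 255 and the background as 0, an integer division maps it onto the class ids {0, 1} that the training code expects. A minimal sketch with a hypothetical mask:

```python
import numpy as np

# Binary mask as many annotation tools save it: object = 255, background = 0.
raw = np.array([[0, 255],
                [255, 0]], dtype=np.uint8)

# Divide by 255 so the label png holds class ids 0 and 1, matching num_classes=2.
mask = (raw // 255).astype(np.uint8)
```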

omair50 commented 5 years ago

Hi @zhaolewen, Could you please share the configuration settings for train.py for your case? My loss is oscillating around 0.2.

anandhupvr commented 5 years ago

@omair50 find any solution ?

margokhokhlova commented 5 years ago

(quoting @zhaolewen's comment above on label-image configuration)

A late question: were the images (and labels) integers or floats?

margokhokhlova commented 5 years ago

Thank you, I followed this one and it works: the model learns, although the final performance I am getting is lower than what I get with the Unet. I also add a sigmoid activation to the model output and use bce + jaccard loss.

hakS07 commented 5 years ago

@kirk86 did you solve the issue? I have the same problem: I trained for 7000 iterations and the loss is 0.2 (stuck between 0.3 and 0.1). My data is 1500 images (300x400), crop_size=513. Should I train more, or what?

kirk86 commented 5 years ago

@essalahsouad unfortunately not with deeplab, but you might want to look elsewhere. Also, 1500 images seems a bit low for deep learning; usually a deep learning model requires at least 3000-5000 samples for good generalization.

hakS07 commented 5 years ago

@kirk86 thank you

hakS07 commented 5 years ago

@Raj-08, sir, in your first comment, what do you mean by "data should be uint8" and "divide it by np.max, i.e. data/np.max(data)"? I'm facing the same problem: my trained model predicts a black mask. My data is RGB images with (0, 1) labels.

hakS07 commented 5 years ago

https://github.com/tensorflow/models/issues/3739#issuecomment-402583877

(quoting @zhaolewen's comment linked above)

@zhaolewen thank you for being so descriptive. I'm facing the same issue: my trained model predicts a black mask. I followed your suggestion: data = RGB images with (0, 1) labels, classes = 2 (1 object + background), ignore_label=255. Even after 1000 iterations I got the same result, a blank mask. Can you help me?

billy2618 commented 5 years ago

I don't know if anyone is still bothered by the same question. I'm also new to tensorflow, but I've trained and gotten quite decent results with a deeplabv3+ implemented in pytorch, on the same data.

Here is what I found. I faced the same problem with class=2. Besides the black segmentation result from vis.py, I also got an unreasonable miou=1.0 from eval.py. At the beginning I thought there might be some bug inside eval.py or even tf.metrics.mean_iou. However, after I double-checked the labels (via tf.unique) extracted from the tfrecord file, I suddenly realized that the reason I got miou=1.0 and a black segmentation result is that all of my labels were assigned to 0. It all makes sense, and there is no bug in those files. The problem comes from tf.decode_png in data_generator.py.

The tf.decode_png function decodes the png file as uint8 by default, but the masks generated by labelme are saved as uint16.

So the solution to this problem is either to pass the dtype=tf.dtypes.uint16 argument to tf.decode_png, or to change the datatype of the mask to uint8. I chose the latter in case there is another tf.decode_png somewhere.
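
The second fix can be sketched in numpy: as long as every class id fits in a byte, a 16-bit mask from labelme can be safely converted to the uint8 that tf.decode_png expects by default (the array values here are hypothetical):

```python
import numpy as np

# Mask as labelme might save it: a 16-bit png, even though the ids are tiny.
mask16 = np.array([[0, 1],
                   [1, 0]], dtype=np.uint16)

# Downcasting is only safe when every class id fits in a byte.
assert mask16.max() < 256
mask8 = mask16.astype(np.uint8)
```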