kirk86 opened this issue 6 years ago
Same with me. Did you freeze your model?
@Raj-08 nope. Just trained normally for 25000 steps and then did validation and that's what I get.
Check the data you are visualizing: it should be in the range 0-1 and the type should be uint8. Divide it by np.max, i.e. data/np.max(data), convert to uint8 and then visualize; you should get it right then.
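Taken literally, I assume the suggestion is something like this (my own sketch, not code from the repo; it scales to 0-255 before the cast so the uint8 image keeps its contrast):

```python
import numpy as np

# Rough sketch, not code from this thread: normalize an array for viewing.
# `pred` is just a stand-in for whatever prediction you are visualizing.
pred = np.random.randint(0, 21, size=(256, 256))  # placeholder prediction
vis = pred.astype(np.float32) / np.max(pred)      # scale into [0, 1]
vis = (vis * 255).astype(np.uint8)                # cast to uint8 for display
```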
@Raj-08 thanks for the response. I'm a bit confused though.
Divide it by np.max i.e data/np.max(data)
That's scaling the data into the range [0-1], i.e. float. Then casting to uint8 makes them ints. I don't see how that's going to help.
Question 1: Should the data be float in the range [0-1], or should it be uint8 in the range [0-255]?
Question 2: Should we do the same for the masks?
Question 3: Does deeplab also require bounding boxes for the masks, or can it work without them?
Question 4: Does it matter if the images are of different sizes, e.g. img1.size = 256x320x3, img2.size = 520x320x3, etc.? Or should we make them all a fixed size?
Thanks!
@kirk86 I met the same problem! Did you change the last logits layer from 21 to 2? If so, I think maybe I can solve the problem.
@Raj-08 How do I freeze part of my model? Thanks!
@meteorshowers Yes I did change the class number from 21 to 2.
@kirk86 Hi have you figured out the solution? I had exactly the same issue...
@khcy82dyc No, TBH I haven't. My 2 cents after spending a week chasing bugs, not only on the deeplab model but on faster-rcnn as well: get as far away as you can. Most of the models carry extra complexity that makes them hard to understand, since they use slim. In my experience most models break once you change the configuration settings a bit. For instance, with faster-rcnn, once you switch to multi-gpu training things break again.
@kirk86 @Raj-08 @meteorshowers I may have found a solution. It has to do with the number-of-classes value: for some reason it should not include background as a class, otherwise it will produce a blank segmentation. So in this case kirk86 might need to change it to 1 instead of 2.
Other funny behaviours I have encountered. In my case I had 1500 500x500 images with 7 classes including background, using initialize_last_layer=False, last_layers_contain_logits_only=True, fine_tune_batch_norm=False and Deeplabv3_xception as the initial checkpoint:

- if I set my background to any value other than 0, training produces a constant loss value
- if I set my background to 0 and the number of classes to 7, I get a blank prediction
- if I set fine_tune_batch_norm=True, the loss grows to 6 digits by 50,000 steps
- if I set fine_tune_batch_norm=False, I get "Loss is inf or nan. : Tensor had NaN values", even with a learning rate of 0.00001

Do you guys mind sharing your training parameters here, please?
(It would be nice if @aquariusjay could provide some suggestions about the reason for these strange issues.)
@khcy82dyc TBH with you, I think I've tried it even with 1 class and still had the same results. Plus, all the models under tensorflow/research are so convoluted with unnecessary code that it's hard for me to understand what they are actually doing. As I said, after spending, cough, wasting some time, I decided to look elsewhere. It's been about a month since I last touched them, so I can't even remember the settings I had. But let me say one last thing: from my understanding you've spent some time debugging and tried multiple configurations; if none of those configurations is working then maybe ............................. cough!
@khcy82dyc I'm trying to train the model on a different dataset. After 3-4k iterations the loss stops decreasing and starts to oscillate (up to 20k iterations). Did you experience something similar?
@kirk86 @khcy82dyc I just can't seem to make it work on my custom dataset. I'm working with the mobilenet_v2 variant and getting the same completely black output. Running inference on the pre-trained model works fine for me. I tried most of the things @khcy82dyc mentioned, and it's the same result.
Any suggestions?
I'm having the same problem with the black output. Probably it is something related to the number or order of classes.
@meteorshowers Hi, I also want to use my own data, which has a different number of classes, but I hit the same problem. I only changed the dataset settings in train.py and segmentation_dataset.py as described in the manual. Where can I change the last logits layer from 21 to 2? Thank you!
Got the same issue. I used deeplab to do lane line segmentation, but unfortunately I got a constant loss between 0.2 and 0.3 and black predicted output images.
@khcy82dyc Hello, have you found a good solution for 1-class training? I'm running into the same problems.
@kirk86 Do you have an imbalanced class distribution, like 1:100? I encountered your problem before; after setting a different weight_loss, I got non-black masks.
@shanyucha It is imbalanced, but not at that level, more like 40:60.
@meteorshowers Do you mean freezing the weights before training, or generating a frozen graph from the trained model?
@georgosgeorgos My loss also doesn't change at all. Any suggestions? Thanks.
@shanyucha I met the same imbalanced data distribution problem, but I don't know how to set the weight_loss. Did you mean the weights parameter of tf.losses.softmax_cross_entropy? If not, how do I change the weight_loss?
@XL2013 You can refer to this issue: https://github.com/tensorflow/models/issues/3730
@GWwangshuo Did you solve it? In my case, the problem was in the class IDs.
@shanyucha Hi, I met the problem of constant loss as well. Have you solved this issue? I also assigned weights to different classes. But my loss stays constant around 0.11
@shanyucha Thanks for your reply. I use deeplabv3+ to train on my data, and my result is black (object pixels are 1, the others are 0). I want to segment the lane in my data, but the number of lane pixels is too small. If I change the weight, should it be like this:

scaled_labels = tf.reshape(scaled_labels, shape=[-1])
# not_ignore_mask = tf.to_float(tf.not_equal(scaled_labels,
#                                            ignore_label)) * loss_weight
not_ignore_mask = tf.to_float(tf.equal(scaled_labels, 0)) * 1 + tf.to_float(tf.equal(scaled_labels, 1)) * 2 + tf.to_float(tf.equal(scaled_labels, ignore_label)) * 1

I changed the code and set the weight to 2, but the result is very bad. Should I change the 2 to some other number, >2 or <2, and why? Thanks for your help!
Hi @bleedingfight, I tried adding weights to the labels the same way you did, but I ended up with a loss oscillating around 0.11. How does your loss behave?
@bleedingfight The weight is decided by the ratio between class 0 and class 1 in your case, and the ignore_label class should get weight 0, I suppose.
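For illustration, a minimal sketch of the weighting @shanyucha describes (the helper name and the 200:1 default are my own, not from the repo; the actual repo code is the not_ignore_mask snippet quoted above):

```python
import tensorflow as tf

# Sketch only: per-pixel loss weights for a binary (background/lane) problem.
# The rare class gets a weight roughly equal to the class ratio, and the
# ignore_label pixels get weight 0 so they never contribute to the loss.
def weighted_not_ignore_mask(scaled_labels, ignore_label,
                             background_weight=1.0, foreground_weight=200.0):
  return (tf.to_float(tf.equal(scaled_labels, 0)) * background_weight +
          tf.to_float(tf.equal(scaled_labels, 1)) * foreground_weight +
          tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0)
```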
Hi @shanyucha, my weights are assigned according to the ratio of the different classes and the weight for the ignore label is 0. However, my loss does not seem to be decreasing. Did you get a normal result after assigning the weights?
@Blackpassat Sorry for the late reply. I have been training my model for some days, but the result is still very bad. I changed the weight to 10, 200, 500 and 1000, but the loss oscillates around 1. After training for over 200,000 steps the loss is sometimes 0.5-0.7, but it keeps oscillating. This is what I ran:
python "${WORK_DIR}"/train.py \
--logtostderr \
--initialize_last_layer=False \
--num_clones=8 \
--last_layers_contain_logits_only=True \
--dataset='lane_seg' \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=3 \
--atrous_rates=6 \
--atrous_rates=9 \
--output_stride=32 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=16 \
--training_number_of_steps="${NUM_ITERATIONS}" \
--fine_tune_batch_norm=True \
--tf_initial_checkpoint="${INIT_FOLDER}/deeplabv3_pascal_train_aug/model.ckpt" \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${LANE_DATASET}"
The loss keeps oscillating. The code contains the comment "# Use larger learning rate for last layer variables", so I changed the values like this:
for layer in last_layers:
  if layer in var.op.name and 'biases' in var.op.name:
    gradient_multipliers[var.op.name] = 50 * last_layer_gradient_multiplier
    break
  elif layer in var.op.name:
    gradient_multipliers[var.op.name] = 10 * last_layer_gradient_multiplier
    break
I got the same result after training 10k steps. Does anyone know why? What parameters can I change to decrease the loss? My initial model is deeplabv3_pascal_train_aug_2018_01_04.tar.gz, as written in local_test.sh. Is my initial model wrong?
@shanyucha Sorry for the late reply, I don't understand what you mean. Should I change the weight like this:

not_ignore_mask = tf.to_float(tf.equal(scaled_labels, 0)) * 1 + tf.to_float(tf.equal(scaled_labels, 1)) * 200 + tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0

i.e. set tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0? Or should I set the 200 to a value in the range [0-1]? Can you tell me what the three parts mean? I think it is that:
@bleedingfight Your understanding is right, if the ratio between label 0 and label 1 in your data is 200:1.
For the data imbalance problem, hard example mining or augmenting the positive examples (i.e. augmenting the positive patches) may be a better way than just crudely setting a large loss_weight for the imbalanced class.
Exactly. The best way is to balance the samples from the beginning; setting different weights is a tradeoff.
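As a toy illustration of "balance the samples from the beginning" (the file names and the oversampling factor are made up for the example, not taken from this thread):

```python
import random

# Hypothetical file lists: images that contain the rare class vs. the rest.
positive_images = ['img_with_lane_%03d.png' % i for i in range(2)]
negative_images = ['img_background_%03d.png' % i for i in range(10)]

oversample_factor = 5  # assumed rough imbalance between the two groups
train_list = negative_images + positive_images * oversample_factor
random.shuffle(train_list)
print(len(train_list), 'training entries after oversampling')
```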
After quite some struggles, I finally got the deeplab model running on a single foreground-object-class segmentation task. I find the configuration of the label images very important:

- num_classes = number of foreground object classes + background, thus in my case it's 2
- ignore_label=255 means that in the single-channel png label image, the value 255 marks regions that do not influence the calculation of loss and gradients (so pay attention not to mark your object as 255)
- the value of the class of the object is the exact value in the label png image. That is to say, if you have 5 classes, they should be marked as 0, 1, 2, 3, 4 in the png. They will not be visible in the png, but that's ok. Do not give them values like 40, 80, 120, 180, 220 to make them visible, because the code in the repo reads the exact values in the png as the label of the class.
- in the label png, 0 is the background. Do not set ignore_label=0 (to not complicate things)

I have encountered 3 of the 4 problems mentioned by @khcy82dyc (constant loss, blank prediction, and NaN). All three were because I had not given the correct value for the object in the label png. In my case, I labelled the object pixels as 128, but since my num_classes=2, all values greater than or equal to 2 seem to be ignored. Thus the network only sees 0 in the label png, it predicts nothing, and it converges to 0, producing the inevitable NaN even though I added gradient clipping, added 1e-8 to the logits, decreased the learning rate, and disabled the momentum.

By the way, when the network is running correctly, it will still produce blank predictions during some steps (about 200 steps of 2 images per batch for me), but soon it starts predicting.
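A quick way to check the point about exact label values (a small sketch with a placeholder path, not code from the repo):

```python
from PIL import Image
import numpy as np

# Sanity-check a label png before training: the unique values should be
# exactly your class ids (0 .. num_classes-1), plus 255 if that is your
# ignore_label. Values like 128 or 220 mean the mask needs remapping.
labels = np.array(Image.open('SegmentationClassRaw/0001.png'))  # placeholder path
print(np.unique(labels))
```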
@zhaolewen Thank you for being so descriptive. I am facing a similar issue, my model is predicting NaNs for all pixel values. Do you have any sample code for DataGenerator for segmentation in keras? Anything you would like to suggest?
Hi @getsanjeev, I don't know about the DataGenerator class. I've modified segmentation_dataset.py; what is important is:

_PASCAL_VOC_SEG_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 1464,
        'train_aug': 10582,
        'trainval': 2913,
        'val': 1449,
    },
    num_classes=21,    # background + number of object classes
    ignore_label=255,  # set the values for your objects to 1, 2, 3, 4, etc.;
                       # set the places you want to ignore to 255
)
If you're having NaNs all the time, I think it's more because of the configuration and the input instead of the code.
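Building on that, here is a hedged sketch of what a custom 2-class entry in segmentation_dataset.py could look like (the 'lane_seg' name and split sizes are placeholders; it assumes the DatasetDescriptor namedtuple and the dataset-name-to-descriptor dict defined in that file):

```python
# Placeholder descriptor for a hypothetical 2-class 'lane_seg' dataset,
# following the same pattern as _PASCAL_VOC_SEG_INFORMATION above.
_LANE_SEG_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 1200,  # placeholder: number of images per split
        'val': 300,
    },
    num_classes=2,     # background (0) + one foreground class (1)
    ignore_label=255,  # pixels labelled 255 are excluded from the loss
)

# Then register it in the dict that maps dataset names to descriptors, so
# that --dataset='lane_seg' can find it, e.g.:
# _DATASETS_INFORMATION['lane_seg'] = _LANE_SEG_INFORMATION
```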
@zhaolewen Thank you for your detailed description! I also have a binary segmentation problem, but I set the background class to 0 (ignore_label=0) and the object class to 1. Is it necessary to change the background to 255 and the object to 0? My predicted images are blank (all pixels are labelled as class 1).
Edit: I misunderstood the 4th point. I think zhaolewen meant that you should not set anything for the ignore_label parameter, not that 0 was a bad choice for the ignore_label parameter.
In my case, I labelled the object pixels as 128, but since my num_classes=2, all values greater than or equal to 2 seem to be ignored. Thus the network only sees 0 in the label png, it predicts nothing, and it converges to 0, producing the inevitable NaN even though I added gradient clipping, added 1e-8 to the logits, decreased the learning rate, and disabled the momentum.
Hi @zhaolewen, I don't understand the part "since my num_classes=2, all values greater than or equal to 2 seem to be ignored". How can we know this? If I want to use 255 as the object value, how should I change my code? Thanks!
@lillyro You can divide the values in your image by 255; then your num_classes would still be 2, because you've got 0 as background and 1 as the object.
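For example, something like this (a small sketch with placeholder file names) remaps a 0/255 mask to 0/1 class ids:

```python
from PIL import Image
import numpy as np

# Remap a binary mask stored as 0/255 into class ids 0/1, which is what the
# label pngs are expected to contain. File names are placeholders.
mask = np.array(Image.open('label_0_255.png'))
mask = (mask // 255).astype(np.uint8)   # 0 -> 0 (background), 255 -> 1 (object)
Image.fromarray(mask).save('label_0_1.png')
```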
Hi @zhaolewen, Could you please share the configuration settings for train.py for your case? My loss is oscillating around 0.2.
@omair50 Did you find any solution?
(quotes @zhaolewen's label-configuration comment above about num_classes, ignore_label and exact label values)
A late question: were the label images integers or floats?
Thank you, I followed this and it works, the model learns, although the final performance I am getting is lower than what I get with a U-Net. I also add a sigmoid activation to the model output and use BCE + Jaccard loss.
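For reference, one common way to write such a BCE + Jaccard objective (my own formulation, not code from this thread or from the deeplab repo):

```python
import tensorflow as tf

def bce_jaccard_loss(logits, labels, smooth=1.0):
  """BCE plus a soft Jaccard (IoU) term for a binary mask.

  `logits` are the raw model outputs and `labels` are float 0/1 masks of
  the same shape; `smooth` avoids division by zero on empty masks.
  """
  probs = tf.sigmoid(logits)
  bce = tf.reduce_mean(
      tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
  intersection = tf.reduce_sum(probs * labels)
  union = tf.reduce_sum(probs) + tf.reduce_sum(labels) - intersection
  return bce + (1.0 - (intersection + smooth) / (union + smooth))
```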
@kirk86 Did you solve the issue? I have the same problem. I trained for 7000 iterations and the loss = 0.2 (stuck between 0.3 and 0.1). My data = 1500 images (300x400), crop_size = 513. Should I do more training, or what?
@essalahsouad Unfortunately not with deeplab, but you might want to look elsewhere. Also, 1500 images seems a bit low for deep learning; usually a deep learning model requires at least 3000-5000 samples to generalize well.
@kirk86 thank you
@Raj-08, sir, in your first comment, what do you mean by "data should be uint8" + "Divide it by np.max i.e data/np.max(data)"? I'm facing the same problem: my trained model predicts a black mask. My data = RGB images + (0,1) labels.
https://github.com/tensorflow/models/issues/3739#issuecomment-402583877
(quotes @zhaolewen's label-configuration comment above in full)
@zhaolewen Thank you for being so descriptive. I'm facing the same issue: my trained model predicts a black mask. I followed your suggestion:

- data = RGB images + (0,1) labels
- classes = 2 (1 object + background)
- ignore_label = 255

Even after 1000 iterations I get the same result, a blank mask. Can you help me?
I don't know if anyone is still bothered by the same question. I'm also new to tensorflow, but I've trained and gotten quite decent results with a deeplabv3+ implemented in pytorch, on the same data.

Here is what I found. I just faced the same problem with class=2. Besides the black segmentation results from vis.py, I also got an unreasonable mIoU = 1.0 from eval.py. At the beginning I thought there might be a bug inside eval.py or even tf.metrics.mean_iou. However, after I double-checked the labels (with tf.unique) extracted from the tfrecord file, I suddenly realized that the reason I got mIoU = 1.0 and black segmentation results is that all of my labels were decoded as 0. It all makes sense and there is no bug in those files. The problem comes from tf.decode_png in data_generator.py.

The tf.decode_png function decodes the png file as uint8 by default, but my masks generated from labelme are saved as uint16.

So the solution to this problem is either to pass the dtype=tf.dtypes.uint16 argument to tf.decode_png, or to change the datatype of the masks to uint8. I chose the latter, in case there is another tf.decode_png somewhere.
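A minimal sketch of the second option, rewriting the masks as uint8 (paths are placeholders):

```python
from PIL import Image
import numpy as np

# Convert a uint16 labelme mask into uint8 so that the default tf.decode_png
# reads the class ids correctly. Paths are placeholders.
mask = np.array(Image.open('mask_uint16.png'))
assert mask.max() < 256, 'class ids must fit into uint8'
Image.fromarray(mask.astype(np.uint8)).save('mask_uint8.png')
print(np.unique(mask))  # quick check that the expected class ids are present
```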
Everything seems to be working properly (training, evaluation, etc.), except for the fact that deeplab doesn't predict the segmentation masks.
Example:
The original images in the dataset are either colored like the above one or black and white, but all the masks are black and white.