README instructions not working for training on my own dataset

ogail commented 6 years ago

Hi, I tried to follow README instructions for training on my own dataset but it didn't work. Here is what I did:

Update DATA_DIR to point to dataset dir
Update DATA_LIST_PATH to point to train dataset list file.
Update INPUT_SIZE to '1280, 720'
Update NUM_CLASSES to 1
Update LAMBDA1 and LAMBDA2 to 0.4 and 0.6 respectively.

Then ran cmd

python train.py --update-mean-var --train-beta-gamma

Then got this error (shortened)

ValueError: Dimension 3 in both shapes must be equal, but are 1 and 19. Shapes are [1,1,128,1] and [1,1,128,19]. for 'conv6_cls_1/Assign' (op: 'Assign') with input shapes: [1,1,128,1], [1,1,128,19].

Troubleshooting (none of that worked):

Tried to follow advise from https://github.com/hellochick/ICNet-tensorflow/issues/20 by doing the following
Updating icnet_cityscapes_bnnomerge.prototxt by changing conv6_cls num_output from 19 to 1

Then replaced this line in train.py

restore_var = tf.global_variables()

with

restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in v.name]

Then I got same exact error mentioned above.

If anyone was able to train on their own dataset (either using pretrained model or from scratch) please provide steps of changes you did.

Thanks

hellochick commented 6 years ago

Hey @ogail, Since by default is to load pre-trained model and keep finetuning on it. However, the pre-trained cityscapes has 19 classes, while your dataset has only 1. You can comment line 191 to solve the problem, training from scratch.

ogail commented 6 years ago

I think you mean line 189 which is: net.load(args.restore_from, sess)

I tried and it results in loss being ‘nan’

i also tried to load from a saved checkpoint (instead of numpy) however issue was that loss is fixed at 0.511 and these sub4 =0.000 sub24 =0.000 sub124 =0.000 do not change at all.

Any ideas?

hellochick commented 6 years ago

Before that, I want to know what your dataset look like, can you show some examples? If there is only one class, it doesn't need to train anymore, am I right?

ogail commented 6 years ago

The dataset has 2 class obstacles (0) and non-obstacles (255) in a binary format. Here is an example of raw image This is an example of label image (similar to labelTrainIds images in cityscapes)

Think of this as semantic segmentation with two labels (background and foreground). Hope it makes sense. FYI i set the IGNORE_LABEL to 0

hellochick commented 6 years ago

It make sense to me. For this case, I think it's difficult to learn to detect obstacles, since the obstacles contain several different objects. Hence, I think you can restore a pre-trained ImageNet, or ADE20k segmentation, and set the learning rate much lower to try on this task.

Btw, I have tried to do the obstacle detection before, and you can refer to Indoor Segmentation. In this project, I detect obstacles by training on ADE20k, and I compressed num_classes from 150 to 27, just for your reference.

Danzip commented 6 years ago

Im trying to do something similar with LFW data set http://vis-www.cs.umass.edu/lfw/part_labels/ i've set num_classes to 3 and rearranged masks so that mask is a gray scale image where 0 is hair, 1 is face and 2 is background. I also removed the net.load line on the code the error im getting is when the line loss = tf.nn.sparse_softmax_cross_entropy_with_logits is being called. ValueError: Rank mismatch: Rank of labels (received 1) should equal rank of logits minus 1 (received 1).

Can u please explain what the function create_loss expects as input? what is the shape of output, label? when i try it i get label of shape (16,250,250,1) and output of shape (16,15,15,3) after reshaping raw_pred is of shape=(10800,) but label is of shape=(3600,) there is a mismatch here and i suspect its why the function fails, but I cant seem to understand what to do.

ogail commented 6 years ago

@hellochick I finally got it working, here are steps I did:

commenting net.load line
Setting number of classes to 2
Setting IGNORE_LABEL to arbitrary number not 0 or 255 (i set it to 100) Then trained network and got good prediction results (I had to update inference.py and tools.py to get this working): Here is original image

What I did for training is following:

Run python train.py for 8 hrs until loss reached 0.281, then stopped.
Run xpython train.py --update-mean-var --train-beta-gamma (still running) and loss is dropping to 0.27 and continuing.

when you trained on other datasets, how (meaning how long and what's purpose) do you use train.py and train.py --update-mean-var --train-beta-gamma

bhadresh74 commented 6 years ago

@ogail Thank you for the information you provided. Any change, you could make your script public? It would help us a lot. Thank you in advance

ogail commented 6 years ago

@bhadresh74 is there a specific question you have?

bhadresh74 commented 6 years ago

@ogail Yes. Couple of them actually. 1) I trained on two classes but my loss seems to be stuck at 0.6 and not going down. Here are my HP: batch size: 64 Steps: 60000 Others are as given in the repo.

2) While inference, how can I extract probability for each class. The given code returns 0 probability for each pixel for some reason. I would like to know how did you extract the softmax logits?

Thank you

ogail commented 6 years ago

@bhadresh74 Here are some suggestions: 1- getting loss to 0.6 is good indication, pushing it more will require some tinkering like:

increasing number of training steps
increasing batch size
checking to see if ground truth labels has some errors that are consistently failing. 2- I have not tried to extract the probability before.

BCJuan commented 6 years ago

Hi, I would like to make a question for you, @ogail , since I had the same problems: I see that you have done the following:

commenting net.load line
Setting number of classes to 2
Setting IGNORE_LABEL to arbitrary number not 0 or 255 (i set it to 100)

But have you also made the changes that you stated at the beginning? Manly:

Updating icnet_cityscapes_bnnomerge.prototxt by changing conv6_cls num_output from 19 to 1
Then replaced this line in train.py

restore_var = tf.global_variables()

with

restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in _v.name]

ogail commented 6 years ago

@BCJuan excuse me for late reply. Yes I did both changes as well.

qmy612 commented 6 years ago

@hellochick @ogail hello, my question is that, if my datasets is 2 class, I can only use this network by training from scratch? Can't use the previous layers of pre-trained mode or just train the last cls layer? Because in my experiment with caffe, I can train based on pre-trained models. I am not familiar with tf, but the deeplabv3+ tensorflow can also support only training the last layer.

ogail commented 6 years ago

Yes, you will have to train from scratch

BCJuan commented 6 years ago

Hi, In response to @qmy612 (also @ogail ): you can indeed use the pretrained model.

I achieved it yesterday doing the following:

As in #20 do: Updating icnet_cityscapes_bnnomerge.prototxt by changing conv6_cls num_output from 19 to your number of classes. (this is from @ogail initial question)
Then go to network.py, to the load function of class Network and add the following line:if 'conv6_cls' not in var.name: before the line session.run(var.assign(data)). Also change ignore_missing to True

The function should look something like:

def load(self, data_path, session, ignore_missing=True):
        data_dict = np.load(data_path, encoding='latin1').item()
        for op_name in data_dict:
            with tf.variable_scope(op_name, reuse=True):
                for param_name, data in data_dict[op_name].items():
                    try:
                        if 'bn' in op_name:
                            param_name = BN_param_map[param_name]

                        var = tf.get_variable(param_name)
                        if 'conv6_cls' not in var.name:
                            session.run(var.assign(data))
                    except ValueError:
                        if not ignore_missing:
                            raise

Then, you can make the change stated in #20, I mean changing:

restore_var = tf.global_variables()

by

restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in v.name]

or not.

Indeed it would have the same effect since you have not loaded the conv6_cls from the pretrained model, which is the last layer (classification) of the net.

Hope this helps.

ogail commented 6 years ago

@BCJuan did fine-tuning from pretrained model boosted on your custom task? Have u tried to compare that vs training from scratch?

BCJuan commented 6 years ago

Yes, it boosted the results. Indeed I was not obtaining any good results without the pretrained model.

I used the icnet_cityscapes_trainval_bnomerge_90k, but I think that any other model can be used.

ogail commented 6 years ago

@BCJuan what's mIoU before and after using cityecapes pretrained model?

ogail commented 6 years ago

@BCJuan I did load the pretrained model however didn't see much diff between fine-tuning vs training from scratch.

BCJuan commented 6 years ago

@ogail I do not know since I am just finetunning. But with a one hour run, using the pretrained model I achieve like 20% mIoU while without it 6%. Maybe I am doing something wrong.

qmy612 commented 6 years ago

@BCJuan Thank you very much, I will try tomorrow.

seovchinnikov commented 6 years ago

I'll try finetuning too and will report the results. But from my experience finetuning always gives the boost in tendency to generalization of the model so it's nice to try

VincentGu11 commented 6 years ago

Hi @ogail, Thank you very much for sharing your training steps for us. Recently, I need to solve the same problem like you, I set my network parameter to the same like yours and the loss can become to 0.17 and keep going down. However, when I inference my net, the result shows all the image came to 0 or 1, it seems not quite right. Did you have this problem? Thank you!

PratibhaT commented 6 years ago

@ogail Have you tried to train it for multiple classes? What annotating tool I can use for training it on multiple classes? Also what is the accuracy and fps you are getting on evaluation?

ogail commented 6 years ago

@VincentGu11 is the 0 and/or 1 are how the final rendered image looks like? There's function decode_label that converts training index to RGB color

@PratibhaT yes I tried. you could use labelme tool. The accuracy and fps depends on the data and the problem so my numbers wont be relevant in general sense.

PratibhaT commented 6 years ago

@ogail I used VIA annotation tool, which gives .json file. But in this code the list.txt refers to .png image for label. Is there a way to convert a .json annotation files to .png to be used as label. What is the output of labelme tool?

adisrivasa commented 6 years ago

@ogail I am training it for my own dataset consisting of 8 classes. I did all the required changes mentioned above but i am still getting the following error :-

Assign requires shapes of both tensors to match. lhs shape= [8] rhs shape= [19]

Is there some particular change that i missed out?

Soulempty commented 6 years ago

@qmy612 ,Can you share some details about your training with caffe framework? I have the problem to train with matcaffe downloaded .

yeyuanzheng177 commented 6 years ago

@ogail Thank you for the information you provided. Can I ask you two questions? 1.Did you use the ADE20k or any other pre-training model to fine-tune when training your own datasets?

2.What is the basis for setting the IGNORE_LABEL value? Looking forward to your answer.

ogail commented 6 years ago

@PratibhaT I did not search for such tool however I'd just do conversion myself to get going.

@adisrivasa This seems that (1) number of classes in train.py is not set to 8 or (2) the protoxtx file for pretrained checkpoint is not updated to use 8 classes instead of 18

@yeyuanzheng177 (1) I did not use ADE20k for fine-tuning instead I used cityscapes. (2) I set this value to 255

Soulempty commented 6 years ago

@qmy612 ,hello,,can you give some advice on how to train ICNet on caffe,Thank you for your help

yeyuanzheng177 commented 6 years ago

@ogail Thank you for the information you provided. My data set is the same as yours. (0,0,0) and (255,255,255) are represented by two categories of tags. When I set IGNORE_LABEL = 0, the result is Sub4 =nan Sub24 =nan Sub124 =nan When I set IGNORE_LABEL not to 0, the result is Step 0 total loss = 3.639, sub4 = 0.471, sub24 = 0.857, sub124 = 1.916 (3.606 sec/step) Step 1 total loss = 1.897, sub4 = 0.281, sub24 = 0.521, sub124 = 0.451 (0.161 sec/step) Step 2 total loss = 1.342, sub4 = 0.180, sub24 = 0.267, sub124 = 0.088 (0.162 sec/step) Step 3 total loss = 1.328, sub4 = 0.181, sub24 = 0.330, sub124 = 0.033 (0.158 sec/step) Step 4 total loss = 1.173, sub4 = 0.108, sub24 = 0.160, sub124 = 0.007 (0.161 sec/step) Step 5 total loss = 1.132, sub4 = 0.129, sub24 = 0.074, sub124 = 0.006 (0.159 sec/step) Step 6 total loss = 1.411, sub4 = 0.056, sub24 = 0.028, sub124 = 0.339 (0.160 sec/step) Step 7 total loss = 1.055, sub4 = 0.033, sub24 = 0.009, sub124 = 0.001 (0.158 sec/step) Step 8 total loss = 1.049, sub4 = 0.018, sub24 = 0.004, sub124 = 0.001 (0.160 sec/step) Step 9 total loss = 1.055, sub4 = 0.025, sub24 = 0.006, sub124 = 0.000 (0.158 sec/step These results have troubled me. I just set up the program according to your description, whether it is training from scratch or fine tuning.But it did not work. Can you tell me where your tips for training the network are? Looking forward to your answer.

seushengchao commented 5 years ago

@VincentGu11 Hello. Do you solve the problem( the result shows all the image came to 0 or 1) ?? Thank you! I have the same problem as you.

abreheret commented 5 years ago

I finally got it working, here are steps I did:

commenting net.load line

Setting number of classes to 2

Setting IGNORE_LABEL to arbitrary number not 0 or 255 (i set it to 100) Then trained network and got good prediction results (I had to update inference.py and tools.py to get this working): Here is original image

What I did for training is following:

Run python train.py for 8 hrs until loss reached 0.281, then stopped.

Run xpython train.py --update-mean-var --train-beta-gamma (still running) and loss is dropping to 0.27 and continuing.

when you trained on other datasets, how (meaning how long and what's purpose) do you use train.py and train.py --update-mean-var --train-beta-gamma

Cool, you have succeeded !

I am also learning on my own data, and I would like to know how many images annotated do you have for a satisfactory result (@ogail )?

erichhhhho commented 5 years ago

Sorry. I was wondering if you guys was using pretrianed _icnet_cityscapesbnnomerge.prototxt instead of _icnet_cityscapes_trainval_90kbnnomerge.npy @ogail @hellochick

So, how could I update the pretrained model by changing conv6_cls num_output from 19 to 1 @BCJuan

hellochick commented 5 years ago

Hey @erichhhhho,

You need to change the restore variables just like restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in v.name], thus you can restore the pre-trained weights except the last layer.

amwfarid commented 5 years ago

Hey @erichhhhho,

You need to change the restore variables just like restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in v.name], thus you can restore the pre-trained weights except the last layer.

I still get the same problem even though I set restore_var without conv6_cls (For model retraining using .npy). Am I missing something?

MarcSchotman commented 5 years ago

For me this worked: 1) set to in network.py set ignore_missing to True:

def load(self, data_path, session, ignore_missing=True):

2) Edit INFER_SIZE, TRAINING_SIZE and the whole dict of others_param

3) In train.py change

 restore_var = tf.global_variables()

to

restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in v.name]

4) run

python train.py --dataset others

kangyang94 commented 5 years ago

@ogail

Now the project don't have inference.py and tool.py, do you still have the version you used?

amwfarid commented 5 years ago

@kangyang94

At least for inference.py, it actually exists as a python notebook (demo.ipynb).

prz30 commented 5 years ago

@hellochick I finally got it working, here are steps I did:

commenting net.load line

Setting number of classes to 2

Setting IGNORE_LABEL to arbitrary number not 0 or 255 (i set it to 100) Then trained network and got good prediction results (I had to update inference.py and tools.py to get this working): Here is original image

What I did for training is following:

Run python train.py for 8 hrs until loss reached 0.281, then stopped.

Run xpython train.py --update-mean-var --train-beta-gamma (still running) and loss is dropping to 0.27 and continuing.

when you trained on other datasets, how (meaning how long and what's purpose) do you use train.py and train.py --update-mean-var --train-beta-gamma

Hi @ogail Please forgive me for disturbing you. Can you give me a copy of your code at that moment? Since the author iterative version, I found that there may be differences in some code-modified changes. Thank you! My email address is yf20130607@163.com Finally, please forgive me for my worse English

seushengchao commented 5 years ago

@ogail Thank you for the information you provided. My data set is the same as yours. (0,0,0) and (255,255,255) are represented by two categories of tags. When I set IGNORE_LABEL = 0, the result is Sub4 =nan Sub24 =nan Sub124 =nan When I set IGNORE_LABEL not to 0, the result is Step 0 total loss = 3.639, sub4 = 0.471, sub24 = 0.857, sub124 = 1.916 (3.606 sec/step) Step 1 total loss = 1.897, sub4 = 0.281, sub24 = 0.521, sub124 = 0.451 (0.161 sec/step) Step 2 total loss = 1.342, sub4 = 0.180, sub24 = 0.267, sub124 = 0.088 (0.162 sec/step) Step 3 total loss = 1.328, sub4 = 0.181, sub24 = 0.330, sub124 = 0.033 (0.158 sec/step) Step 4 total loss = 1.173, sub4 = 0.108, sub24 = 0.160, sub124 = 0.007 (0.161 sec/step) Step 5 total loss = 1.132, sub4 = 0.129, sub24 = 0.074, sub124 = 0.006 (0.159 sec/step) Step 6 total loss = 1.411, sub4 = 0.056, sub24 = 0.028, sub124 = 0.339 (0.160 sec/step) Step 7 total loss = 1.055, sub4 = 0.033, sub24 = 0.009, sub124 = 0.001 (0.158 sec/step) Step 8 total loss = 1.049, sub4 = 0.018, sub24 = 0.004, sub124 = 0.001 (0.160 sec/step) Step 9 total loss = 1.055, sub4 = 0.025, sub24 = 0.006, sub124 = 0.000 (0.158 sec/step These results have troubled me. I just set up the program according to your description, whether it is training from scratch or fine tuning.But it did not work. Can you tell me where your tips for training the network are? Looking forward to your answer.

Hello, have you solved the problem? It also troubles for a long time!

Mythos-Rudy commented 5 years ago

The main reason for this problem is the function create_loss() in train.py , the author ignore the background when compute loss，so that if your NUM_CLASSES is 2，and ignore one of them，loss will be very low if model always predict the pixel to another one. So your loss is reach to 0.5(because l2loss is 0.5), but the model learned nothing. To solve the problem, you have to change the code about ignore one classes

SpencerTrihus commented 4 years ago

For me this worked:

1. set to in network.py set ignore_missing to True:

def load(self, data_path, session, ignore_missing=True):

1. Edit `INFER_SIZE`, `TRAINING_SIZE` and the whole dict of `others_param`

2. In train.py change

 restore_var = tf.global_variables()

to

restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in v.name]

1. run

python train.py --dataset others

I am trying to train ICNet on a custom dataset with 2 classes, the background and the object, but I received an error due to cityscapes having 19 classes while mine only includes 2. I have followed the instructions above, which seem to have solved the class problem but now during training all parameters are nan.

I do not understand what is recommended in #20 and in this thread regarding making changes to .prototxt and .npy files. If this is necessary, could you explain how to make this change? If not, what is causing the nan loss results?

Thanks!

gitunit commented 4 years ago

how do i change the number of classes for the pre-trained model? i've found the .prototxt file from the original work but i don't know what to do with it, where to load it.

hellochick / ICNet-tensorflow

README instructions not working for training on my own dataset #50