tensorflow / models

Models and examples built with TensorFlow

[Struct2depth] LossTensor is inf or nan : Tensor had NaN values #6392

Closed · kadut99 closed this issue 5 years ago

kadut99 commented 5 years ago

System information

Source code / logs

Update: I have solved my previous problem, which was caused by my GPU running out of memory; I changed the batch_size to 1. But now it shows another error: there is a NaN value during training. As https://github.com/tensorflow/models/issues/6043 says, this happens because some mask is too small. Has anyone solved this problem?

My dataset : 0000000001 0000000001-fseg

Error : InvalidArgumentError (see above for traceback): LossTensor is inf or nan : Tensor had NaN values

ymodak commented 5 years ago

@aneliaangelova Can you please take a look? Thanks!

aneliaangelova commented 5 years ago

Are these images actually for training? For example, the masks are expected to be single channel, where each object gets a label ID (not color coded).

kadut99 commented 5 years ago

Are these images actually for training? For example, the masks are expected to be single channel, where each object gets a label ID (not color coded).

Thank you for replying to my issue. Yes, those are the images I use for training, in the XXXX-fseg.png file format. Does "single channel" mean a grayscale image with a different value between 1 and 255 for every different object? I tried many example datasets, like in the pictures below: First image : 1

Second image : 2

Third image : 3

Fourth image : 4

Update: Fifth image: This image is already 1 channel (bit depth = 8), but it still shows the same error. Do you have any suggestion about that? 0000000001-fseg - Copy

Of all these images, which one is correct? I tried to train with these kinds of datasets but still get the error LossTensor is inf or nan : Tensor had NaN values. Is it possible that a small mask causes the error, since another issue said that small masked objects cause it? Or do you have an example image for training (including the RGB image and the segmentation mask)?

And to run alignment.py, the input to the alignment is 3 frames of the instance segmentation dataset, right? Then we combine the 3 aligned images into 1 image of 3 sequential frames named XXXXX-fseg.png?

aneliaangelova commented 5 years ago

The fseg images should be as in the "fifth" case. It is best if the IDs start from 1, i.e. 1 for the first object, 2 for the second, etc. You can also debug by feeding in all-black images; this way you can tell whether the problem comes from the seg masks. Many people have run this without issues (with the exception of one NaN reported a while ago), so it should be something simple. Please try to debug on your side.
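
Not from the struct2depth codebase, just a minimal sketch of producing masks in the format described above (single channel, IDs 1, 2, ... per object, 0 for background). The file names are placeholders, and the all-black debug image can be produced the same way from an all-zero array.

```python
import numpy as np
from PIL import Image

def color_mask_to_label_ids(color_mask_path, out_path):
    """Map each unique color in a color-coded mask to an integer ID (1, 2, ...); black stays 0."""
    rgb = np.array(Image.open(color_mask_path).convert('RGB'))
    labels = np.zeros(rgb.shape[:2], dtype=np.uint8)
    next_id = 1
    for color in np.unique(rgb.reshape(-1, 3), axis=0):
        if not color.any():                      # skip the black background
            continue
        labels[np.all(rgb == color, axis=-1)] = next_id
        next_id += 1
    Image.fromarray(labels, mode='L').save(out_path)   # 8-bit, single channel

# Placeholder file names:
color_mask_to_label_ids('0000000001-seg-color.png', '0000000001-fseg.png')

# For the all-black debugging test suggested above:
# Image.fromarray(np.zeros((128, 416), dtype=np.uint8), mode='L').save('0000000001-fseg.png')
```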

kadut99 commented 5 years ago

The fseg images should be as in the "fifth" case. It is best if the IDs start from 1, i.e. 1 for the first object, 2 for the second, etc. You can also debug by feeding in all-black images; this way you can tell whether the problem comes from the seg masks. Many people have run this without issues (with the exception of one NaN reported a while ago), so it should be something simple. Please try to debug on your side.

Okay, thank you for the information. I will recheck my segmentation dataset first.

zhangzhensong commented 5 years ago

@kadut99 Hi, have you solved the problem?

zhangzhensong commented 5 years ago

The bug might result from small masks and can be solved by manually checking the training data. Specifically, we set batch_size=1 and shuffle=False when initializing the train model, and delete the corresponding problematic lines from "train.txt". In my processed KITTI dataset, I needed to delete about 20 lines. Please comment and let me know if there are any other automatic methods.

 # In train.py: build the training model with shuffle=False so the input order
 # matches train.txt and the offending example can be located by step number.
 train_model = model.Model(data_dir=FLAGS.data_dir,
                            shuffle=False,
                            file_extension=FLAGS.file_extension,
                            is_training=True,
                            learning_rate=FLAGS.learning_rate,
                            beta1=FLAGS.beta1,
                            reconstr_weight=FLAGS.reconstr_weight,
                            smooth_weight=FLAGS.smooth_weight,
                            ssim_weight=FLAGS.ssim_weight,
                            icp_weight=FLAGS.icp_weight,
                            batch_size=FLAGS.batch_size,
                            img_height=FLAGS.img_height,
                            img_width=FLAGS.img_width,
                            seq_length=FLAGS.seq_length,
                            architecture=FLAGS.architecture,
                            imagenet_norm=FLAGS.imagenet_norm,
                            weight_reg=FLAGS.weight_reg,
                            exhaustive_mode=FLAGS.exhaustive_mode,
                            random_scale_crop=FLAGS.random_scale_crop,
                            flipping_mode=FLAGS.flipping_mode,
                            depth_upsampling=FLAGS.depth_upsampling,
                            depth_normalization=FLAGS.depth_normalization,
                            compute_minimum_loss=FLAGS.compute_minimum_loss,
                            use_skip=FLAGS.use_skip,
                            joint_encoder=FLAGS.joint_encoder,
                            handle_motion=FLAGS.handle_motion,
                            equal_weighting=FLAGS.equal_weighting,
                            size_constraint_weight=FLAGS.size_constraint_weight)
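
On the question of automatic methods: a rough, hypothetical screening script (not part of the repo; the glob pattern and file layout are assumptions) that flags -fseg.png masks containing an object whose bounding box height is zero (see the discussion of y_max - y_min below). Lines of train.txt pointing at the flagged files can then be dropped instead of checking everything by hand.

```python
import glob
import numpy as np
from PIL import Image

def degenerate_object_ids(fseg_path):
    """Return label IDs whose mask spans a single pixel row (y_max == y_min)."""
    seg = np.array(Image.open(fseg_path))
    if seg.ndim == 3:                        # masks are expected to be single channel
        seg = seg[..., 0]
    bad = []
    for obj_id in np.unique(seg):
        if obj_id == 0:                      # 0 is background
            continue
        ys, _ = np.nonzero(seg == obj_id)
        if ys.max() == ys.min():
            bad.append(int(obj_id))
    return bad

# Placeholder path pattern; adjust to your processed-data layout.
for path in sorted(glob.glob('kitti_processed/*/*-fseg.png')):
    bad = degenerate_object_ids(path)
    if bad:
        print(path, 'contains zero-height object masks:', bad)
```
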
aj96 commented 5 years ago

@zhangzhensong I can confirm that the inf loss is due to small masks. Because the pixel height of a mask is calculated as y_max - y_min in model.py, if there is a mask where y_max and y_min are the same, we end up dividing by zero. A quicker fix would be to drop masks with identical y_max and y_min when generating your seg images. However, isn't the real problem that the pixel height is being calculated incorrectly? Shouldn't it be y_max - y_min + 1? If y_max and y_min are the same, you have a single row of pixels, and one row of pixels has a height of one, not zero. Can @aneliaangelova or @VincentCa comment on this?
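
For illustration, a simplified sketch of the arithmetic above (this is not the actual model.py code; it just assumes the constraint loss divides by the mask's pixel height):

```python
import numpy as np

mask = np.zeros((128, 416), dtype=bool)
mask[40, 100:120] = True                          # an object occupying a single pixel row

ys, _ = np.nonzero(mask)
height = float(ys.max() - ys.min())               # 0.0 for a one-row mask
height_plus_one = float(ys.max() - ys.min() + 1)  # 1.0 with the suggested +1

print(np.float32(1.0) / height)                   # inf -> propagates into the loss as inf/nan
print(np.float32(1.0) / height_plus_one)          # 1.0, finite
```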

aj96 commented 5 years ago

Btw, @zhangzhensong, can you clarify how you determined which training examples were causing the inf loss? I did as you suggested, setting shuffle to false and the batch size to 1, but no matter how I extract the bad training example, I can't reproduce the inf loss by training with only that one example. I tried taking the example at the line number in the text file corresponding to the step number; I even tried using the image path given by img_reader.read() in reader.py. I think I don't understand how the training examples are being read in.

wenxin-bupt commented 5 years ago

@zhangzhensong I can confirm that the inf loss is due to small masks. Because the pixel height of a mask is calculated as y_max - y_min in model.py, if there is a mask where y_max and y_min are the same, we end up dividing by zero. A quicker fix would be to drop masks with identical y_max and y_min when generating your seg images. However, isn't the real problem that the pixel height is being calculated incorrectly? Shouldn't it be y_max - y_min + 1? If y_max and y_min are the same, you have a single row of pixels, and one row of pixels has a height of one, not zero. Can @aneliaangelova or @VincentCa comment on this?

This solved my "NaN" problem! In the trace log in my case, only the parameters in the depth computation graph were traced out, and the constraint loss only affects the depth computation.

nowburn commented 5 years ago

@kadut99 Hi, I have the same problem as you. How did you solve it? Thanks!

liyingliu commented 5 years ago

@nowburn I am having the same NaN problem, and y_max - y_min + 1 is not helping in my case. Have you tried y_max - y_min + 1?

nowburn commented 5 years ago

@liyingliu Yes, I did that, but it doesn't help. However, when I feed in all-black images (xx-fseg.png), it works. So the problem still comes from the seg masks, but I don't know how to solve it yet.

wenxin-bupt commented 5 years ago

@nowburn @liyingliu Have you tried ignoring the constraint loss?

kadut99 commented 5 years ago

I've successfully run the training. You should pay special attention to your segmentation images (the -fseg.png files). Make sure that your segmentation image is a 1-channel image (as shown in this image: https://user-images.githubusercontent.com/48504269/54731145-7e3cce00-4bc7-11e9-9518-9d028ee511f7.png).
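
If it helps, a quick sanity check (just a hypothetical snippet, not from the repo; the file name is a placeholder) that a -fseg.png really is a single-channel 8-bit label image rather than a color-coded one:

```python
import numpy as np
from PIL import Image

img = Image.open('0000000001-fseg.png')      # placeholder file name
print(img.mode)                              # expect 'L' (8-bit, 1 channel), not 'RGB'
print(np.unique(np.array(img)))              # expect small integer label IDs: 0, 1, 2, ...
```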

liyingliu commented 5 years ago

@wenxin-bupt I only get NaN when I enable the constraint loss. By the way, I am able to train the model with moving objects (no constraint loss). However, I am not able to reproduce a result similar to the paper (my abs rel is 0.1587, quite far from the authors' 0.1412). Have you been able to reproduce a similar result?

wenxin-bupt commented 5 years ago

@wenxin-bupt I only get NaN when I enable the constraint loss. By the way, I am able to train the model with moving objects (no constraint loss). However, I am not able to reproduce a result similar to the paper (my abs rel is 0.1587, quite far from the authors' 0.1412). Have you been able to reproduce a similar result?

No. I canceled the training early. And I got 0.1566.

liyingliu commented 5 years ago

@wenxin-bupt Thanks for the information. So your NaN problem also exists only when you enable the constraint loss? And were you able to fix it with y_max - y_min + 1?

wenxin-bupt commented 5 years ago

@wenxin-bupt Thanks for the information. So your NaN problem also exists only when you enable the constraint loss? And were you able to fix it with y_max - y_min + 1?

Yes.

nowburn commented 5 years ago

@kadut99 My segmentation image is a 1-channel image like this: (sample). Do I need to disable the constraint loss to solve the NaN? Can you share some training data and the fixed code? Thanks!

kadut99 commented 5 years ago

@kadut99 My segmentation image is a 1-channel image like this: (sample). Do I need to disable the constraint loss to solve the NaN? Can you share some training data and the fixed code? Thanks!

I trained from this repository: https://github.com/ferdyandannes/struct2depthv2. I made some changes following that repo.

ezorfa commented 5 years ago

@kadut99 Hi! I have a few questions and would be grateful for your help:

1) "And to perform alignment.py, the input of the alignment is 3 frames of the instance segmentation dataset right?, then we combine the 3 aligned image into 1 image of 3 sequence frames with the names XXXXX-fseg.png?" .... Do you stand with this procedure?

2) Could you please share the code related to MaskRCNN that you used to get segmentation masks?

Thank you!

kadut99 commented 5 years ago

@kadut99 Hi! I have a few questions and would be grateful for your help:

  1. "And to run alignment.py, the input to the alignment is 3 frames of the instance segmentation dataset, right? Then we combine the 3 aligned images into 1 image of 3 sequential frames named XXXXX-fseg.png?" .... Do you stand by this procedure?
  2. Could you please share the code related to MaskRCNN that you used to get segmentation masks?

Thank you!

You can open this link: https://github.com/ferdyandannes/struct2depth_train (that's my friend's repository); you can train with it. Run the following steps (a sketch of the mask-combining step is at the end of this comment):

  1. demo.ipynb (generate the mask images)
  2. gen_data_custom.py (combine 3 RGB images)
  3. gen_data_custom_fseg.py (combine 3 mask images and perform alignment)
  4. train it (command in use.txt)

But you need to adjust the folder paths.
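
For the mask-combining step, a rough sketch of the idea, assuming the three aligned mask frames are stacked side by side into one wide image (my reading of the "combine 3 mask images" step; the file names below are placeholders, not the repo's actual naming):

```python
import numpy as np
from PIL import Image

# Placeholder input names for the three aligned single-channel mask frames.
frames = [np.array(Image.open(p)) for p in
          ('aligned_0.png', 'aligned_1.png', 'aligned_2.png')]

triplet = np.concatenate(frames, axis=1)     # width becomes 3 * img_width
Image.fromarray(triplet.astype(np.uint8)).save('0000000001-fseg.png')
```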

ezorfa commented 5 years ago

@kadut99 Please give me one more piece of information. Could you tell me how long it took you to train this model (struct2depth), and whether you replicated it as-is or made some improvements?

kadut99 commented 5 years ago

@kadut99 Please give me one more piece of information. Could you tell me how long it took you to train this model (struct2depth), and whether you replicated it as-is or made some improvements?

I forgot how long it took. I trained for more than 3,000,000 steps, using a ResNet-18 ImageNet pre-trained model.

mikibella commented 2 years ago

@kadut99 Hey kadut, thank you for your tips. I'm also doing some testing with the struct2depth framework. I'm having trouble finding the ResNet-18 pretrained model for TensorFlow. I found some tools for converting from PyTorch to TF, but they don't seem to do it right, or I'm making a mistake. Would you mind sharing your ResNet-18 checkpoint?