kadut99 closed this issue 5 years ago
@aneliaangelova Can you please take a look? Thanks!
Are these images actually for training? For example, the masks are expected to be single channel, where each object gets a label id (not color coded).
Thanks for replying to my issue. Yes, those are the images I used for training, in the XXXX-fseg.png format. Does "single channel" mean a grayscale image with a different value between 1 and 255 for each object? I tried many dataset examples, like the pictures below: First image:
Second image :
Third image :
Fourth image :
Update: Fifth image: This image is already single channel (bit depth = 8), but it still triggers the same error. Do you have any suggestions about that?
Of all these images, which one is correct? I tried to train with these kinds of datasets but still get an error: LossTensor is inf or nan : Tensor had NaN values. Is it possible that a small mask causes the error? Another issue said that small masked objects cause it. Or do you have example images for training (RGB image and segmentation mask)?
And to run alignment.py, the input is 3 frames of the instance segmentation dataset, right? Then we combine the 3 aligned images into 1 image of 3 sequential frames named XXXXX-fseg.png?
The fseg images should be as in the "fifth" case. It is best if the ids start from 1, i.e. 1 for the first object, 2 for the second, etc. You can also debug by feeding in all-black images; this way you can tell whether the problem comes from the seg masks. Many people have run this without issues (with the exception of one NaN reported a while ago), so it should be something simple. Please try to debug on your side.
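The format described above can be checked programmatically. The sketch below (not part of the repo; `check_fseg` and `make_black_mask` are hypothetical helper names) verifies that an -fseg.png file is single channel with small sequential label ids, and generates the all-black debug masks suggested above:

```python
import numpy as np
from PIL import Image

def check_fseg(path):
    """Verify that an -fseg.png mask is single channel and report its label ids.

    Objects should be labeled 1, 2, 3, ... with 0 as background.
    """
    arr = np.array(Image.open(path))
    assert arr.ndim == 2, "mask must be single channel, got shape %s" % (arr.shape,)
    ids = np.unique(arr)
    print("label ids found:", ids)
    return ids

def make_black_mask(width, height, out_path):
    """Write an all-black (background-only) mask for debugging.

    If training runs cleanly with these, the NaN comes from the real seg masks.
    """
    Image.fromarray(np.zeros((height, width), dtype=np.uint8)).save(out_path)
```

For example, `make_black_mask(416, 128, "000001-fseg.png")` produces a debug mask whose only label id is 0.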
Okay, thank you for the information. I will recheck my segmentation dataset first.
@kadut99 Hi, have you solved the problem?
The bug might result from small masks and can be solved by manually checking the training data. Specifically, we set batch_size=1 and shuffle=False when initializing the train model, then delete the corresponding problematic lines from the file "train.txt". In my processed KITTI dataset, I needed to delete about 20 lines. Please comment and let me know if there are any other automatic methods.
```python
train_model = model.Model(data_dir=FLAGS.data_dir,
                          shuffle=False,
                          file_extension=FLAGS.file_extension,
                          is_training=True,
                          learning_rate=FLAGS.learning_rate,
                          beta1=FLAGS.beta1,
                          reconstr_weight=FLAGS.reconstr_weight,
                          smooth_weight=FLAGS.smooth_weight,
                          ssim_weight=FLAGS.ssim_weight,
                          icp_weight=FLAGS.icp_weight,
                          batch_size=FLAGS.batch_size,
                          img_height=FLAGS.img_height,
                          img_width=FLAGS.img_width,
                          seq_length=FLAGS.seq_length,
                          architecture=FLAGS.architecture,
                          imagenet_norm=FLAGS.imagenet_norm,
                          weight_reg=FLAGS.weight_reg,
                          exhaustive_mode=FLAGS.exhaustive_mode,
                          random_scale_crop=FLAGS.random_scale_crop,
                          flipping_mode=FLAGS.flipping_mode,
                          depth_upsampling=FLAGS.depth_upsampling,
                          depth_normalization=FLAGS.depth_normalization,
                          compute_minimum_loss=FLAGS.compute_minimum_loss,
                          use_skip=FLAGS.use_skip,
                          joint_encoder=FLAGS.joint_encoder,
                          handle_motion=FLAGS.handle_motion,
                          equal_weighting=FLAGS.equal_weighting,
                          size_constraint_weight=FLAGS.size_constraint_weight)
```
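As an attempt at the "automatic method" asked about above, a sketch like the following (my own helper, `find_degenerate_masks` is not part of the repo) could scan the -fseg.png files for objects whose bounding box collapses to a single pixel row or column, which is what triggers the divide-by-zero NaN, so the matching lines can be dropped from train.txt without manual inspection:

```python
import numpy as np
from PIL import Image

def find_degenerate_masks(fseg_paths):
    """Return the -fseg.png paths containing an object whose bounding box
    is a single pixel row or column (y_max == y_min or x_max == x_min).

    Such masks give a zero pixel height/width under the y_max - y_min
    computation and lead to inf/NaN in the size-constraint loss.
    """
    bad = []
    for path in fseg_paths:
        seg = np.array(Image.open(path))
        for obj_id in np.unique(seg):
            if obj_id == 0:  # 0 is background, skip it
                continue
            ys, xs = np.where(seg == obj_id)
            if ys.max() == ys.min() or xs.max() == xs.min():
                bad.append(path)
                break
    return bad
```

Feeding it the mask paths listed in train.txt would flag the problematic lines to delete.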
@zhangzhensong I can confirm that the inf loss is due to small masks. Because the pixel height of a mask is calculated as y_max - y_min in model.py, if there is a mask where y_max and y_min are equal, we end up dividing by zero. A quicker fix would be to drop masks whose y_max and y_min are equal when generating your seg images. However, isn't the real problem that the pixel height is being calculated incorrectly? Shouldn't the pixel height be y_max - y_min + 1? If y_max and y_min are equal, you have a single row of pixels, and one row of pixels has a height of one, not zero. Can @aneliaangelova or @VincentCa comment on this?
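The off-by-one being described can be reproduced in a few lines. This is an illustrative snippet (the variable names are mine, not model.py's), showing how a one-row object yields a zero height under y_max - y_min:

```python
import numpy as np

# A mask whose only object occupies a single pixel row, so y_max == y_min.
seg = np.zeros((8, 8), dtype=np.uint8)
seg[3, 2:6] = 1

ys, _ = np.where(seg == 1)
y_min, y_max = ys.min(), ys.max()

height_buggy = y_max - y_min      # 0: a later division by this gives inf/NaN
height_fixed = y_max - y_min + 1  # 1: a single row is one pixel tall
```

With the `+ 1`, degenerate masks no longer produce a zero denominator, though whether that matches the intended geometry of the size-constraint loss is the open question above.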
Btw, @zhangzhensong, can you clarify how you determined which training examples were causing the inf loss? I set shuffle to false and the batch size to 1 as you suggested, but no matter how I extract the bad training example, I can't reproduce the inf loss by training on that one example alone. I tried taking the example at the line number in the text file corresponding to the step number; I even tried using the image path given by img_reader.read() in reader.py. I think I don't understand how the training examples are being read in.
Solved my "NaN" problem! In my case, the trace log shows that only the parameters on the depth computation graph are traced, and the constraint loss only affects the depth computation.
@kadut99 Hi, I have the same problem as you. How did you solve it? Thanks!
@nowburn I am having the same NaN problem, and y_max - y_min + 1 does not help in my case. Have you tried y_max - y_min + 1?
@liyingliu Yes, I tried that, but it doesn't help. However, when I feed in all-black images (xx-fseg.png), it works. So the problem still comes from the seg masks, but I don't know how to solve it yet.
@nowburn @liyingliu Have you tried ignoring the constraint loss?
I've successfully run the training. Pay special attention to your segmentation images (the -fseg.png files): make sure each segmentation image is a single-channel image (like the one shown here: https://user-images.githubusercontent.com/48504269/54731145-7e3cce00-4bc7-11e9-9518-9d028ee511f7.png).
@wenxin-bupt I only get NaN when I enable the constraint loss. By the way, I am able to train the model with moving objects (no constraint loss). However, I am not able to reproduce a result similar to the paper (my abs rel is 0.1587, quite far from the authors' 0.1412). Were you able to reproduce a similar result?
No. I canceled the training early. And I got 0.1566.
@wenxin-bupt Thanks for the information. So your NaN problem also exists only when you enable the constraint loss? And were you able to fix it with y_max - y_min + 1?
Yes.
@kadut99 My segmentation image is a single-channel image like this (sample). Do I need to disable the constraint loss to solve the NaN? Can you share some training data and the fixed code? Thanks!
I trained from this repository: https://github.com/ferdyandannes/struct2depthv2. I made some changes following that repo.
@kadut99 Hi! I have a few questions and would be grateful for your help:
1) "And to perform alignment.py, the input of the alignment is 3 frames of the instance segmentation dataset right?, then we combine the 3 aligned image into 1 image of 3 sequence frames with the names XXXXX-fseg.png?" .... Do you stand by this procedure?
2) Could you please share the MaskRCNN code you used to get the segmentation masks?
Thank you!
You can open this link: https://github.com/ferdyandannes/struct2depth_train. That's my friend's repository; you can train with it by running the following steps.
But you need to make adjustments to the folders.
@kadut99 Please give me one more piece of information. Could you tell me how long it took you to train this model (struct2depth), whether you replicated it or made some improvements?
I forgot exactly how long it took. I trained for more than 3,000,000 steps, using a ResNet-18 ImageNet pre-trained model.
@kadut99 Hey kadut, thank you for your tips. I'm also doing some testing with the struct2depth framework. I'm having trouble finding the ResNet-18 pretrained model for TensorFlow. I found some PyTorch-to-TF conversion tools, but they don't seem to do it right, or I'm making a mistake. Would you mind sharing your ResNet-18 checkpoint?
Update: I have solved my previous problem, which was caused by my GPU running out of memory; I changed batch_size to 1. But now it shows another error: there is a NaN value during training. As https://github.com/tensorflow/models/issues/6043 says, this happens because some masks are too small. Has anyone solved this problem?
My dataset :
Error : InvalidArgumentError (see above for traceback): LossTensor is inf or nan : Tensor had NaN values