tensorflow / models

Models and examples built with TensorFlow

struct2depth training data prep request for documentation #6173

Closed saeed68gm closed 4 years ago

saeed68gm commented 5 years ago

Describe the problem

I did not find enough documentation on preparing the data for training. I am trying to replicate the results for struct2depth on the KITTI or Cityscapes dataset. However, I do not know exactly how to generate the data in the correct format. Mainly, I would like to know the following:

  1. How to generate the train.txt and valid.txt files for the dataset (currently I am using the script in the vid2depth subdirectory to do that).
  2. How to generate the segmentation results, and what naming convention is used? (Is it using a Mask RCNN model? Should the files be named -fseg.png?)
  3. It is mentioned that "It is assumed that motion masks are already generated and stored as images". Can you explain how to do this? What is the naming convention, and how are these masks generated?

Thank you for sharing the results of your work. This is a really impressive paper and your response is appreciated. @aneliaangelova @VincentCa

System information

tensorflowbutler commented 5 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

saeed68gm commented 5 years ago

Have I written custom code: No
OS Platform and Distribution: CentOS 7
TensorFlow installed from: Binary
TensorFlow version: v1.11.0-0-gc19e29306c 1.11.0
Bazel version: N/A
CUDA/cuDNN version: 9.0/7.0
GPU model and memory: Volta V100 32GB
Exact command to reproduce: python gen_data_kitti.py or python gen_data_city.py

VincentCa commented 5 years ago

Hi,

  1. Please refer to the method compile_file_list in reader.py to understand the structure of the input text files listing individual frame triplets. This function is equivalent to vid2depth's compile_file_list in reader.py. Basically, each line contains a folder path and then a filename (without extension; the extension is passed as a flag), separated by a single space. For every input image you can have accompanying supplemental files, like this:

Input file: 0000.png
Related aligned segmentation map: 0000-fseg.png (only needed if you train a motion model)
Related camera intrinsics: 0000_cam.txt

The related line in your input text file would look like "some/sub/folder 0000" and your file_extension flag would be "png" in order to read it properly.

  2. We used a pre-trained Mask-RCNN (on a different dataset). Note that running it frame-by-frame gives you instance-level labels for each frame, but they are not temporally consistent, i.e. the same object will almost never have the same instance ID assigned across frames. Use alignment.py to align them, or (preferably) work on a nicer method to make the instance IDs temporally consistent (see the sketch after this list for the kind of matching this involves). We call the Mask-RCNN raw output X-seg.png, while we call the aligned masks X-fseg.png. You need to use the latter for the model; it definitely expects aligned labels.

  3. Refer also to 1) and 2); you need to run an instance segmentation model to obtain masks and then perform the alignment. Make sure to save the masks in a lossless image format to avoid compression artifacts compromising the label IDs.
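
To illustrate what the alignment in point 2 has to accomplish, here is a rough sketch (my own simplification, not the actual logic in alignment.py) that carries an instance ID over from the previous frame whenever the masks overlap sufficiently, and assigns a fresh ID otherwise:

    import numpy as np

    def align_ids(prev_seg, cur_seg, iou_threshold=0.3):
        """Relabel instances in cur_seg so that objects overlapping an
        instance in prev_seg keep that instance's ID. Both inputs are
        2-D arrays of instance IDs, with 0 meaning background."""
        aligned = np.zeros_like(cur_seg, dtype=np.int32)
        next_free_id = int(max(prev_seg.max(), cur_seg.max())) + 1
        for cur_id in np.unique(cur_seg):
            if cur_id == 0:
                continue
            cur_mask = cur_seg == cur_id
            best_iou, best_id = 0.0, None
            for prev_id in np.unique(prev_seg):
                if prev_id == 0:
                    continue
                prev_mask = prev_seg == prev_id
                iou = np.logical_and(cur_mask, prev_mask).sum() / float(
                    np.logical_or(cur_mask, prev_mask).sum())
                if iou > best_iou:
                    best_iou, best_id = iou, prev_id
            if best_iou >= iou_threshold:
                aligned[cur_mask] = best_id
            else:
                aligned[cur_mask] = next_free_id
                next_free_id += 1
        return aligned

The real alignment has to be consistent across all three frames of a triplet, but the core idea is the same.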

Hope this helps! Vincent

ramanishka commented 5 years ago

Hi,

@VincentCa

We used a pre-trained Mask-RCNN (on a different dataset)

Which dataset? Which Mask-RCNN (backbone)?

Is it somehow possible to reproduce the quantitative results from the paper using the released pretrained models? As I understand it, instance segmentation masks are necessary in this case.

Thanks!

saeed68gm commented 5 years ago

@VincentCa Thank you for your response. It was very helpful.

It seems like whether you set the "handle_motion" flag to true or false, the script still looks for the segmentation masks. So I will put in a workaround to avoid that for now.

For generating the motion masks, are there any requirements on the color codes? Do we need to encode the colors in a particular way, or can we just use an RGB code for every semantic label?

VincentCa commented 5 years ago

@ramanishka For inference, instance segmentation masks are not needed if you are only interested in depth and/or odometry prediction. They are only needed if you want to specifically look at object motion prediction, in order to feed the input properly to the object motion estimator network.

@saeed68gm Yes, feel free to just work around that issue. There is no specific requirement on the color codes you use; just make sure you save the masks in a lossless format (PNG), as I described earlier. Simply assign the same value across all channels: 0 always stands for background, so you can use 1-255 for the different object instances, identical across all channels. As I also mentioned, alignment is crucial. For each triplet you provide to the network, it expects every instance ID that appears in one subframe to also appear in every neighboring one, and it further expects these to actually correspond to the same object. If within the same triplet an instance ID appears in mask t but not in mask t+1, this would be a problem. However, given that segmentation models are not perfect, don't worry if there are some triplets where masks are missing for an object throughout.
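
To make that format concrete, here is a minimal sketch (assuming you already have one boolean mask per aligned instance; imageio and the helper name are my own choices, not part of the repo) that writes a mask image in the expected style:

    import numpy as np
    import imageio

    def save_seg_mask(instance_masks, out_path):
        """instance_masks: list of HxW boolean arrays, one per aligned instance.
        Writes a PNG where background is 0 and instance i gets the value i + 1,
        identical across all three channels."""
        height, width = instance_masks[0].shape
        seg = np.zeros((height, width), dtype=np.uint8)
        for i, mask in enumerate(instance_masks):
            seg[mask] = i + 1  # values 1..255; 0 stays background
        imageio.imwrite(out_path, np.stack([seg] * 3, axis=-1))  # PNG is lossless

    # Example: save_seg_mask(masks_for_frame, '0000-fseg.png')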

amanrajdce commented 5 years ago

@VincentCa Can you share the standard training parameters that can be used to replicate the results from the paper when training the model from scratch? Thanks for your help. I have been able to successfully run the training with object motion masks, and am now interested in replicating the paper's results and then trying my own ideas.

VincentCa commented 5 years ago

@amanrajdce Please refer to the paper to find the parameter settings we used. If you think some info is missing just let me know. @ferdyandannes You can simply disable using the joint encoder. If you pass --joint_encoder false as a flag, this error shouldn't occur.

amanrajdce commented 5 years ago

@VincentCa

  1. Can you please provide the evaluation script that is used in your repo?
  2. For the online refinement process, how do you generate the triplets? Can you share the script for that?
  3. My understanding is that the checkpoint model provided with the repo on the project website is the final model after the online-refinement learning, right? Could you please confirm this.

Thank you very much.

aneliaangelova commented 5 years ago

The evaluation script is from the SfMLearner work of T. Zhou et al.: https://github.com/tinghuiz/SfMLearner

amanrajdce commented 5 years ago

@amanrajdce Please refer to the paper to find the parameter settings we used. If you think some info is missing just let me know.

I am interested in the number of iterations for model training; I can't find this information in the paper.

amanrajdce commented 5 years ago

@VincentCa Can you give some info about triplet_list_file="$data_dir/test_files_eigen_triplets.txt" and triplet_list_file_remains="$data_dir/test_files_eigen_triplets_remains.txt" in the online-refinement stage? It looks like you use test_files_eigen.txt and generate triplets from it. Is my understanding correct? Also, if you have a script for this, could you share it?

lucasjinreal commented 5 years ago

@amanrajdce Have you managed to train this?

PKUCheng commented 5 years ago

@VincentCa How can Mask-RCNN segment dynamic objects? It can only produce instance masks, and an instance is not necessarily a dynamic object. Have you considered this issue?

aneliaangelova commented 5 years ago

Yes, definitely. We use all masks (of moving and non-moving objects) and estimate (meaning learn to estimate) each individual object's motion. If the object is not moving, the estimated motion will be 0, which is exactly as desired. During inference no object masks are needed, as you can directly run the depth network only.

amanrajdce commented 5 years ago

@amanrajdce Have you managed to train this?

Yes, I was able to train this model as well as predict on a set of images. However, I wasn't able to run the online refinement stage, due to some information missing in the paper on how to do it. I have asked the authors the same question in this thread but I haven't gotten a response yet.

amanrajdce commented 5 years ago

@VincentCa

  1. Can you please provide the evaluation script that is used in your repo?
  2. For the online refinement process, how do you generate the triplets? Can you share the script for that?
  3. My understanding is that the checkpoint model provided with the repo on the project website is the final model after the online-refinement learning, right? Could you please confirm this.

Thank you very much.

@aneliaangelova Could you please provide some response to 2. and 3.? I understand if it is not possible to share the code/script, but if you could at least describe what these triplet files are supposed to contain, that would be great.

VincentCa commented 5 years ago

Hi,

  1. You can refer to the evaluation script of SfMLearner (https://github.com/tinghuiz/SfMLearner) for inspiration on how to evaluate on KITTI. Make sure to replicate the normalization steps, crops, etc. as described in the paper. The vast majority of related work uses the exact same evaluation procedure and parameters (e.g. cut-off at 50/80 m, mean alignment), so their implementations might be helpful, too.
  2. See reader.py for reference; triplets can be stored in ordinary image files by stitching seq_length frames together horizontally (see the sketch after this list). They are split up and stacked along the channel axis; you can find this in unpack_images(). If you are using your own dataset, it can certainly be helpful to try different degrees of temporal subsampling. Ideally it matches the subsampling you applied during training, of course. If, however, the movements are slower/faster during inference, it might be a good idea to adjust, or to implement an adaptive frame rate, because if a triplet contains no movement at all, or only very subtle movement, (almost) no training signal is provided to the network.
  3. No, it is important to note that online refinement is applied during inference only. We never save checkpoints after running refinement, as the goal is only to allow online adaptation to produce higher-quality inference results, not to improve the network weights persistently.
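
As an illustration of the stitching described in point 2 (a simplified sketch; the 416x128 resolution, the filenames and the use of PIL are my assumptions, not the project's actual preprocessing code):

    from PIL import Image

    def stitch_triplet(frame_paths, out_path, width=416, height=128):
        """Resize three consecutive frames and stitch them horizontally into a
        single (3 * width) x height image, matching the seq_length layout that
        reader.py later splits up again along the width axis."""
        assert len(frame_paths) == 3
        canvas = Image.new('RGB', (3 * width, height))
        for i, path in enumerate(frame_paths):
            frame = Image.open(path).resize((width, height), Image.BILINEAR)
            canvas.paste(frame, (i * width, 0))
        canvas.save(out_path)

    # Example: stitch_triplet(['0000.png', '0001.png', '0002.png'], 'triplet_0001.png')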

alexbarnett12 commented 5 years ago

@saeed68gm I am also trying to create a workaround for the "handle_motion" flag not working. Were you able to find a good workaround you could share?

saeed68gm commented 5 years ago

@alexbarnett12 It was not a very clean workaround; I just went into the code and commented out everything that was using the masks.

amanrajdce commented 5 years ago

@saeed68gm @alexbarnett12 You can look at my fork of this repository. I have made the necessary changes here: https://github.com/amanrajdce/struct2depth

KawtarM commented 5 years ago

@amanrajdce I can't find the page you're referring to; the link is not working. I tried to find the repo in your GitHub profile but it seems that it doesn't exist anymore!

tlalexander commented 5 years ago

@KawtarM I believe the link has moved to here: https://github.com/amanrajdce/struct2depth_pub

amanrajdce commented 5 years ago

Sorry! Please refer to the link that @tlalexander posted above. If you find something missing, let me know.

KawtarM commented 5 years ago

Thanks guys!

PKUCheng commented 5 years ago

Sorry! Please refer to the link that @tlalexander posted above. If you find something missing, let me know.

It seems like whether you set the "handle_motion" flag to true or false, the script still looks for the segmentation masks.

Have you solved this issue? It seems that your code still has this issue.

aneliaangelova commented 5 years ago

No, we haven't, but it is easy to fix: in that case a mask which is all 0's can be reused for all images.
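
Following that suggestion, here is a minimal sketch (the flat directory layout, the filename pattern and the use of imageio are assumptions) that writes an all-zero dummy -fseg mask next to every input image:

    import glob
    import os
    import numpy as np
    import imageio

    def write_dummy_masks(image_dir):
        """For every input image, write an all-zero -fseg.png of the same size,
        i.e. a mask that marks everything as background."""
        for path in glob.glob(os.path.join(image_dir, '*.png')):
            if path.endswith('-fseg.png'):
                continue
            img = imageio.imread(path)
            dummy = np.zeros((img.shape[0], img.shape[1], 3), dtype=np.uint8)
            imageio.imwrite(path.replace('.png', '-fseg.png'), dummy)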

nowburn commented 5 years ago

@VincentCa @amanrajdce Hi, I'm doing the data preprocessing and I ran the script alignment.py. The input (xx-seg.png) and output (xx-fseg.png) are as follows: https://github.com/nowburn/Show Is the result right, or is maybe the script wrong? Can you show me the correct output sample? Thanks!

VincentCa commented 5 years ago

Hi, you can’t pass the masks in that format - note that your current input is full RGB with half-transparent segmentation overlays and thus can’t be parsed correctly. The input to the script needs to be a simplified mask, where background is entirely black (0, 0, 0) and every different object in the image has a different shade of grey that is consistent across all channels - e.g. car1 has (255, 255, 255), car2 (254, 254, 254) and pedestrian1 (253, 253, 253).

Please also refer to other github issues covering this. Hope this helps!
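
As a quick sanity check before running alignment or training (a small helper of my own, not part of the repo), you can verify that a mask file follows this format:

    import numpy as np
    import imageio

    def check_mask_format(path):
        """Check that all channels are identical and report the instance IDs."""
        mask = imageio.imread(path)
        if mask.ndim == 3:
            if not (np.array_equal(mask[..., 0], mask[..., 1])
                    and np.array_equal(mask[..., 0], mask[..., 2])):
                raise ValueError('%s: channels differ; not a simplified mask' % path)
            mask = mask[..., 0]
        ids = np.unique(mask)
        print('%s: instance IDs %s (0 = background)' % (path, ids[ids != 0].tolist()))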

rpetar commented 5 years ago

It seems like whether you set the "handle_motion" flag to true or false, the script still looks for the segmentation masks.

Have you solved this issue? It seems that your code still has this issue.

No, we haven't, but it is easy to fix: in that case a mask which is all 0's can be reused for all images.

I'm not sure whether the segmentation masks are really used in the training process (when handle_motion is False), or whether they are just loaded in reader.py and not used at all (in reader.py the parameter handle_motion is not used, unlike in train.py). Because if they are not used, there is a simple solution (not the best, but the simplest).

Here is the part of the code where the process of loading the masks starts. I've managed to get around this problem by simply renaming -fseg. to ., so that instead of the masks the images are loaded again. The training process started successfully, but I'm not sure whether it is actually a good solution.

liyingliu commented 5 years ago

@amanrajdce Hi, you have mentioned that you were able to successfully run the training with object motion masks. Have you been able to reproduce results similar to what the authors report in the paper? For me, I am able to run the training with object motion masks but not able to reproduce similar results (my abs rel is 0.1587, quite far from the authors' 0.1412).

amanrajdce commented 5 years ago

@amanrajdce Hi, you have mentioned that you were able to successfully run the training with object motion masks. Have you been able to reproduce results similar to what the authors report in the paper? For me, I am able to run the training with object motion masks but not able to reproduce similar results (my abs rel is 0.1587, quite far from the authors' 0.1412).

You can find the numbers on page 5 here: https://github.com/amanrajdce/CSE-291D-Final-Project/blob/master/CSE_291D_Final_Project.pdf

nowburn commented 5 years ago

@VincentCa I'm sorry to disturb you again, but this problem really confuses me a lot. (1) My current problem is that a 'Tensor NaN' error happens during training. I referred to issue #6392, but none of the suggestions there work for me.

(2) My processed seg images are as follows (every moving object is masked like (1, 1, 1), (2, 2, 2), just as you described before). Specifically, I use Mask-RCNN to generate xx-seg.png and then use alignment.py to align them (for every 3 xx-seg.png images); the final xx-fseg.png looks like the attached fseg image. The Mask-RCNN code segment for saving the masked images:

    # Each of the N detected instances gets a value that is identical
    # across all channels: (1, 1, 1), (2, 2, 2), ...
    init_color = (1, 1, 1)
    for i in range(N):
        color = init_color
        init_color = [x + 1 for x in init_color]
(3) I can run the training by ignoring the 'motion constraint loss' in model.py (line 336), but then the trained model can't predict the depth of moving objects. model.py, line 336:

    # losses = tf.map_fn(
    #     get_losses, object_masks, dtype=tf.float32)
    # self.inf_loss += tf.reduce_mean(losses)

So how can I solve it? Is the xx-fseg.png right? Thank you again!

manuel-88 commented 5 years ago

Hi, I have the same issue as the comment above. I tried all suggestions from issue #6392 but I still get the "LossTensor is inf or nan: Tensor had NaN values" error. Can someone please help? Thanks.

I have now tried many things. Inputting black images worked! The next step was to generate images with square objects, which also worked. But if I change the ID of just one square in the sequence, or if I make the height of a square 1 pixel, it fails. So what exactly are the requirements on the labels?
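
For reference, a minimal sketch of that kind of synthetic debugging mask (sizes, IDs and filenames are my own choices), keeping one square instance at the same ID across all three frames of a triplet:

    import numpy as np
    import imageio

    def make_debug_triplet_masks(out_prefix, width=416, height=128):
        """Write three mask frames containing one slowly moving square with a
        consistent instance ID (1) across the whole triplet."""
        for t in range(3):
            mask = np.zeros((height, width), dtype=np.uint8)
            x = 50 + 10 * t  # the square shifts slightly between frames
            mask[40:80, x:x + 40] = 1
            imageio.imwrite('%s_%d-fseg.png' % (out_prefix, t),
                            np.stack([mask] * 3, axis=-1))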

Hope someone can help.

poornimajd commented 4 years ago

Hi, you can’t pass the masks in that format - note that your current input is full RGB with half-transparent segmentation overlays and thus can’t be parsed correctly. The input to the script needs to be a simplified mask, where background is entirely black (0, 0, 0) and every different object in the image has a different shade of grey that is consistent across all channels - e.g. car1 has (255, 255, 255), car2 (254, 254, 254) and pedestrian1 (253, 253, 253).

Please also refer to other github issues covering this. Hope this helps!

I have generated the segmentation masks as shown below. Is this the correct format for the input? Could someone please guide me on this?

Screenshot from 2020-03-10 17-51-16

aneliaangelova commented 4 years ago

Looks OK; it's hard to say from the image whether the IDs are reasonable.

aneliaangelova commented 4 years ago

You can run the code and follow along within it to check whether they are OK.

poornimajd commented 4 years ago

Thanks for the quick response! I am supposed to stack three of these images in sequence as the input, right?

VincentCa commented 4 years ago

Yes, that's right - the masks should follow the same format as the raw images.

poornimajd commented 4 years ago

Hello, great work @VincentCa, @aneliaangelova and team! I am trying to infer object motion and depth using the following command: python3 inference.py --depth --egomotion true --use_masks=true --input_dir ./intest2withseg/ --output_dir ./out/ --batch_size=1 --model_ckpt ./model-35620 (intest2withseg contains the images and corresponding masks shown in the attached screenshots.) I get this error while running the above command:

ValueError: operands could not be broadcast together with shapes (1,128,416,9) (3,128,416,27)

I tried giving just a color image (not stacked) instead of the first image shown, with the corresponding mask (same as the one shown above), but I still get the same error. Can someone let me know what is going wrong? Also, is there anything else to change besides adding the motion masks to get motion inference, compared to depth-only inference?

Any suggestion is greatly appreciated! Thanks

aneliaangelova commented 4 years ago

Hi, are you still having trouble with this? The error basically says that you are inputting an image tensor of the wrong shape.

poornimajd commented 4 years ago

Thanks for the reply! Yes, there was a problem with the input and it is now solved. Thank you.

ruili3 commented 4 years ago

Hi,

  1. You can refer to the evaluation script of SfMLearner (https://github.com/tinghuiz/SfMLearner) for inspiration on how to evaluate on KITTI. Make sure to replicate the normalization steps, crops, etc. as described in the paper. The vast majority of related work uses the exact same evaluation procedure and parameters (e.g. cut-off at 50/80 m, mean alignment), so their implementations might be helpful, too.
  2. See reader.py for reference; triplets can be stored in ordinary image files by stitching seq_length frames together horizontally. They are split up and stacked along the channel axis; you can find this in unpack_images(). If you are using your own dataset, it can certainly be helpful to try different degrees of temporal subsampling. Ideally it matches the subsampling you applied during training, of course. If, however, the movements are slower/faster during inference, it might be a good idea to adjust, or to implement an adaptive frame rate, because if a triplet contains no movement at all, or only very subtle movement, (almost) no training signal is provided to the network.
  3. No, it is important to note that online refinement is applied during inference only. We never save checkpoints after running refinement, as the goal is only to allow online adaptation to produce higher-quality inference results, not to improve the network weights persistently.

Thank you for your excellent work @VincentCa @aneliaangelova. I wonder which dataset you used to train the segmentation network? I plan to conduct depth estimation using instance segmentation results, but I'm not sure which trained Mask-RCNN model to choose for this task. Could you please provide a link to the pretrained Mask-RCNN model you used in struct2depth? Thanks a lot!

aneliaangelova commented 4 years ago

The dataset Mask-RCNN was previously trained on was MS-COCO, so it can be any model that works reasonably well for common objects. We have a follow-up work which does not even need good instance segmentations, only a crude box around them: http://openaccess.thecvf.com/content_ICCV_2019/papers/Gordon_Depth_From_Videos_in_the_Wild_Unsupervised_Monocular_Depth_Learning_ICCV_2019_paper.pdf The code is also open-sourced here: https://github.com/google-research/google-research/tree/master/depth_from_video_in_the_wild
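
For anyone looking for a starting point, one option (my own suggestion; not necessarily the exact model or framework the authors used) is a COCO-pretrained Mask R-CNN such as the one shipped with torchvision:

    import torch
    import torchvision
    from PIL import Image

    # COCO-pretrained Mask R-CNN with a ResNet-50 FPN backbone.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    def instance_masks(image_path, score_threshold=0.5):
        """Return a list of HxW boolean masks, one per confident detection."""
        img = torchvision.transforms.functional.to_tensor(
            Image.open(image_path).convert('RGB'))
        with torch.no_grad():
            output = model([img])[0]
        keep = output['scores'] > score_threshold
        # 'masks' has shape [N, 1, H, W] with soft values in [0, 1].
        return [(m[0] > 0.5).numpy() for m in output['masks'][keep]]

The resulting boolean masks still need the temporal alignment step described earlier before being saved as -fseg.png files.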

tom-bu commented 3 years ago

Which cityscapes data was used to get sequential frames? Is it the leftImg8bit_trainvaltest.zip, leftImg8bit_demoVideo.zip, or leftImg8bit_sequence_trainvaltest.zip?

VincentCa commented 3 years ago

Hi Tom, if I remember correctly from the original Cityscapes release, leftImg8bit_sequence_trainvaltest.zip (as the name also suggests) contains the subsequences.

alej-tech commented 2 years ago

Hi, I wonder which dataset you used to train struct2depth: the KITTI odometry dataset (color, 65 GB, data_odometry_color.zip) or the KITTI raw data?

Could you also be more specific about these settings: ckpt_dir="your/checkpoint/folder", data_dir="KITTI_SEQ2_LR/" # Set for KITTI

Thanks for any information.