@remy-byte Thank you for your interest in my project!
The `MAX_DEPTH` value in the evaluation script specifies that we evaluate only against ground-truth depths closer than `MAX_DEPTH` meters, following Monodepth2. The `max_depth` option in `options.py` defines the maximum depth value used to scale the estimated depth (though the model actually predicts normalized inverse depth). This parameter is used during training, and I guess this is what you're referring to. We set it to 100 because our training framework focuses on road environments, where most objects are within 100 meters. 100 is also a standard choice for training on the KITTI dataset, which is another reason for selecting this value.

I hope this helps! Please let me know if you have any other questions.
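For concreteness, this is how a Monodepth2-style pipeline typically turns the network's normalized sigmoid output into depth bounded by `max_depth`. The sketch below follows Monodepth2's `disp_to_depth()`; the `min_depth=0.1` default is Monodepth2's, and whether this repo uses exactly the same values is an assumption:

```python
import torch

def disp_to_depth(disp, min_depth, max_depth):
    """Monodepth2-style conversion of sigmoid disparity in [0, 1] to depth.

    The sigmoid output is mapped linearly into [1/max_depth, 1/min_depth]
    and then inverted, so max_depth bounds the farthest representable depth.
    """
    min_disp = 1.0 / max_depth  # disparity of the farthest point
    max_disp = 1.0 / min_depth  # disparity of the nearest point
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    depth = 1.0 / scaled_disp
    return scaled_disp, depth

# With max_depth=100, a sigmoid output of 0 maps to exactly 100 m:
disp = torch.zeros(1, 1, 192, 512)
_, depth = disp_to_depth(disp, min_depth=0.1, max_depth=100.0)
print(depth.max())  # tensor(100.)
```

Under this convention, changing `max_depth` rescales what the same network output means in meters, which is why the choice matters when mixing datasets.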
Great! Then I'll look into the code that is available for now, and once the LSP and the rest of the components are released I'll come back with a paper study and maybe other questions 😄. One more thing: since I'm first trying to reproduce the same training behaviour (on Cityscapes, for example) when swapping out the depth model, is there a limitation on the resolution you chose for training?
When training Monodepth2 on Cityscapes, we resized and cropped images to 512x192 (WxH) to align intrinsic parameters with the ones specified here. You can freely adjust these values, but please note that the car masks are provided only at the 512x192 resolution. If you choose a different image size, you'll need to resize and zero-pad the masks to fit the new resolution.
For reference, resizing and cropping are handled in `align_img_size()` in `datasets/cityscapes.py`, which might be helpful for understanding the processing steps.
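To illustrate the mask adjustment described above, here is a minimal sketch of resizing a 512x192 car mask to a new training resolution and zero-padding the remainder. The function name and the padding placement (top-left) are illustrative assumptions, not taken from the repo; the actual logic lives in `align_img_size()`:

```python
import numpy as np
from PIL import Image

def adapt_car_mask(mask_path: str, new_w: int, new_h: int,
                   base_w: int = 512, base_h: int = 192) -> np.ndarray:
    """Sketch: scale a 512x192 binary car mask to new_w x new_h,
    zero-padding whatever the aspect-ratio change leaves uncovered."""
    mask = Image.open(mask_path).convert("L")
    # Scale by the smaller ratio so the resized mask fits inside the target.
    scale = min(new_w / base_w, new_h / base_h)
    resized = mask.resize(
        (round(base_w * scale), round(base_h * scale)),
        resample=Image.NEAREST,  # nearest-neighbour keeps the mask binary
    )
    padded = np.zeros((new_h, new_w), dtype=np.uint8)
    padded[:resized.height, :resized.width] = np.array(resized)
    return padded
```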
Hello!
First of all, I'm sending my highest regards to the whole team behind this project. It's a very unique idea that brings metric scale into relative depth models. I do have some questions regarding some parts of it:

- Since code is provided only for training on the Cityscapes/KITTI datasets, along with the estimated car heights, I'm guessing that the LSP modules and some other parts are yet to be released?
- I'm very eager to try it out with DepthAnythingV2 as a way to fine-tune its weights into producing metric depths. Is it possible to integrate it into the main pipeline's architecture like MonoDepth?
- When previewing the evaluation scripts, I noticed a `MAX_DEPTH` value. From what I've tested with other models, such as Metric3D, that value acts as a global scale on the predicted depth, so setting a different value makes everything appear farther from or closer to you. I'm guessing this is also the case for the `MAX_DEPTH` here, since it is an outdoor environment. Is this a problem when training on mixed datasets? I'm wondering how this value is chosen and how it impacts the results (see the sketch after this list for the evaluation-side convention).
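For context on that last question: in the Monodepth2-style evaluation the maintainer describes above, `MAX_DEPTH` does not rescale predictions; it filters which ground-truth pixels count toward the metrics, with predictions clipped to the same range. A minimal sketch of that convention, using Monodepth2's KITTI defaults as an assumption rather than values taken from this repo:

```python
import numpy as np

MIN_DEPTH = 1e-3  # Monodepth2's KITTI defaults; this repo's values are assumed
MAX_DEPTH = 80.0

def abs_rel_error(pred_depth: np.ndarray, gt_depth: np.ndarray) -> float:
    """Compute abs-rel only on ground-truth pixels inside (MIN_DEPTH, MAX_DEPTH),
    clipping predictions to the same range, as in Monodepth2's evaluation."""
    mask = np.logical_and(gt_depth > MIN_DEPTH, gt_depth < MAX_DEPTH)
    pred = np.clip(pred_depth[mask], MIN_DEPTH, MAX_DEPTH)
    gt = gt_depth[mask]
    return float(np.mean(np.abs(gt - pred) / gt))
```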
Once again, thank you for making the code/weights public. I'm very eager to experiment with the architecture and see its full capabilities. 😊