hisfog / SfMNeXt-Impl

[AAAI 2024] Official implementation of "SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation", and more.
MIT License
85 stars 12 forks

Problem about reproducing the results #13

Open zsz-pro opened 11 months ago

zsz-pro commented 11 months ago

Nice work! I would appreciate your guidance on the following two questions:

1. This is the result of my testing with the latest code using the 'kitti-resnet50-640*192' weights. How do you perceive the errors introduced by shadows? [image]
2. What potential issues do you think might arise when using this depth estimation result for novel view synthesis? It seems that this adaptive binning approach is very friendly for NVS (novel view synthesis).

hisfog commented 11 months ago

Is this an error map or a depth map?

zsz-pro commented 11 months ago

I ran test_simple_SQL_config.py without any revision. It is supposed to be a depth map, isn't it? [image]

hisfog commented 11 months ago

I think you did not load the pre-trained weights; you should set --load_pretrained_model in your args_file. Sorry for that, I changed test_simple_SQL_config.py but did not make the corresponding modifications in args_files.
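For anyone hitting the same blank or garbled output: the fix is just adding the flag to the args file. A minimal fragment of what that might look like (the weights path below is a placeholder, and --load_weights_folder is the flag used in other args files in this thread — check your repo version's option names):

```
--load_pretrained_model
--load_weights_folder /path/to/kitti-resnet50-640x192/
```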

zsz-pro commented 11 months ago

Yeah, thanks for reminding me! For the convenience of others who might need it, here are the key arguments in args_files for the 'kitti-resnet50-640*192' weights:

```
--backbone resnet
--num_features 256
--dim_out 64
--batch_size 16
--model_dim 32
--patch_size 16
--query_nums 64
```

I have one more question to ask: What potential issues do you think might arise when using this depth estimation result for novel view synthesis? It seems that this adaptive binning approach is very friendly for NVS (novel view synthesis).

zsz-pro commented 11 months ago

I trained for 20 epochs, and the evaluation results are quite different from those obtained with the weights you provided (kitti-resnet50-640*192). Could you give me some advice? [image]

hisfog commented 11 months ago

Can you provide your training args (in your args_file)?

zsz-pro commented 11 months ago

```
--data_path /mnt/kitti_data/raw/
--dataset kitti
--eval_split eigen
--height 192
--width 640
--batch_size 16
--model_dim 32
--patch_size 16
--query_nums 64
--eval_mono
--post_process
--min_depth 0.01
--max_depth 80.0
--save_pred_disps
--backbone resnet
--num_features 256
--dim_out 64
```

I also changed dim_feedforward from 1024 to 512 in Depth_Decoder_QueryTr:

```python
encoder_layers = nn.modules.transformer.TransformerEncoderLayer(embedding_dim, num_heads, dim_feedforward=512)
```

because the weights you provided (kitti-resnet50-640*192) are only compatible with dim_feedforward=512.
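The shape mismatch behind this change is easy to see from parameter counts: the feed-forward block inside a TransformerEncoderLayer holds two linear layers of shapes (d_ff, d_model) and (d_model, d_ff), so a checkpoint saved with dim_feedforward=512 cannot be loaded into a layer built with 1024. A quick illustrative check (the d_model value mirrors --model_dim 32 above; this is plain arithmetic, not the repo's code):

```python
# Parameter count of the linear1/linear2 pair inside a Transformer FFN.
# A checkpoint and a model must agree on d_ff, otherwise load_state_dict
# fails with a size-mismatch error on these weights.

def ffn_param_count(d_model: int, d_ff: int) -> int:
    linear1 = d_model * d_ff + d_ff      # weight (d_ff, d_model) + bias
    linear2 = d_ff * d_model + d_model   # weight (d_model, d_ff) + bias
    return linear1 + linear2

d_model = 32  # --model_dim in the args above
print(ffn_param_count(d_model, 512))   # dim_feedforward=512  -> 33312
print(ffn_param_count(d_model, 1024))  # dim_feedforward=1024 -> 66592
```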

seoAlexer commented 11 months ago

I think the latest code seems to be for indoor scenes (maybe?). But I can reproduce the author's results on KITTI using an older version of the code. [image] My git hash is 6a1e997f97caef8de080bb2873f71cfbad9a8047; you can switch to this version by

```
git checkout 6a1e997f97caef8de080bb2873f71cfbad9a8047
```

Hope it can help you

mengtanZ commented 11 months ago

> I think the latest code seems to be for indoor scenes (maybe?). But I can reproduce the author's results on KITTI using an older version of the code. [image] My git hash is 6a1e997f97caef8de080bb2873f71cfbad9a8047; you can switch to this version by
>
> git checkout 6a1e997f97caef8de080bb2873f71cfbad9a8047
>
> Hope it can help you

I cannot reproduce the author's results like you did. Can you provide the details and args of your training? For example, did you just use the args_files\args_res50_kitti_192x640_train.txt provided by the author, or did you make some changes with reference to args_files\hisfog\kitti\resnet_192x640.txt? And did you add --backbone resnet as in the paper, or did you just use the default Unet? I think these may be the reasons I cannot get the right result as you did. Thank you!

seoAlexer commented 11 months ago

My args file is args_files\hisfog\kitti\resnet_320x1024.txt, and the backbone is --backbone resnet_lite. Since args_files\args_res50_kitti_192x640_train.txt does not set --use_stereo, I think it is for monocular training only.

mengtanZ commented 11 months ago

> My args file is args_files\hisfog\kitti\resnet_320x1024.txt, and the backbone is --backbone resnet_lite. Since args_files\args_res50_kitti_192x640_train.txt does not set --use_stereo, I think it is for monocular training only.

Thank you very much! So you used the testing args_file to train, without changes (such as --use_stereo), right? I will have a try.

mengtanZ commented 11 months ago

> My args file is args_files\hisfog\kitti\resnet_320x1024.txt, and the backbone is --backbone resnet_lite. Since args_files\args_res50_kitti_192x640_train.txt does not set --use_stereo, I think it is for monocular training only.

I've tried your suggestion, but it doesn't seem to be working. Here is my result after 20 epochs:

[image]

And my args_file is [image]. I wonder if it differs from yours in any way? @seoAlexer @hisfog @zsz-pro

seoAlexer commented 11 months ago

I use the code with git hash 6a1e997f97caef8de080bb2873f71cfbad9a8047. I do not set --diff_lr, and --min_depth is set to 0.001.

hisfog commented 11 months ago

> I have one more question to ask: What potential issues do you think might arise when using this depth estimation result for novel view synthesis? It seems that this adaptive binning approach is very friendly for NVS (novel view synthesis).

Using a depth map and a differentiable warp for NVS may fail to synthesize occluded areas. But I'm not entirely sure why you're using a depth map for NVS.
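The occlusion issue mentioned here comes from the warp itself: each source pixel is back-projected with its depth, moved by the relative camera pose, and projected into the target view, so regions visible only in the target view receive no source pixel. A minimal single-pixel sketch of that reprojection (intrinsics and pose are made-up illustrative values, not KITTI calibration):

```python
# Back-project pixel (u, v) with depth, apply a translation-only relative
# pose, and project into the target view. This is the core of the
# "differentiable warp" used for view synthesis; occluded target regions
# are simply never hit by any source pixel.

def reproject(u, v, depth, fx, fy, cx, cy, t):
    # Back-project to a 3D point in camera coordinates.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Apply the (translation-only) relative camera motion.
    x, y, z = x + t[0], y + t[1], z + t[2]
    # Project into the target image plane.
    return fx * x / z + cx, fy * y / z + cy

# A pixel at the principal point stays put under pure forward motion:
u2, v2 = reproject(320.0, 96.0, depth=10.0,
                   fx=720.0, fy=720.0, cx=320.0, cy=96.0,
                   t=(0.0, 0.0, 0.5))
print(u2, v2)  # -> 320.0 96.0
```

Off-center pixels drift toward the principal point under forward motion, which is why wrong depths show up as ghosting when warping between frames.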

indu1ge commented 10 months ago

I cannot reproduce the results either. I tried the code with git hash 6a1e997f97caef8de080bb2873f71cfbad9a8047 using the same configuration, and my abs_rel is 0.108. I wonder if it has something to do with the environment. If so, could you please share more details about setting up the environment?

> My args file is args_files\hisfog\kitti\resnet_320x1024.txt, and the backbone is --backbone resnet_lite. Since args_files\args_res50_kitti_192x640_train.txt does not set --use_stereo, I think it is for monocular training only.
>
> I've tried your suggestion, but it doesn't seem to be working. Here is my result after 20 epochs: [image] And my args_file is [image]. I wonder if it differs from yours in any way? @seoAlexer @hisfog @zsz-pro
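For anyone comparing numbers like the abs_rel values quoted in this thread, the standard Eigen-split depth metrics are computed as follows — a minimal pure-Python sketch over per-pixel lists (real evaluation code uses numpy over masked, median-scaled images):

```python
import math

# Standard monocular depth metrics: abs_rel, sq_rel, rmse, and the
# delta-threshold accuracies a1/a2/a3 (fraction of pixels whose ratio
# max(gt/pred, pred/gt) is below 1.25, 1.25^2, 1.25^3).

def depth_metrics(gt, pred):
    n = len(gt)
    abs_rel = sum(abs(g - p) / g for g, p in zip(gt, pred)) / n
    sq_rel  = sum((g - p) ** 2 / g for g, p in zip(gt, pred)) / n
    rmse    = math.sqrt(sum((g - p) ** 2 for g, p in zip(gt, pred)) / n)
    thresh  = [max(g / p, p / g) for g, p in zip(gt, pred)]
    a = [sum(t < 1.25 ** k for t in thresh) / n for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, a

# Toy ground-truth / prediction depths in metres:
gt   = [10.0, 20.0, 40.0]
pred = [11.0, 18.0, 40.0]
abs_rel, sq_rel, rmse, (a1, a2, a3) = depth_metrics(gt, pred)
print(round(abs_rel, 4))  # -> 0.0667
```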

hisfog commented 10 months ago

@indu1ge
1. Do not use --diff_lr unless you have loaded a well pre-trained pose_net.
2. Based on my experience, a min_depth of 0.001 might be better.
3. Additionally, the best results may not necessarily occur at 20 epochs; they could appear earlier, such as at 15 epochs.

Shaw-Way commented 9 months ago

I tried training with ResNet18 as the backbone for 20 epochs, with the following settings:

```
--data_path ./data/kitti_raw
--dataset kitti
--eval_split eigen
--height 192
--width 640
--batch_size 16
--num_epochs 25
--model_dim 32
--patch_size 16
--query_nums 120
--scheduler_step_size 15
--eval_mono
--post_process
--min_depth 0.001
--max_depth 80.0
--backbone resnet18_lite
```

But I didn't get good results. This is the best result I have got: [image] I don't know what went wrong.

hisfog commented 9 months ago

@Shaw-Way The first SSL training, especially monocular-only training, may not yield optimal results, and this is normal, as PoseNet might not have converged yet. You can refer to the experimental setups of other successful replications, e.g. https://github.com/hisfog/SfMNeXt-Impl/issues/13#issuecomment-1754337890, https://github.com/hisfog/SfMNeXt-Impl/issues/26#issuecomment-1840013244

Shaw-Way commented 9 months ago

> @Shaw-Way The first SSL training, especially monocular-only training, may not yield optimal results, and this is normal, as PoseNet might not have converged yet. You can refer to the experimental setups of other successful replications, e.g. #13 (comment), #26 (comment)

This result shows a significant gap from the metrics in your paper. Did you achieve those metrics directly through SSL training, or were there additional fine-tuning steps?

hisfog commented 9 months ago

> This result shows a significant gap from the metrics in your paper. Did you achieve those metrics directly through SSL training, or were there additional fine-tuning steps?

I only did supervised fine-tuning for the ConvNeXt-L model. Other results are produced by SSL training only.

Shaw-Way commented 9 months ago

> This result shows a significant gap from the metrics in your paper. Did you achieve those metrics directly through SSL training, or were there additional fine-tuning steps?
>
> I only did supervised fine-tuning for the ConvNeXt-L model. Other results are produced by SSL training only.

Thanks for your reply. Do you think training for more epochs would be helpful, or are there any problems with my settings?

hisfog commented 9 months ago

> Thanks for your reply. Do you think training for more epochs would be helpful, or are there any problems with my settings?

More epochs might not be helpful. For settings, you can refer to https://github.com/hisfog/SfMNeXt-Impl/issues/13#issuecomment-1808457711, https://github.com/hisfog/SfMNeXt-Impl/issues/26#issuecomment-1840013244, and https://github.com/hisfog/SfMNeXt-Impl/issues/13#issuecomment-1752824019. Hope that can help you.

xxxqqqqwww commented 6 months ago

Hello, I would like to reproduce the SSL results of kitti-resnet50-1024*320, based on args_files\hisfog\kitti\resnet_320x1024.txt. I did not set --diff_lr, and other parameters remained unchanged. I did not use any pre-trained weights; the resulting abs_rel is larger than 0.1, and the other metrics are not optimal either. As you mentioned above, it is normal for the first self-supervised training not to achieve the best result. I would like to know the training details after the first self-supervised training that lead to abs_rel = 0.082. After the first round of training is completed, is a second (or further) round of training needed? Does the second training need to use the weights from one of the epochs of the first training as pre-trained weights? Is it necessary to load pose.pth, encoder.pth, and depth.pth together, or only pose.pth while setting --diff_lr?

XIAN-XIAN-X commented 5 months ago

> Hello, I would like to reproduce the SSL results of kitti-resnet50-1024*320, based on args_files\hisfog\kitti\resnet_320x1024.txt. I did not set --diff_lr, and other parameters remained unchanged. I did not use any pre-trained weights; the resulting abs_rel is larger than 0.1, and the other metrics are not optimal either. As you mentioned above, it is normal for the first self-supervised training not to achieve the best result. I would like to know the training details after the first self-supervised training that lead to abs_rel = 0.082. After the first round of training is completed, is a second (or further) round of training needed? Does the second training need to use the weights from one of the epochs of the first training as pre-trained weights? Is it necessary to load pose.pth, encoder.pth, and depth.pth together, or only pose.pth while setting --diff_lr?

Hello! I have the same confusion as you. Have you solved it?

XIAN-XIAN-X commented 5 months ago

> I tried training with ResNet18 as the backbone for 20 epochs, with the following settings: --data_path ./data/kitti_raw --dataset kitti --eval_split eigen --height 192 --width 640 --batch_size 16 --num_epochs 25 --model_dim 32 --patch_size 16 --query_nums 120 --scheduler_step_size 15 --eval_mono --post_process --min_depth 0.001 --max_depth 80.0 --backbone resnet18_lite. But I didn't get good results. This is the best result I have got: [image] I don't know what went wrong.

Hello! I have the same confusion as you. Have you solved it? @Shaw-Way

BlueEg commented 5 months ago

> I cannot reproduce the results either. I tried the code with git hash 6a1e997f97caef8de080bb2873f71cfbad9a8047 using the same configuration, and my abs_rel is 0.108. I wonder if it has something to do with the environment. If so, could you please share more details about setting up the environment?
>
> My args file is args_files\hisfog\kitti\resnet_320x1024.txt, and the backbone is --backbone resnet_lite. Since args_files\args_res50_kitti_192x640_train.txt does not set --use_stereo, I think it is for monocular training only.
>
> I've tried your suggestion, but it doesn't seem to be working. Here is my result after 20 epochs: [image] And my args_file is [image]. I wonder if it differs from yours in any way? @seoAlexer @hisfog @zsz-pro

Hello, I ran into the same problem as you. I want to ask whether you used JPEG images to train the model, or '.png'. Looking through the code, the author may have used a PNG dataset to train and validate the model.

hwlf commented 1 month ago

> My args file is args_files\hisfog\kitti\resnet_320x1024.txt, and the backbone is --backbone resnet_lite. Since args_files\args_res50_kitti_192x640_train.txt does not set --use_stereo, I think it is for monocular training only.

Hello, I have a question. When I don't use a pre-trained PoseNet and don't set --use_stereo, the results are very poor. What is the reason for this?

abs_rel | sq_rel | rmse | rmse_log | a1 | a2 | a3
--- | --- | --- | --- | --- | --- | ---
0.444 | 4.749 | 12.046 | 0.586 | 0.303 | 0.560 | 0.767

```
--data_path /home/Clandy/data
--log_dir /home/Clandy/train_models/tree
--model_name res_099
--dataset kitti
--eval_split eigen
--backbone resnet
--height 192
--width 640
--batch_size 16
--num_epochs 25
--scheduler_step_size 15
--num_layers 50
--num_features 256
--model_dim 32
--patch_size 16
--dim_out 64
--query_nums 64
--min_depth 0.001
--max_depth 80.0
--eval_mono
--load_weights_folder /home/Clandy/train_models/tree/res_099/models/weights_24
--post_process
```
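One subtle knob in these args is --min_depth (0.01 vs 0.001 comes up repeatedly in this thread). In monodepth2-style pipelines it bounds the disparity rescaling, so changing it changes every predicted depth, not just the near clip. A sketch of the common convention (not necessarily this repo's exact code):

```python
# The network's sigmoid disparity output in [0, 1] is rescaled into
# [1/max_depth, 1/min_depth] and inverted to get metric-ish depth.
# Defaults here mirror the --min_depth / --max_depth flags above.

def disp_to_depth(disp, min_depth=0.001, max_depth=80.0):
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp

print(disp_to_depth(1.0))  # -> 0.001 (nearest representable depth)
print(disp_to_depth(0.0))  # -> 80.0  (farthest representable depth)
```

With --min_depth 0.01 the same sigmoid output maps to a different depth, which is one reason runs with different min_depth values are hard to compare directly.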