lelimite4444 / BridgeDepthFlow

Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence, CVPR 2019

About training epochs and time #1

Open zmlshiwo opened 5 years ago

zmlshiwo commented 5 years ago

Hi, excellent work! It is a good idea to combine stereo matching and optical flow in a single network. I have some questions. I saw that the default number of epochs is set to 80, but training for 80 epochs takes a very long time. How many epochs did you train for? How long does it take to train the entire model? Best, Zhai

lelimite4444 commented 5 years ago

Thank you for asking. I trained the Monodepth model for 80 epochs, but 40 epochs already gives comparable results and takes about 2 days. For PWC-Net, because of the larger input size (832 x 256 compared to 512 x 256 for Monodepth) and the larger number of parameters, it takes about 4 days.

zmlshiwo commented 5 years ago

Thank you for your reply. I trained on a 1080 Ti and it was slow. Does this code support CUDA 10?

lelimite4444 commented 5 years ago

I've just tried it and it works on CUDA 10. One error you may run into is: ModuleNotFoundError: No module named 'correlation_cuda'. Solution: under models/networks/correlation_package, run

python3 setup.py build
python3 setup.py install
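
A quick way to confirm the extension was built and installed correctly is to try importing it. A minimal sketch; the module name correlation_cuda is taken from the error message above:

# Sanity check that the compiled CUDA extension is importable.
try:
    import correlation_cuda
    print("correlation_cuda imported successfully")
except ImportError as e:
    print("correlation_cuda still missing:", e)
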
zmlshiwo commented 5 years ago

Ok, I will try this code on a new TITAN RTX GPU with CUDA 10. Is the PyTorch version still 1.0.0 with CUDA 10? Also, I trained the network and got an error about the data stream. After training for a while, maybe about 8000 iterations, I got the following error: "OSError: unrecognized data stream contents when reading image file." My Python version is 3.5, and I changed 'jpg' to 'png' in kitti_train_files_png_4frames.txt because my data is in PNG format.

lelimite4444 commented 5 years ago

Yes, still PyTorch 1.0.0 with CUDA 10. I tested it for 2 epochs and no error occurred. Maybe you could paste the full error message here, or refer to https://github.com/mrharicot/monodepth to convert the PNGs to JPEG.
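
One way to narrow this down is to scan the file list for images that PIL cannot decode, since that OSError usually comes from a truncated or corrupted image file. Here is a minimal sketch; the dataset root is a placeholder, so point it at your own --data_path and --filenames_file:

# Report every image in the training file list that PIL fails to decode.
import os
from PIL import Image

data_path = "/path/to/KITTI_raw_data/"                       # placeholder dataset root
filenames_file = "./utils/filenames/kitti_train_files_png_4frames.txt"

with open(filenames_file) as f:
    for line in f:
        for rel_path in line.split():                        # each line may list several frames
            full_path = os.path.join(data_path, rel_path)
            try:
                with Image.open(full_path) as img:
                    img.load()                               # force a full decode
            except (OSError, FileNotFoundError) as e:
                print(full_path, "->", e)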

zmlshiwo commented 5 years ago

Thank you. I will try it.

zmlshiwo commented 5 years ago

@lelimite4444 Hi, I am training the Monodepth network on a TITAN RTX GPU with a batch size of 2 for 80 epochs. The input resolution is 512 x 256. I find that one epoch takes 2.25 hours, so 80 epochs will take 2.25 * 80 / 24 = 7.5 days. I also find that only about 6 GB of GPU memory is used. So, why not increase the batch size and reduce the number of epochs? Thank you.

lelimite4444 commented 5 years ago

The input resolution of PWC-Net is 832 x 256, which takes about 10 GB, so I just use the same batch size for both. But you can also try a batch size of 3 to reduce the training time. Thanks for the suggestion.

zmlshiwo commented 5 years ago

@lelimite4444 Hi, I have trained the Monodepth network for 80 epochs and get the following results.

Depth, on the KITTI 2015 stereo dataset:
abs_rel 0.0686, sq_rel 0.8439, rms 4.372, log_rms 0.150, d1_all 9.455, a1 0.941, a2 0.978, a3 0.989

Flow, on KITTI 2012:
EPE-all 2.7403, EPE-noc 1.5549

Flow, on KITTI 2015:
EPE-all 8.1966, Fl-all 0.3024, EPE-noc 5.4894, Fl-noc 0.2451

Both the depth and flow results are worse than those in your paper, so maybe I missed some details. I use this command to start training:

python3 train.py --data_path /home/ubuntu/Data/KITTI_raw_data/ --filenames_file ./utils/filenames/kitti_train_files_png_4frames.txt --batch_size 2 --num_epochs 80 --checkpoint_path /home/ubuntu/Data/Bridge_depth_flow_model/init/ --type_of_2warp 2

Can you spot any problems with my training?

Best, Zhai

zmlshiwo commented 5 years ago

@lelimite4444 Also, in your paper you set the five hyper-parameters (alpha, beta, Lsm, Lr, L2warp) to (0.85, 10, 10, 0.5, 0.2). In train.py, alpha is set to 0.85, beta to 10, Lr to 0.5 and Lsm to 10, which matches the paper.
However, I find that L2warp is set to 0.1 in this line:
loss += 0.1 * sum([warp_2(warp2_est_4[i], left_pyramid[i][[6,7]], mask_4[i], args) for i in range(4)])
This differs from the 0.2 in your paper. Could this difference in the L2warp setting be why I get worse results?
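
For reference, here is a minimal sketch of how these weights could combine into the total objective, assuming the terms are simply summed with the scalar weights discussed above (the function and variable names are hypothetical, not the ones in train.py, and beta = 10 is left out because its role isn't discussed in this thread):

# Hypothetical illustration of the loss weighting discussed above.
# ap_loss is assumed to already mix SSIM and L1 with alpha = 0.85 internally.
def total_loss(ap_loss, smooth_loss, lr_loss, warp2_loss,
               w_sm=10.0, w_lr=0.5, w_2warp=0.2):
    # w_2warp = 0.2 follows the paper; the quoted train.py line uses 0.1 instead.
    return ap_loss + w_sm * smooth_loss + w_lr * lr_loss + w_2warp * warp2_loss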

lelimite4444 commented 5 years ago

@zmlshiwo Actually, I first trained on stereo and flow without the 2warp modules and used that as a pretrained model. Starting from this better initialization, adding 2warp then improves the performance.

I've tried both 0.1 and 0.2 for L2warp; it doesn't make much difference.

zmlshiwo commented 5 years ago

@lelimite4444 Thank you. So you mean that you first train a flow + stereo only model for 80 epochs, then use it as the initialization and train for another 80 epochs with 2warp, so the whole process is about 160 epochs?

lelimite4444 commented 5 years ago

I trained the pretrained model for 40 epochs, so the total is 120 epochs. But I think 40 + 40 is enough; maybe you can use TensorBoard to check how well your model performs.
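
If the training script isn't already writing TensorBoard logs, here is a minimal sketch with tensorboardX (a common choice alongside PyTorch 1.0; the log directory and tag name are made up):

# Minimal tensorboardX logging sketch.
from tensorboardX import SummaryWriter

writer = SummaryWriter("runs/bridge_depth_flow")             # hypothetical log directory
for step, loss_value in enumerate([0.9, 0.7, 0.5]):          # stand-in for real training losses
    writer.add_scalar("loss/total", loss_value, step)
writer.close()
# View with: tensorboard --logdir runs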

zmlshiwo commented 5 years ago

@lelimite4444 Ok, thank you, I understand. Last time I did not use a pretrained model; I just trained for 80 epochs with 2warp and no initialization.

zmlshiwo commented 5 years ago

@lelimite4444 Hi, one more question. Are the results of the Ours (flow + stereo) model in Tables 1, 2 and 3 from training for only 40 epochs (i.e., training flow + stereo without 2warp for 40 epochs)?

wanghao14 commented 4 years ago

@lelimite4444 Hi, I want to know the hyper-parameter settings you use when starting from the model pretrained on stereo and flow without 2warp. Are they consistent with those used when training from scratch?