princeton-vl / DROID-SLAM

BSD 3-Clause "New" or "Revised" License
1.66k stars 273 forks

Error when using default weights "droid.pth" as pretrained weights #52

Open YznMur opened 2 years ago

YznMur commented 2 years ago

Hi @zachteed @xhangHU, I couldn't use your weights "droid.pth" for training. I get this error:

Traceback (most recent call last):
  File "train.py", line 189, in <module>
    mp.spawn(train, nprocs=args.gpus, args=(args,))
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/trainer/droidslam/train.py", line 60, in train
    model.load_state_dict(torch.load(args.ckpt))
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1482, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
        size mismatch for module.update.weight.2.weight: copying a param with shape torch.Size([3, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([2, 128, 3, 3]).
        size mismatch for module.update.weight.2.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).
        size mismatch for module.update.delta.2.weight: copying a param with shape torch.Size([3, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([2, 128, 3, 3]).
        size mismatch for module.update.delta.2.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]). 

I am trying to train the model on KITTI. These are the parameters I am using:

 clip=2.5,  edges=24, fmax=96.0, fmin=8.0, gpus=4, iters=15, lr=5e-05, n_frames=7, noise=False, restart_prob=0.2, scale=False, steps=250000, w1=10.0, w2=0.01, w3=0.05, world_size=4
YznMur commented 2 years ago

I figured it out: the checkpoint's update heads have 3 output channels while the training code builds them with 2, so I changed the final conv layers in class UpdateModule(nn.Module):

    self.weight = nn.Sequential(
        nn.Conv2d(128, 128, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(128, 3, 3, padding=1),
        GradientClip(),
        nn.Sigmoid())

also

    self.delta = nn.Sequential(
        nn.Conv2d(128, 128, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(128, 3, 3, padding=1),
        GradientClip())

Any advice about training the model on KITTI, or a training config, would be appreciated!
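As an alternative to editing the model definition, you can load only the checkpoint entries whose shapes match and leave the mismatched heads at their fresh initialization. This is just a sketch; the helper name load_matching is mine, not from the repo (note that when loading into a DistributedDataParallel model, as train.py does, the keys carry the "module." prefix on both sides, so no prefix handling is needed here):

```python
import torch
import torch.nn as nn

def load_matching(model, ckpt_path):
    """Load only the checkpoint entries whose shapes match the model.

    Mismatched entries (e.g. the 3- vs 2-channel update heads) are
    skipped and their keys returned for inspection.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    matched, skipped = {}, []
    for k, v in state.items():
        if k in own and own[k].shape == v.shape:
            matched[k] = v
        else:
            skipped.append(k)
    # strict=False tolerates the keys we deliberately left out
    model.load_state_dict(matched, strict=False)
    return skipped
```

Whether training works well with randomly initialized update heads is a separate question; this only gets past the size-mismatch error.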

felipesce commented 1 year ago

Why do we need to change the model shape for training vs inference?

CH-901 commented 1 year ago

@YznMur Hi, how did you train on KITTI, with RGB or RGB-D? If RGB-D, how did you obtain depth images for KITTI? Thank you!

YznMur commented 1 year ago

@CH-901 I used RGB-D. You can find depth images for the KITTI sequences here: https://www.cvlibs.net/datasets/kitti/eval_depth_all.php This mapping from odometry sequences to raw drives may help you:

Odometry Nr. Raw sequence name Start End
00: 2011_10_03_drive_0027 000000 004540
01: 2011_10_03_drive_0042 000000 001100
02: 2011_10_03_drive_0034 000000 004660
03: 2011_09_26_drive_0067 000000 000800
04: 2011_09_30_drive_0016 000000 000270
05: 2011_09_30_drive_0018 000000 002760
06: 2011_09_30_drive_0020 000000 001100
07: 2011_09_30_drive_0027 000000 001100
08: 2011_09_30_drive_0028 001100 005170
09: 2011_09_30_drive_0033 000000 001590
10: 2011_09_30_drive_0034 000000 001200
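The table above can be captured as a small lookup for building file lists. A sketch (the helper is mine; raw-KITTI frame filenames are zero-padded to 10 digits, unlike the 6-digit odometry indices in the table):

```python
# odometry sequence -> (raw drive, first frame, last frame), from the table above
KITTI_ODOM_TO_RAW = {
    "00": ("2011_10_03_drive_0027", 0, 4540),
    "01": ("2011_10_03_drive_0042", 0, 1100),
    "02": ("2011_10_03_drive_0034", 0, 4660),
    "03": ("2011_09_26_drive_0067", 0, 800),
    "04": ("2011_09_30_drive_0016", 0, 270),
    "05": ("2011_09_30_drive_0018", 0, 2760),
    "06": ("2011_09_30_drive_0020", 0, 1100),
    "07": ("2011_09_30_drive_0027", 0, 1100),
    "08": ("2011_09_30_drive_0028", 1100, 5170),
    "09": ("2011_09_30_drive_0033", 0, 1590),
    "10": ("2011_09_30_drive_0034", 0, 1200),
}

def raw_frame_ids(seq):
    """Raw-data frame name stems for an odometry sequence (10-digit padding)."""
    drive, start, end = KITTI_ODOM_TO_RAW[seq]
    return drive, [f"{i:010d}" for i in range(start, end + 1)]
```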
CH-901 commented 1 year ago

@YznMur Thank you for your reply, I found the depth data. If only RGB training is used, what value should be passed for disp0 in the training call model(Gs, images, disp0, intrinsics0, graph, num_steps=args.iters, fixedp=2)?

YznMur commented 1 year ago

There are also the KITTI 2015 stereo disparities; you can find them here: http://www.cvlibs.net/download.php?file=data_scene_flow.zip
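For reference, pixel disparities from a stereo pair convert to inverse depth via 1/Z = d / (fx * B). A sketch with typical KITTI numbers as defaults (read the real fx and baseline from the per-sequence calibration files; also note the KITTI stereo PNGs store disparity as uint16 scaled by 256, so divide by 256 when reading):

```python
import numpy as np

def disparity_to_inverse_depth(disp_px, fx=721.5377, baseline=0.54):
    """Stereo disparity (pixels) -> inverse depth (1/m): 1/Z = d / (fx * B).

    The default fx and baseline are typical KITTI values, used here only
    as an illustration. Invalid pixels (d <= 0) map to 0 inverse depth.
    """
    disp = np.asarray(disp_px, dtype=np.float32)
    return np.where(disp > 0, disp / (fx * baseline), 0.0).astype(np.float32)
```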

CH-901 commented 1 year ago

@YznMur Thanks. What RPE did the model reach after training on the KITTI dataset in your experiment? My output seems random even though the loss is decreasing.

YznMur commented 1 year ago

Hi @CH-901 About the loss, I faced the same problem, the results were meaningless :(

YznMur commented 1 year ago

Hi @CH-901 Did you manage to solve the problem with training?

CH-901 commented 1 year ago

I haven't solved this problem @YznMur

LinMenwill commented 1 year ago

Hi @YznMur I downloaded the depth images for the KITTI sequences from https://www.cvlibs.net/datasets/kitti/eval_depth_all.php, but each sequence of depth images seems to be missing the files 000000.png to 000004.png. Have you encountered the same issue?

YznMur commented 1 year ago

Hi @LinMenwill. Yes, the depth annotations omit the first few frames of each drive. Just take this into account when you are preparing the train and eval lists.
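One simple way to handle that when building the lists is to keep only the frames present in both folders. A sketch (the helper name is mine, not from the repo):

```python
import os

def paired_frames(image_dir, depth_dir, ext=".png"):
    """Keep only the frames that exist in both the image and depth folders.

    The KITTI depth annotations omit the first few frames of each drive,
    so intersecting the two listings avoids dangling entries.
    """
    images = {f for f in os.listdir(image_dir) if f.endswith(ext)}
    depths = {f for f in os.listdir(depth_dir) if f.endswith(ext)}
    return sorted(images & depths)
```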

LinMenwill commented 10 months ago

@YznMur Thanks

LinMenwill commented 10 months ago

@YznMur I found that the depth data downloaded from https://www.cvlibs.net/datasets/kitti/eval_depth_all.php is missing the sequence 03 (2011_09_26_drive_0067), and the depth data appears to be sparse. How did you convert this sparse depth into a denser representation, or did you use the sparse depth for training directly?
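For what it's worth, a crude common baseline for densifying sparse depth is nearest-neighbor filling. This is only a sketch of that idea, not something anyone in this thread confirmed using:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def densify_nearest(depth):
    """Fill missing (zero) pixels of a sparse depth map with the value of
    the nearest valid pixel - a crude nearest-neighbor baseline."""
    depth = np.asarray(depth, dtype=np.float32)
    valid = depth > 0
    # for every pixel, indices of the nearest valid pixel
    _, idx = distance_transform_edt(~valid, return_indices=True)
    return depth[idx[0], idx[1]]
```

More careful options (e.g. edge-aware interpolation or a depth-completion network) generally give better results, but this is enough to sanity-check a training pipeline.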