ucbdrive / hd3

Code for Hierarchical Discrete Distribution Decomposition for Match Density Estimation (CVPR 2019)
BSD 3-Clause "New" or "Revised" License

error when run train.sh #24

Closed AITech-D closed 4 years ago

AITech-D commented 4 years ago

Hi, thank you for sharing your code and the pre-trained files! I was trying to retrain the network for stereo on FlyingThings3D, starting from the pre-trained file hd3sc_things-57947496.pth (trained on FlyingThings3D only).

I ran train.sh as follows:

    CUDA_VISIBLE_DEVICES=3 python -u train.py \
      --dataset_name=FlyingThings3D \
      --train_root=/home/share/34916/SceneFlowDataset/FlyingThings3D \
      --train_list=lists/FlyingThings3D_trainstereo.txt \
      --val_root=/home/share/34916/SceneFlowDataset/FlyingThings3D \
      --val_list=lists/FlyingThings3D_teststereo.txt \
      --task=stereo \
      --base_lr=0.0002 \
      --encoder=dlaup \
      --decoder=hda \
      --context \
      --workers=4 \
      --epochs=200 \
      --batch_size=4 \
      --evaluate \
      --batch_size_val=1 \
      --pretrain=./outputs/model/model_zoo/hd3sc_things-57947496.pth \
      --visual_freq=20 \
      --save_step=50 \
      --save_path=./outputs/model

But I got erroneous output: the outputs are all close to zero. I printed the intermediate tensor in hd3/models/hd3net.py. The code is as follows:

        decoder = getattr(self, 'Decoder_' + str(l))
        prob_map, up_feat = decoder(decoder_input)
        curr_vect = density2vector(prob_map, self.dim, True)

        if l > 0:
            curr_vect += up_curr_vect                

        if self.task == 'stereo':
            curr_vect = torch.clamp(curr_vect, max=0)
            print("curr_vect mean: ", torch.mean(curr_vect))
            print("++++++++++++++++++++++")

For the earlier steps the mean of curr_vect looks normal, as shown below:

    [2019-11-13 02:14:00,322 INFO train.py line 259 121140] Loss total 42.9904
    curr_vect mean: tensor(-1.0633, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-2.1657, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-4.5257, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-9.1923, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-18.4706, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-36.9369, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    [2019-11-13 02:14:01,065 INFO train.py line 256 121140] Epoch: [1/200][7/5038] Data 0.001 (0.145) Batch 0.743 (3.269) Remain 914:55:07.

But after a few hundred steps, the mean of curr_vect is almost zero, as shown below:

    [2019-11-13 01:54:30,441 INFO train.py line 259 117631] Loss total 2.5231
    curr_vect mean: tensor(-0.4105, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-0.0049, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-0.0027, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-0.0019, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-0.0009, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    curr_vect mean: tensor(-0.0006, device='cuda:0', grad_fn=)
    ++++++++++++++++++++++
    [2019-11-13 01:54:31,190 INFO train.py line 256 117631] Epoch: [1/200][473/5038] Data 0.001 (0.003) Batch 0.750 (0.788) Remain 220:32:07.

I am very confused. I started training from hd3sc_things-57947496.pth, yet the output did not improve; it got worse after a few hundred steps.

AITech-D commented 4 years ago

In addition, the loss for the erroneous output is lower than for the normal output:

    Epoch: [1/200][7/5038]: Loss total 42.9904
    Epoch: [1/200][473/5038]: Loss total 2.5231

yzcjtr commented 4 years ago

What's your training dataset? It seems your training/validation lists are not identical to ours.

AITech-D commented 4 years ago

My dataset is FlyingThings3D, identical to yours. I did not change any code except in data/hd3data.py, as shown below:

    def read_gen(file_name, mode):
        ext = splitext(file_name)[-1]
        if mode == 'image':
            assert ext in ['.png', '.jpeg', '.ppm', '.jpg']
            # ======Here changed!
            data = Image.open(file_name)
            data = data.convert('RGB')
            return data
            # return Image.open(file_name)
            # ======Here changed!
        elif mode == 'flow':
            assert ext in ['.flo', '.png', '.pfm']
            return fl.read_flow(file_name)
        elif mode == 'stereo':
            assert ext in ['.png', '.pfm']
            return fl.read_disp(file_name)
        else:
            raise ValueError('Unknown mode {}'.format(mode))

I used the same dataset that your pre-trained file hd3sc_things-57947496.pth was trained on. Can you help me, sir?

AITech-D commented 4 years ago

I also trained the hd3 model from scratch on FlyingThings3D, and the error is the same: the loss dropped quickly, the accuracy did not improve, and the output is all zero. train.sh is as below:

    CUDA_VISIBLE_DEVICES=3 python -u train.py \
      --dataset_name=FlyingThings3D \
      --train_root=/home/share/34916/SceneFlowDataset/FlyingThings3D \
      --train_list=lists/FlyingThings3D_trainstereo.txt \
      --val_root=/home/share/34916/SceneFlowDataset/FlyingThings3D \
      --val_list=lists/FlyingThings3D_teststereo.txt \
      --task=stereo \
      --base_lr=0.0002 \
      --encoder=dlaup \
      --decoder=hda \
      --context \
      --workers=4 \
      --epochs=200 \
      --batch_size=4 \
      --evaluate \
      --batch_size_val=1 \
      --pretrain_base=./outputs/model/model_zoo/dla34-ba72cf86.pth \
      --visual_freq=20 \
      --save_step=5 \
      --save_path=./outputs/model

train list:

    frames_finalpass/TRAIN/B/0573/left/0006.png frames_finalpass/TRAIN/B/0573/right/0006.png disparity/TRAIN/B/0573/left/0006.pfm
    frames_finalpass/TRAIN/B/0573/left/0007.png frames_finalpass/TRAIN/B/0573/right/0007.png disparity/TRAIN/B/0573/left/0007.pfm
    frames_finalpass/TRAIN/B/0573/left/0008.png frames_finalpass/TRAIN/B/0573/right/0008.png disparity/TRAIN/B/0573/left/0008.pfm
    frames_finalpass/TRAIN/B/0573/left/0009.png frames_finalpass/TRAIN/B/0573/right/0009.png disparity/TRAIN/B/0573/left/0009.pfm
    frames_finalpass/TRAIN/B/0573/left/0010.png frames_finalpass/TRAIN/B/0573/right/0010.png disparity/TRAIN/B/0573/left/0010.pfm
    frames_finalpass/TRAIN/B/0573/left/0011.png frames_finalpass/TRAIN/B/0573/right/0011.png disparity/TRAIN/B/0573/left/0011.pfm
    frames_finalpass/TRAIN/B/0573/left/0012.png frames_finalpass/TRAIN/B/0573/right/0012.png disparity/TRAIN/B/0573/left/0012.pfm
    frames_finalpass/TRAIN/B/0573/left/0013.png frames_finalpass/TRAIN/B/0573/right/0013.png disparity/TRAIN/B/0573/left/0013.pfm
    frames_finalpass/TRAIN/B/0573/left/0014.png frames_finalpass/TRAIN/B/0573/right/0014.png disparity/TRAIN/B/0573/left/0014.pfm
    frames_finalpass/TRAIN/B/0299/left/0006.png frames_finalpass/TRAIN/B/0299/right/0006.png disparity/TRAIN/B/0299/left/0006.pfm
    frames_finalpass/TRAIN/B/0299/left/0007.png frames_finalpass/TRAIN/B/0299/right/0007.png disparity/TRAIN/B/0299/left/0007.pfm
    ......
    frames_finalpass/TRAIN/B/0299/left/0008.png frames_finalpass/TRAIN/B/0299/right/0008.png disparity/TRAIN/B/0299/left/0008.pfm
    frames_finalpass/TRAIN/B/0299/left/0009.png frames_finalpass/TRAIN/B/0299/right/0009.png disparity/TRAIN/B/0299/left/0009.pfm
    frames_finalpass/TRAIN/B/0299/left/0010.png frames_finalpass/TRAIN/B/0299/right/0010.png disparity/TRAIN/B/0299/left/0010.pfm
    frames_finalpass/TRAIN/B/0299/left/0011.png frames_finalpass/TRAIN/B/0299/right/0011.png disparity/TRAIN/B/0299/left/0011.pfm
    frames_finalpass/TRAIN/B/0299/left/0012.png frames_finalpass/TRAIN/B/0299/right/0012.png disparity/TRAIN/B/0299/left/0012.pfm

test list:

    frames_finalpass/TEST/B/0040/left/0006.png frames_finalpass/TEST/B/0040/right/0006.png disparity/TEST/B/0040/left/0006.pfm
    frames_finalpass/TEST/B/0040/left/0007.png frames_finalpass/TEST/B/0040/right/0007.png disparity/TEST/B/0040/left/0007.pfm
    frames_finalpass/TEST/B/0040/left/0008.png frames_finalpass/TEST/B/0040/right/0008.png disparity/TEST/B/0040/left/0008.pfm
    frames_finalpass/TEST/B/0040/left/0009.png frames_finalpass/TEST/B/0040/right/0009.png disparity/TEST/B/0040/left/0009.pfm
    frames_finalpass/TEST/B/0040/left/0010.png frames_finalpass/TEST/B/0040/right/0010.png disparity/TEST/B/0040/left/0010.pfm
    frames_finalpass/TEST/B/0040/left/0011.png frames_finalpass/TEST/B/0040/right/0011.png disparity/TEST/B/0040/left/0011.pfm
    frames_finalpass/TEST/B/0040/left/0012.png frames_finalpass/TEST/B/0040/right/0012.png disparity/TEST/B/0040/left/0012.pfm
    ......
    frames_finalpass/TEST/B/0040/left/0013.png frames_finalpass/TEST/B/0040/right/0013.png disparity/TEST/B/0040/left/0013.pfm
    frames_finalpass/TEST/B/0040/left/0014.png frames_finalpass/TEST/B/0040/right/0014.png disparity/TEST/B/0040/left/0014.pfm
    frames_finalpass/TEST/B/0133/left/0006.png frames_finalpass/TEST/B/0133/right/0006.png disparity/TEST/B/0133/left/0006.pfm
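One quick way to rule out a list/dataset mismatch is to check that every path in a list resolves under the dataset root. The sketch below assumes each list line holds three whitespace-separated relative paths (left image, right image, disparity), as in the lists pasted above; `check_stereo_list` is a hypothetical helper, not part of the hd3 repository.

```python
import os


def check_stereo_list(root, list_path, limit=None):
    """Return (line_number, problem) pairs for a stereo list file.

    Assumes every non-empty line is "left right disparity" with paths
    relative to `root` (an assumption based on the lists in this thread).
    """
    problems = []
    with open(list_path) as f:
        for lineno, line in enumerate(f, start=1):
            if limit is not None and lineno > limit:
                break
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if len(parts) != 3:
                problems.append(
                    (lineno, "expected 3 paths, got %d" % len(parts)))
                continue
            for rel in parts:
                if not os.path.isfile(os.path.join(root, rel)):
                    problems.append((lineno, rel))
    return problems
```

Calling it with the values passed to `--train_root` and `--train_list` and getting back an empty list would at least confirm that the files exist; it says nothing about whether their contents are valid.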

AITech-D commented 4 years ago

Are the D2V and V2D operations inverses of each other? I understand the V2D code in models/hd3_ops.py, vector2density(vect, c, dim), but I don't understand the D2V code, density2vector(prob, dim, normalize=True).

As for the error I got, I guess something is wrong with the loss function, because I have checked my input and the hd3 model, but I have no idea what is wrong with the loss. I would appreciate any help. Thank you very much.

yzcjtr commented 4 years ago

It seems you are not using the training/validation lists we provided, and your dataset structure is not the same as the official FlyingThings3D subset. I'm not sure why you added "Image.convert(RGB)" to your code, as our dataloader already works with the FlyingThings3D subset. Possibly the annotations you loaded are all zeros. The original FlyingThings3D dataset is problematic because its rendering is imperfect. Please redownload the subset from the official webpage.
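To test the "all zeros" hypothesis directly, one can read a disparity file and look at its statistics outside the training loop. The following is a minimal standard-library sketch, assuming the usual PFM layout (a `Pf`/`PF` header line, a `width height` line, a scale line whose sign gives the byte order, then raw float32 data). `read_pfm` and `summarize_disparity` are hypothetical names; this is not the repository's `fl.read_disp`.

```python
import math
import struct


def read_pfm(path):
    """Read a PFM file; return (width, height, channels, values)."""
    with open(path, "rb") as f:
        header = f.readline().decode("ascii").strip()
        if header not in ("Pf", "PF"):
            raise ValueError("not a PFM file: %r" % header)
        channels = 1 if header == "Pf" else 3
        width, height = map(int, f.readline().decode("ascii").split())
        scale = float(f.readline().decode("ascii").strip())
        endian = "<" if scale < 0 else ">"  # negative scale: little-endian
        count = width * height * channels
        raw = f.read(4 * count)
        if len(raw) != 4 * count:
            raise ValueError("truncated PFM data")
        values = struct.unpack("%s%df" % (endian, count), raw)
    return width, height, channels, values


def summarize_disparity(values):
    """Basic statistics; NaN/inf entries are excluded from min/max/mean."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        raise ValueError("no finite values")
    nonzero = sum(1 for v in finite if v != 0.0)
    return {
        "min": min(finite),
        "max": max(finite),
        "mean": sum(finite) / len(finite),
        "nonzero_fraction": nonzero / len(values),
    }
```

PFM stores rows bottom-to-top, which does not matter for these statistics. A `nonzero_fraction` near zero on the ground-truth files, or on whatever the dataloader actually returns, would be consistent with the low loss and near-zero predictions reported above.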

As for the D2V and V2D operations, you can refer to our paper for their principles.
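For intuition only, here is a 1-D toy sketch of the two operations, assuming V2D spreads a scalar displacement over its two nearest integer bins with linear weights and D2V takes the expectation of the density over the bin offsets. The function names, the clamping, and the integer bin layout are illustrative assumptions; this is not the code in models/hd3_ops.py.

```python
import math


def vector2density_1d(v, radius):
    """Toy V2D: linear weights on the two bins nearest to v.

    Bins are the integer offsets -radius..radius. Values outside that
    range are clamped (an assumption made for this sketch).
    """
    v = max(-radius, min(radius, v))
    density = [0.0] * (2 * radius + 1)
    lo = math.floor(v)
    frac = v - lo
    density[lo + radius] += 1.0 - frac
    if frac > 0.0:
        density[lo + 1 + radius] += frac
    return density


def density2vector_1d(density, radius):
    """Toy D2V: expected offset under the (normalized) density."""
    total = sum(density)
    return sum(p * (i - radius) for i, p in enumerate(density)) / total
```

Under these assumptions D2V undoes V2D for any in-range value, because the linear weights are chosen so that their expectation equals `v`. The reverse does not hold: many different densities share the same mean, so applying V2D to the result of D2V generally does not recover the original density.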

AITech-D commented 4 years ago

Thank you. There was something wrong in the dataloader, and I have solved it.

wmn931201 commented 3 years ago

Hi @AITech-D, I have the same problem as you. How did you solve it? Thank you very much!