Problem about training - GitHub Issues

cht123456abc commented 2 years ago

Hello, I have two questions about training. When I run <python train_stereo.py --train_datasets middlebury_Q --batch_size 2 --train_iters 22 --valid_iters 32 --n_downsample 3 --num_steps 200 --mixed_precision> on a GPU with 8 GB of memory, an error occurs at some data point, as shown in the figure below:

A different error occurs at some point when I run <python train_stereo.py --train_datasets middlebury_Q --batch_size 1 --train_iters 22 --valid_iters 32 --n_downsample 2 --num_steps 200 --mixed_precision>.

If --batch_size is 2 and --n_downsample is 2, there are no problems. However, I have to set batch_size to 1 because of my limited GPU memory. Could you help me with this problem? Please forgive my ignorance. Any information would be helpful. Thanks.

lahavlipson commented 2 years ago

These two errors look related. It appears that the image dimensions are incorrect (maybe you modified the dataloader/dataset/augmentor?). Probably the batch dimension got removed at some point.

I can't speak to exactly why your code only works when the batch size is > 1. However, note that the feature extractor downsamples the image by 4x, and the correlation pyramid downsamples the width of the image by another 8x. If a tensor has a width of only a few pixels (as in the first error), downsampling further will give you a zero-width tensor.
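To make the width arithmetic above concrete, here is a minimal sketch in plain Python. It rests on several assumptions that are not confirmed in this thread: that the feature extractor reduces the width by a factor of 2**n_downsample (which gives the 4x mentioned above when n_downsample is 2), that the correlation pyramid has four levels, and that each level halves the width with floor division. The function names and the example widths are hypothetical and are not part of the RAFT-Stereo code.

```python
def pyramid_widths(image_width, n_downsample=2, num_levels=4):
    """Estimate the tensor width at each correlation-pyramid level.

    Assumptions (not taken from the repository): the feature extractor
    divides the width by 2**n_downsample, and every further pyramid
    level halves the width, rounding down.
    """
    width = image_width // (2 ** n_downsample)
    widths = [width]
    for _ in range(num_levels - 1):
        width //= 2  # each extra level halves the width
        widths.append(width)
    return widths


def has_zero_width(image_width, n_downsample=2, num_levels=4):
    """Return True if any pyramid level would end up with zero width."""
    return any(w == 0 for w in pyramid_widths(image_width, n_downsample, num_levels))


# Hypothetical input widths, chosen only for illustration.
print(pyramid_widths(320, n_downsample=2))  # [80, 40, 20, 10]
print(pyramid_widths(24, n_downsample=3))   # [3, 1, 0, 0]
```

Under these assumptions, a wide input keeps a usable width at every level, while an input that is only a few pixels wide after feature extraction collapses to zero width in the deeper levels, which matches the failure described above.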

cht123456abc commented 2 years ago

Thanks a lot! If I replace the dataset middlebury_Q with middlebury_H and set n_downsample to 3, the errors above disappear.

princeton-vl / RAFT-Stereo

Problem about training #21