open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

SUN RGB-D, wrong values, when creating .pkl files #507

Closed virusapex closed 3 years ago

virusapex commented 3 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug When calibration files from the SUN RGB-D dataset are read, MMDetection3D stores the Rt values in both the Rt and K fields of the generated infos.pkl files.
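
A quick way to see the symptom in a generated info file (a rough sketch; it assumes the default `sunrgbd_infos_train.pkl` output name and that each info stores its calibration under a `calib` dict with `K` and `Rt` keys):

    import pickle
    import numpy as np

    # Hypothetical path; adjust to wherever create_data.py wrote the infos.
    with open('data/sunrgbd/sunrgbd_infos_train.pkl', 'rb') as f:
        infos = pickle.load(f)

    calib = infos[0]['calib']
    # With the bug, K is just a copy of Rt, so this prints True.
    print(np.allclose(calib['K'], calib['Rt']))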

Reproduction

  1. What command or script did you run?

    python tools/create_data.py sunrgbd --root-path ./data/sunrgbd --out-dir ./data/sunrgbd --extra-tag sunrgbd

  2. Did you make any modifications on the code or config? Did you understand what you have modified?

    Nothing was changed.

  3. What dataset did you use?

    SUN RGB-D

Environment

  1. Please run `python mmdet3d/utils/collect_env.py` to collect the necessary environment information and paste it here.
    
    sys.platform: linux
    Python: 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
    CUDA available: True
    GPU 0: Tesla V100-SXM2-32GB
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 10.1, V10.1.243
    GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
    PyTorch: 1.8.0.dev20210103+cu101
    PyTorch compiling details: PyTorch built with:
    - GCC 7.3
    - C++ Version: 201402
    - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
    - OpenMP 201511 (a.k.a. OpenMP 4.5)
    - NNPACK is enabled
    - CPU capability usage: AVX2
    - CUDA Runtime 10.1
    - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
    - CuDNN 7.6.3
    - Magma 2.5.2
    - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.1, CUDNN_VERSION=7.6.3, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

    TorchVision: 0.9.0.dev20210103+cu101
    OpenCV: 4.4.0
    MMCV: 1.3.1
    MMCV Compiler: GCC 7.5
    MMCV CUDA Compiler: 10.1
    MMDetection: 2.11.0
    MMDetection3D: 0.12.0+e21e61e

  2. You may add additional information that may be helpful for locating the problem, such as
    - How you installed PyTorch [e.g., pip, conda, source]
    - Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)

**Error traceback**
If applicable, paste the error traceback here.

**Bug fix**
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

In `/tools/data_converter/sunrgbd_data_utils.py`, the following function should probably be changed from this:

    def get_calibration(self, idx):
        calib_filepath = osp.join(self.calib_dir, f'{idx:06d}.txt')
        lines = [line.rstrip() for line in open(calib_filepath)]
        Rt = np.array([float(x) for x in lines[0].split(' ')])
        Rt = np.reshape(Rt, (3, 3), order='F').astype(np.float32)
        K = np.array([float(x) for x in lines[1].split(' ')])
        # Bug: reshapes Rt instead of K, so K ends up holding the Rt values.
        K = np.reshape(Rt, (3, 3), order='F').astype(np.float32)
        return K, Rt

to this:

    def get_calibration(self, idx):
        calib_filepath = osp.join(self.calib_dir, f'{idx:06d}.txt')
        lines = [line.rstrip() for line in open(calib_filepath)]
        Rt = np.array([float(x) for x in lines[0].split(' ')])
        Rt = np.reshape(Rt, (3, 3), order='F').astype(np.float32)
        K = np.array([float(x) for x in lines[1].split(' ')])
        # Fixed: reshape K itself.
        K = np.reshape(K, (3, 3), order='F').astype(np.float32)
        return K, Rt

Tai-Wang commented 3 years ago

Thanks for your bug report. Could you please create a PR to fix it? BTW, does this bug affect any models on SUNRGBD?

virusapex commented 3 years ago

> Thanks for your bug report. Could you please create a PR to fix it? BTW, does this bug affect any models on SUNRGBD?

Yes, I have created a PR. Hopefully it is correct, since it's my first time. I've trained the VoteNet model and got an accuracy similar to the one you posted in Readme.md, but that was without this fix. I'm not sure if changing this value will change the accuracy.

Wuziyi616 commented 3 years ago

As far as I know, VoteNet uses only 3D point clouds without 2D images, while K and Rt are used for the 3D-to-2D projection. So this issue won't affect the performance of VoteNet. But it might affect ImVoteNet, which uses the projection and K. Maybe we can train an ImVoteNet after the fix to check its performance?
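
For intuition, K and Rt are used roughly like this to project a 3D point to pixel coordinates (a numpy sketch of the general pinhole-projection idea, not the exact mmdet3d code path):

    import numpy as np

    def project_to_image(points_3d, K, Rt):
        # points_3d: (N, 3) points in the depth frame; Rt: (3, 3) rotation
        # into the camera frame; K: (3, 3) camera intrinsics.
        cam = points_3d @ Rt.T           # rotate into the camera frame
        uvw = cam @ K.T                  # apply the intrinsics
        return uvw[:, :2] / uvw[:, 2:3]  # perspective divide -> (u, v) pixels

With the bug, K silently holds the Rt values, so any such projection would be wrong.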

yezhen17 commented 3 years ago

Yes, this bug will affect the performance of ImVoteNet, but this fix should solve the problem. I believe this bug is also the root cause of #448.

virusapex commented 3 years ago

> As far as I know, VoteNet uses only 3D point clouds without 2D images, while K and Rt are used for the 3D-to-2D projection. So this issue won't affect the performance of VoteNet. But it might affect ImVoteNet, which uses the projection and K. Maybe we can train an ImVoteNet after the fix to check its performance?

Hello again! Sorry for re-opening the issue, but as per your suggestion, I was able to re-train the ImVoteNet model after the fix, training both the first and second stages myself. I got 61.99 AP@0.25, which is more or less in line with your 64.04, albeit with a noticeable gap. It seems the model didn't suffer much from the bug.

yezhen17 commented 3 years ago

Hi, I think 61.99 is a bit low. Do you mean 61.99 is achieved with the correct code?

virusapex commented 3 years ago

> Hi, I think 61.99 is a bit low. Do you mean 61.99 is achieved with the correct code?

Yeah, the model just finished training. I re-generated the dataset two days ago. Although I'm not entirely sure about training the 2nd stage: am I supposed to link the weights I got from the first stage in the config, or should I have just used yours? In either case, I would get warnings like `missing keys in source state_dict: pts_backbone.SA_modules.0.mlps.0.layer0.conv.weight, pts_backbone.SA_modules.0.mlps.0.layer0.bn.weight, pts_backbone.SA_modules.0.mlps.0.layer0.bn.bias, pts_backbone.SA_modules.0.mlps.0.layer0.bn.running_mean, pts_backbone.SA_modules.0.mlps.0.layer0.bn.running_var` and so on, which is understandable since we are only using a 2D network to initialize a 3D one and they have different layers. Or maybe I'm doing something wrong here.

yezhen17 commented 3 years ago

The missing keys warning is normal because we did not load the 3D backbone. You can link either the weights you trained or the ones provided by us. How much did you get from the first stage?

There is also a possibility that the run was simply unlucky. As SUN RGB-D is not a very big dataset, some fluctuation is expected.
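
For what it's worth, the warning itself is just standard PyTorch partial loading; a toy illustration with made-up module names:

    import torch.nn as nn

    class ToyFusionModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.img_backbone = nn.Linear(4, 4)
            self.pts_backbone = nn.Linear(4, 4)

    # A checkpoint that only covers the image branch, like the stage-1 2D weights.
    img_only_ckpt = {k: v for k, v in ToyFusionModel().state_dict().items()
                     if k.startswith('img_backbone')}

    model = ToyFusionModel()
    result = model.load_state_dict(img_only_ckpt, strict=False)
    print(result.missing_keys)  # lists the pts_backbone.* keys, as in the warning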

virusapex commented 3 years ago

Ok, makes sense then. I got mAP@0.5 = 61.08, which is, again, lower than your 62.7. I could be unlucky =)

virusapex commented 3 years ago

Oh, since I forgot to mention you, @THU17cyz, you probably didn't see my previous message. BTW, an unrelated question, sorry for bothering you: I don't completely understand the sizes of the model weights, since the first stage is 330 MB while the 2nd stage is just 190 MB, yet the ImVoteNet model only grows in complexity as fusion layers are added, does it not?

Wuziyi616 commented 3 years ago

@THU17cyz Could you have a look at this comment?

yezhen17 commented 3 years ago

> Oh, since I forgot to mention you, @THU17cyz, you probably didn't see my previous message. BTW, an unrelated question, sorry for bothering you: I don't completely understand the sizes of the model weights, since the first stage is 330 MB while the 2nd stage is just 190 MB, yet the ImVoteNet model only grows in complexity as fusion layers are added, does it not?

Hi @virusapex ,

Terribly sorry that I forgot to reply to you. I was busy with graduation-related things the past two weeks.

The odd difference in the model weight sizes is probably because the img branch isn't frozen in the first stage but is frozen in the second stage.

As for the performance, mAP@0.5 = 61.08 in stage 1 is also reasonable.
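
As a toy sketch of what freezing a branch means in PyTorch terms (the module names here are illustrative, not the actual ImVoteNet ones):

    import torch
    import torch.nn as nn

    model = nn.ModuleDict({
        'img_branch': nn.Linear(8, 8),
        'pts_branch': nn.Linear(8, 8),
    })

    # Freeze the image branch; its weights stay fixed during stage-2 training.
    for p in model['img_branch'].parameters():
        p.requires_grad = False

    # Only the still-trainable parameters need optimizer state.
    optimizer = torch.optim.Adam(
        p for p in model.parameters() if p.requires_grad)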

virusapex commented 3 years ago

> Oh, since I forgot to mention you, @THU17cyz, you probably didn't see my previous message. BTW, an unrelated question, sorry for bothering you: I don't completely understand the sizes of the model weights, since the first stage is 330 MB while the 2nd stage is just 190 MB, yet the ImVoteNet model only grows in complexity as fusion layers are added, does it not?
>
> Hi @virusapex ,
>
> Terribly sorry that I forgot to reply to you. I was busy with graduation-related things the past two weeks.
>
> The odd difference in the model weight sizes is probably because the img branch isn't frozen in the first stage but is frozen in the second stage.
>
> As for the performance, mAP@0.5 = 61.08 in stage 1 is also reasonable.

Hey, @THU17cyz ,

Don't sweat it! Wishing you the best!

Yeah, I can see that the image branch is frozen, hence the `freeze_img_branch=True` statement in the config, but how exactly does that reduce the size of the final model weights? All the weights are still included, no? It seems there is some kind of an issue, because:

- Your pretrained ImVoteNet: first stage model = 158 MB, second stage = 166 MB.
- The ones I'm getting: first stage model = 315 MB, second stage = 181 MB.

Wuziyi616 commented 3 years ago

Is it possible that your saved checkpoint includes state_dict of optimizer while ours doesn't?

yezhen17 commented 3 years ago

> Is it possible that your saved checkpoint includes state_dict of optimizer while ours doesn't?

I think @Wuziyi616 got the point. The model ckpts in the model zoo are processed through this script, so the optimizer state_dict is deleted from the ckpt file.

And since the img branch parameters are frozen in the second stage, there is less related data in the optimizer state_dict.
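
Roughly, that post-processing amounts to something like the following (a sketch of the idea, assuming a standard checkpoint dict that stores the optimizer state under an `optimizer` key; file names are hypothetical):

    import torch

    def strip_optimizer(in_file, out_file):
        # Load the full training checkpoint (model weights + optimizer state).
        ckpt = torch.load(in_file, map_location='cpu')
        # Drop the optimizer state; only the model weights are needed for release.
        ckpt.pop('optimizer', None)
        torch.save(ckpt, out_file)

    # strip_optimizer('epoch_36.pth', 'imvotenet_stage2_published.pth')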

virusapex commented 3 years ago

> Is it possible that your saved checkpoint includes state_dict of optimizer while ours doesn't?
>
> I think @Wuziyi616 got the point. The model ckpts in the model zoo are processed through this script, so the optimizer state_dict is deleted from the ckpt file.
>
> And since the img branch parameters are frozen in the second stage, there is less related data in the optimizer state_dict.

Oh, I didn't know about that. Thank you for the clarification; I gotta study PyTorch more =) I believe we can close the issue.

yezhen17 commented 3 years ago

> Is it possible that your saved checkpoint includes state_dict of optimizer while ours doesn't?
>
> I think @Wuziyi616 got the point. The model ckpts in the model zoo are processed through this script, so the optimizer state_dict is deleted from the ckpt file. And since the img branch parameters are frozen in the second stage, there is less related data in the optimizer state_dict.
>
> Oh, I didn't know about that. Thank you for the clarification; I gotta study PyTorch more =) I believe we can close the issue.

Happy to be helpful :-) Closing this issue now.