pengsongyou / openscene

[CVPR'23] OpenScene: 3D Scene Understanding with Open Vocabularies
https://pengsongyou.github.io/openscene
Apache License 2.0

Eval on train dataset error #62

Closed by OrangeSodahub 11 months ago

OrangeSodahub commented 11 months ago

Hi, did you ever try evaluating on the train split, i.e. setting split to train in, e.g., ours_openseg_pretrained.yaml? I tried, but got errors:

bash run/eval.sh out/scannet_openseg config/scannet/ours_openseg_pretrained.yaml ensemble
+ exp_dir=out/scannet_openseg
+ config=config/scannet/ours_openseg_pretrained.yaml
+ feature_type=ensemble
+ mkdir -p out/scannet_openseg
+ result_dir=out/scannet_openseg/result_eval
+ export PYTHONPATH=.
+ PYTHONPATH=.
+ python -u run/evaluate.py --config=config/scannet/ours_openseg_pretrained.yaml feature_type ensemble save_folder out/scannet_openseg/result_eval
++ date +%Y%m%d_%H%M
+ tee -a out/scannet_openseg/eval-20231104_0753.log
/home/ubuntu/work/env/lib/python3.8/site-packages/MinkowskiEngine/__init__.py:36: UserWarning: The environment variable `OMP_NUM_THREADS` not set. MinkowskiEngine will automatically set `OMP_NUM_THREADS=16`. If you want to set `OMP_NUM_THREADS` manually, please export it on the command line before running a python script. e.g. `export OMP_NUM_THREADS=12; python your_program.py`. It is recommended to set it below 24.
  warnings.warn(
torch.__version__:2.0.1+cu118
torch.version.cuda:11.8
torch.backends.cudnn.version:8700
torch.backends.cudnn.enabled:True
[2023-11-04 07:53:17,392 evaluate.py line 154] arch_3d: MinkUNet18A
aug: True
base_lr: 0.0001
batch_size: 8
batch_size_val: 1
classes: 20
data_root: data/openscene/scannet_3d
data_root_2d_fused_feature: data/openscene/scannet_multiview_openseg
dist_backend: nccl
dist_url: tcp://127.0.0.1:6787
distributed: False
epochs: 100
eval_freq: 1
evaluate: True
feature_2d_extractor: openseg
feature_type: ensemble
ignore_label: 255
input_color: False
loop: 5
loss_type: cosine
manual_seed: 1463
mark_no_feature_to_unknown: True
model_path: https://cvg-data.inf.ethz.ch/openscene/models/scannet_openseg.pth.tar
momentum: 0.9
multiprocessing_distributed: False
ngpus_per_node: 1
power: 0.9
print_freq: 10
prompt_eng: True
rank: 0
resume: None
save_feature_as_numpy: True
save_folder: out/scannet_openseg/result_eval
save_freq: 1
save_path: None
split: train
start_epoch: 0
sync_bn: False
test_batch_size: 1
test_gpu: [0]
test_repeats: 1
test_workers: 8
train_gpu: [0]
use_apex: False
use_shm: False
vis_gt: False
vis_input: False
vis_pred: False
voxel_size: 0.02
workers: 8
world_size: 1
Use prompt engineering: a XX in a scene
Loading CLIP ViT-L/14@336px model...
Finish loading
[2023-11-04 07:53:32,226 evaluate.py line 268] 
Evaluation 1 out of 1 runs...

  0%|          | 0/1201 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [96,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [97,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [98,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [99,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [100,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [101,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [102,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [103,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [104,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [105,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [106,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [107,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [108,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [109,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [110,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [111,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [112,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [113,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [114,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [115,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [116,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [117,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [118,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [119,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [120,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [121,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [122,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [123,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [124,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [125,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [126,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3124,0,0], thread: [127,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
  0%|          | 0/1201 [00:01<?, ?it/s]
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa4213874d7 in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa42135136b in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa421423b58 in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x12523e5 (0x7fa31afd23e5 in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x4d56c6 (0x7fa425a266c6 in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3ee77 (0x7fa42136ce77 in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7fa42136569e in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fa4213657b9 in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x75acc8 (0x7fa425cabcc8 in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7fa425cac075 in /home/ubuntu/work/env/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: python() [0x5ed6cb]
frame #11: python() [0x5af10a]
frame #12: python() [0x614085]
<omitting python frames>
frame #23: python() [0x67dbf1]
frame #24: python() [0x67dc6f]
frame #25: python() [0x67dd11]
frame #29: __libc_start_main + 0xf3 (0x7fa42a34c083 in /lib/x86_64-linux-gnu/libc.so.6)

If I set split to val, it runs fine.

Any advice would be appreciated. If this is a bug, I hope it can be fixed.

OrangeSodahub commented 11 months ago

The error comes from indexing with inds_inverse:

https://github.com/pengsongyou/openscene/blob/0f369bc73d0724ae24b5e46bbada193f8ee9d193/run/evaluate.py#L303

feat_3d shape: torch.Size([18840, 768])
inds_inverse shape: torch.Size([81369])

pengsongyou commented 11 months ago

No, I have not tried evaluating on the train set. During training we input the entire point cloud (81369 points here) but supervise with the features of only a subset of the points (as you can see, only 18840 points have features). This is done to save GPU memory. If you really want to evaluate on the training set, you need to modify our feature fusion code accordingly.
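
The mismatch described above can be reproduced at toy scale. The following is a minimal sketch, not the repository's feature fusion code: NumPy arrays stand in for the PyTorch tensors, and `expand_to_full`, `has_feat`, and the toy sizes are hypothetical names and values chosen for illustration. It assumes a boolean mask is available that marks which points carry a feature, and that the gather indices address the full set of points.

```python
import numpy as np

def expand_to_full(feat_subset, has_feat, fill=0.0):
    """Scatter features of a subset of points back onto the full point cloud.

    feat_subset: (K, C) features for the K points that have one.
    has_feat:    (N,) boolean mask, True where a point has a feature.
    Returns an (N, C) array; rows without a feature are set to `fill`.
    """
    has_feat = np.asarray(has_feat, dtype=bool)
    n_selected = int(has_feat.sum())
    if n_selected != feat_subset.shape[0]:
        raise ValueError(
            f"mask selects {n_selected} points but "
            f"{feat_subset.shape[0]} features were given"
        )
    full = np.full((has_feat.shape[0], feat_subset.shape[1]), fill,
                   dtype=feat_subset.dtype)
    full[has_feat] = feat_subset
    return full

# Toy sizes standing in for 18840 features over 81369 points.
n_points, dim = 10, 3
has_feat = np.zeros(n_points, dtype=bool)
has_feat[[1, 4, 5, 8]] = True                      # 4 points carry a feature
feat_subset = np.arange(4 * dim, dtype=np.float32).reshape(4, dim)

# Hypothetical gather indices that address the full point cloud.
inds_inverse = np.array([0, 2, 9, 5, 5, 1])

try:
    feat_subset[inds_inverse]      # index 9 does not exist in a 4-row array
    direct_ok = True
except IndexError:
    direct_ok = False

feat_full = expand_to_full(feat_subset, has_feat)
gathered = feat_full[inds_inverse]  # every index is now in range
```

Points without a feature receive zeros here, which is only one possible choice; whether such points should be masked out of the evaluation instead is a separate decision that this sketch does not address.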

OrangeSodahub commented 11 months ago

After I modified those two arguments, feat_fuse no longer raises an error, but the same type of error occurs at https://github.com/pengsongyou/openscene/blob/0f369bc73d0724ae24b5e46bbada193f8ee9d193/run/evaluate.py#L307
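
Since CUDA reports device-side asserts asynchronously (the log above itself suggests `CUDA_LAUNCH_BLOCKING=1`), a plain bounds check run before each indexing operation may help pin down which gather is failing. The sketch below is an assumption-laden illustration rather than part of the repository: `check_gather_indices` is a hypothetical helper that works on any sequence of integers and mirrors the condition in the assertion message (`index >= -sizes[i] && index < sizes[i]`).

```python
def check_gather_indices(num_rows, indices, name="indices"):
    """Raise IndexError if any index cannot address an array with num_rows rows.

    Mirrors the condition in the CUDA assertion: -num_rows <= index < num_rows.
    Returns the number of indices checked when all are valid.
    """
    indices = list(indices)
    bad = [(pos, idx) for pos, idx in enumerate(indices)
           if not -num_rows <= idx < num_rows]
    if bad:
        pos, idx = bad[0]
        raise IndexError(
            f"{name}: {len(bad)} of {len(indices)} entries out of range for "
            f"{num_rows} rows (first: position {pos}, value {idx})"
        )
    return len(indices)

# Toy stand-ins for the tensors in the thread.
num_rows = 5
ok = check_gather_indices(num_rows, [0, 4, 2, -5], name="inds_inverse")

try:
    check_gather_indices(num_rows, [0, 5, 7], name="inds_inverse")
    message = ""
except IndexError as err:
    message = str(err)
```

With PyTorch tensors, one could pass `tensor.shape[0]` and `inds.tolist()` to the helper, so that the failure surfaces as a readable Python exception naming the offending index.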