shivangi-aneja / COSMOS

[AAAI 2023] COSMOS: Catching Out-of-Context Misinformation using Self Supervised Learning
MIT License

Training error about detectron2 #2

Closed · engine210 closed 3 years ago

engine210 commented 3 years ago

Hi,

During training, I encountered the error below:

2021-06-02 19:47:08.050507: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-06-02 19:47:10.647322: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-06-02 19:47:14.161774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:3b:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-06-02 19:47:14.163223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:5e:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-06-02 19:47:14.164597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties:
pciBusID: 0000:86:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-06-02 19:47:14.165981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties:
pciBusID: 0000:af:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-06-02 19:47:14.166042: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-06-02 19:47:14.168723: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-06-02 19:47:14.170526: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-06-02 19:47:14.171492: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-06-02 19:47:14.175216: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-06-02 19:47:14.177022: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-06-02 19:47:14.182611: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-06-02 19:47:14.192325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
2021-06-02 19:47:14.193076: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-02 19:47:14.205182: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2100000000 Hz
2021-06-02 19:47:14.206884: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xa3fe5b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-02 19:47:14.206923: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-06-02 19:47:14.748093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:3b:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-06-02 19:47:14.749386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:5e:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-06-02 19:47:14.750610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties:
pciBusID: 0000:86:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-06-02 19:47:14.751809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties:
pciBusID: 0000:af:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-06-02 19:47:14.751872: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-06-02 19:47:14.751902: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-06-02 19:47:14.751941: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-06-02 19:47:14.751958: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-06-02 19:47:14.751976: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-06-02 19:47:14.751996: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-06-02 19:47:14.752014: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-06-02 19:47:14.761251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
2021-06-02 19:47:14.761304: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-06-02 19:47:16.808596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-02 19:47:16.808665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 1 2 3
2021-06-02 19:47:16.808686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N Y Y Y
2021-06-02 19:47:16.808696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 1:   Y N Y Y
2021-06-02 19:47:16.808706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 2:   Y Y N Y
2021-06-02 19:47:16.808716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 3:   Y Y Y N
2021-06-02 19:47:16.816101: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-06-02 19:47:16.816159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30132 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2021-06-02 19:47:16.819248: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-06-02 19:47:16.819287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30132 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:5e:00.0, compute capability: 7.0)
2021-06-02 19:47:16.822083: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-06-02 19:47:16.822118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 30132 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0)
2021-06-02 19:47:16.824747: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-06-02 19:47:16.824781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 30132 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0)
2021-06-02 19:47:16.827988: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a36fae0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-06-02 19:47:16.828018: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-PCIE-32GB, Compute Capability 7.0
2021-06-02 19:47:16.828030: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Tesla V100-PCIE-32GB, Compute Capability 7.0
2021-06-02 19:47:16.828041: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): Tesla V100-PCIE-32GB, Compute Capability 7.0
2021-06-02 19:47:16.828058: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): Tesla V100-PCIE-32GB, Compute Capability 7.0
Total Params 2559576
Img Model 2405676
Text Model 153900
Loading Saved Model
  0%|                                                                                                                                                                                                 | 0/2528 [00:00<?, ?it/s]2021-06-02 19:47:48.137599: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
  2%|███▍                                                                                                                                                                                    | 48/2528 [01:08<48:47,  1.18s/it]Traceback (most recent call last):
  File "trainer_scipt.py", line 232, in <module>
    train_joint_model()
  File "trainer_scipt.py", line 156, in train_joint_model
    train_model(epoch)
  File "trainer_scipt.py", line 85, in train_model
    z_img, z_t_match, z_t_diff = combined_model(img, text_match, text_diff, batch, seq_len_match, seq_len_diff,
  File "/home/engine210/MMFinal2/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/engine210/MMFinal2/COSMOS/model_archs/models.py", line 51, in forward
    img = self.maskrcnn_extractor(img, bboxes, bbox_classes)
  File "/home/engine210/MMFinal2/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/engine210/MMFinal2/COSMOS/model_archs/image/image_models.py", line 43, in forward
    targets = [annotations_to_instances(bbox.cpu().numpy(), bbox_class.cpu().numpy(), img_shape) for
  File "/home/engine210/MMFinal2/COSMOS/model_archs/image/image_models.py", line 43, in <listcomp>
    targets = [annotations_to_instances(bbox.cpu().numpy(), bbox_class.cpu().numpy(), img_shape) for
  File "/home/engine210/MMFinal2/COSMOS/utils/img_model_utils.py", line 23, in annotations_to_instances
    target.classes = classes
  File "/home/engine210/MMFinal2/detectron2/detectron2/structures/instances.py", line 61, in __setattr__
    self.set(name, val)
  File "/home/engine210/MMFinal2/detectron2/detectron2/structures/instances.py", line 76, in set
    assert (
AssertionError: Adding a field of length 11 to a Instances of length 1

My environment: I installed detectron2 v0.3 (commit 4841e70) with the modified code provided in this repo. I suspect the problem is the detectron2 version. May I ask which version (or, more specifically, which commit) of detectron2 we should use for this project?
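
For reference, the assertion comes from detectron2's Instances container, which requires every field added to have the same length as the fields already stored. Below is a minimal sketch (my own illustration, not the project's code; it assumes torch and detectron2 v0.3 are installed) that reproduces the same failure, i.e. a per-image classes array of length 11 being paired with a boxes field of length 1:

import torch
from detectron2.structures import Boxes, Instances

target = Instances((480, 640))                # image size (h, w); values arbitrary for this sketch
target.gt_boxes = Boxes(torch.zeros((1, 4)))  # first field fixes the Instances length to 1
target.classes = torch.arange(11)             # second field has length 11, so detectron2 raises
                                              # AssertionError: Adding a field of length 11
                                              # to a Instances of length 1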

shivangi-aneja commented 3 years ago

It's odd that you're only hitting the error after 2% of training. What batch size are you using? As a sanity check, could you try running on tiny subsets of the dataset (say, just 1 sample, then 32 samples, and so on) and see where it breaks?
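
As a rough illustration (a generic PyTorch sketch, not the repo's trainer_scipt.py), the sanity check could look like this:

import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Stand-in dataset; in practice this would be the COSMOS training dataset object.
full_dataset = TensorDataset(torch.randn(2528, 3), torch.randint(0, 2, (2528,)))

for n in (1, 32, 256, len(full_dataset)):
    loader = DataLoader(Subset(full_dataset, range(n)), batch_size=64, shuffle=False)
    for batch in loader:
        pass  # replace with the real forward/backward pass
    print(f"subset of {n} samples iterated without error")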

shivangi-aneja commented 3 years ago

Closing the issue due to inactivity

engine210 commented 3 years ago

I tried training with just 32 examples and it worked fine.

I also tried to reproduce this error, but most of the time I ran into the bug described in https://github.com/shivangi-aneja/COSMOS/issues/3#issue-915832803 instead. After fixing that augmentation problem, neither bug has occurred again. Thanks for your help!