jbwang1997 / OBBDetection

OBBDetection is an oriented object detection library, which is based on MMdetection.
Apache License 2.0

Distributed Multi-GPU Training did not decrease training time #169

Open chandlerbing65nm opened 2 years ago

chandlerbing65nm commented 2 years ago

When I tried distributed training on 2 RTX A100 GPUs with a batch size of 4 images per GPU, the training time did not decrease.
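
For reference, here is a rough sketch of the arithmetic behind that setup. With a fixed per-GPU batch size, adding a second GPU should leave the work per iteration on each GPU unchanged and instead halve the number of iterations per epoch. So time per iteration would be expected to stay about the same, and any speed-up would show up in time per epoch. The dataset size and the function and parameter names below are hypothetical, chosen only for illustration.

```python
import math


def iterations_per_epoch(dataset_size: int, samples_per_gpu: int, num_gpus: int) -> int:
    """Iterations needed to pass over the dataset once in data-parallel training."""
    # Each iteration consumes one batch per GPU, so the effective batch grows with GPU count.
    effective_batch = samples_per_gpu * num_gpus
    return math.ceil(dataset_size / effective_batch)


dataset_size = 10_000  # hypothetical number of training images

single = iterations_per_epoch(dataset_size, samples_per_gpu=4, num_gpus=1)
double = iterations_per_epoch(dataset_size, samples_per_gpu=4, num_gpus=2)

print(single, double)  # 2500 1250
```

If the schedule is defined in epochs, the comparison that matters is total wall-clock time per epoch, not the per-iteration time printed in the log.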

When I changed the batch size to 8 images per GPU, I got this error:

Traceback (most recent call last):
  File "/opt/conda/envs/apdetection1/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/opt/conda/envs/apdetection1/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/envs/apdetection1/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
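
The last frame can be reproduced in isolation. The traceback shows `multiprocessing/connection.py` packing the message length with `struct.pack("!i", n)`, a signed 32-bit integer, so the error suggests that a single object sent through a multiprocessing queue was larger than about 2 GiB once serialized. Which object that was cannot be determined from the traceback alone. The helper name below is hypothetical; only `struct.pack` and `struct.error` come from the standard library.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit header can hold


def pack_length_header(n: int) -> bytes:
    """Pack a message length the same way the traceback's last frame does."""
    return struct.pack("!i", n)


# A length at the limit still fits in the 4-byte header.
header = pack_length_header(INT32_MAX)

# One byte more overflows the header and raises struct.error.
try:
    pack_length_header(INT32_MAX + 1)
    overflowed = False
except struct.error as exc:
    overflowed = True
    print(exc)
```

If this is the cause, reducing the size of whatever is passed between processes in one message (for example, a smaller batch per worker) would be the first thing to try.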

622tongtong commented 1 month ago

Hi. Have you managed to resolve this issue? I am experiencing the same problem: with multiple GPUs, each GPU uses the same amount of memory as it does when training on a single GPU. If you have any solutions or suggestions, could you please share them with me? Thank you very much!