Distributed Multi-GPU Training did not decrease training time

jbwang1997 / OBBDetection

OBBDetection is an oriented object detection library based on MMDetection.

Apache License 2.0

518 stars, 111 forks

When I tried distributed training on 2 RTX A100 GPUs with a batch size of 4 images per GPU, the training time did not decrease.

When I change the batch size to 8 images per GPU, I get this error:

Traceback (most recent call last):
  File "/opt/conda/envs/apdetection1/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/opt/conda/envs/apdetection1/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/envs/apdetection1/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
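
For context, the last frame of the traceback packs the message length `n` into a signed 32-bit header with `struct.pack("!i", n)`, so the error suggests that an object larger than 2147483647 bytes was pushed through a `multiprocessing` queue. The snippet below is only a minimal sketch of that limit using the standard library; the helper name `fits_in_int32_header` is hypothetical and is not part of OBBDetection or MMDetection, and it does not identify which object in the training run was too large.

```python
import struct

# Largest value a signed 32-bit integer can hold (the bound quoted in the error).
INT32_MAX = 2**31 - 1


def fits_in_int32_header(n: int) -> bool:
    """Return True if a payload of n bytes fits the "!i" length header.

    Hypothetical helper for illustration: it repeats the struct.pack call
    shown in the traceback and reports whether it succeeds.
    """
    try:
        struct.pack("!i", n)
        return True
    except struct.error:
        return False


print(fits_in_int32_header(INT32_MAX))      # payload of exactly 2**31 - 1 bytes
print(fits_in_int32_header(INT32_MAX + 1))  # one byte over the limit
```

The first call succeeds and the second fails with the same `struct.error` as above, which is consistent with the error appearing only after the per-GPU batch size was raised from 4 to 8.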

Distributed Multi-GPU Training did not decrease training time #169