hejing / instance_containize

any issues when using the bitdeer.ai

Limited 1 Gbps bandwidth is a bottleneck in distributed training and causes high latency when transferring gradient data between nodes. #7

Open jianwang-ntu opened 3 months ago

jianwang-ntu commented 3 months ago

Issue:

When running distributed training across two instances, InstanceA and InstanceB, each equipped with two GPUs, two significant bottlenecks block further progress:

  1. Bandwidth limitation: Bandwidth is capped at 1 Gbps, which congests data transfer between nodes. The following screenshot was taken when we measured the maximum bandwidth Bitdeer could provide.
image

The total estimated time for this job was initially 2.5 hours with 4 GPUs on a single node. However, with the current 1Gbps bandwidth for distributed training, the projected training time has surged to 45 hours.

image
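To put the numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes gradients are synchronized with a ring all-reduce (the issue does not say which backend or algorithm is used) and uses a hypothetical model of 1 billion fp32 parameters. Only the 2.5 hour and 45 hour figures and the 1 Gbps link speed come from the report; everything else is an illustrative assumption.

```python
def ring_allreduce_seconds(grad_bytes, num_nodes, link_gbps):
    """Lower bound for one ring all-reduce, ignoring latency and compute overlap."""
    # In a ring all-reduce each node sends (and receives) 2 * (N - 1) / N
    # times the gradient size.
    traffic_bytes = 2 * (num_nodes - 1) / num_nodes * grad_bytes
    # Convert link speed from gigabits per second to bytes per second.
    link_bytes_per_s = link_gbps * 1e9 / 8
    return traffic_bytes / link_bytes_per_s


# Figures taken from the report.
single_node_hours = 2.5
distributed_hours = 45.0
slowdown = distributed_hours / single_node_hours

# Hypothetical model size (assumption): 1e9 parameters, 4 bytes each (fp32).
grad_bytes = 1_000_000_000 * 4

sync_seconds = ring_allreduce_seconds(grad_bytes, num_nodes=2, link_gbps=1.0)

print(f"slowdown factor: {slowdown:.0f}x")
print(f"one gradient sync at 1 Gbps: {sync_seconds:.0f} s")
```

Under these assumptions a single gradient synchronization already takes tens of seconds on a 1 Gbps link, so per-step communication can easily dominate compute. The real cost depends on the actual model size, gradient compression, and how much communication overlaps with the backward pass.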
  2. Port constraints: A second challenge comes from port restrictions. It took us some time to understand why InstanceA and InstanceB could not communicate over TCP or sockets. Without knowing Bitdeer's whitelist policy, beginners may spend a long time debugging their distributed training script instead of identifying the actual cause, which is the port restriction.

Our suggestion is to lift the port restrictions for intranet traffic, for example within the private address ranges 172.16.0.0/12, 192.168.0.0/16, or 10.0.0.0/8.

image
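For anyone hitting the same port problem, a quick way to separate "my training script is broken" from "the port is blocked" is to test raw TCP reachability before launching the job. The sketch below uses only the Python standard library. The function names `can_connect` and `scan_ports` are hypothetical helpers written for this illustration and are not part of any Bitdeer tooling.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def scan_ports(host, ports, timeout=3.0):
    """Check several ports and return a {port: reachable} mapping."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

A typical use would be to start a listener on InstanceB on the port the training job will use, then call `can_connect("<InstanceB private IP>", <port>)` from InstanceA. If the listener is running but the call returns `False`, the firewall or whitelist is the likely cause rather than the training script.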