openvinotoolkit / training_extensions

Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
https://openvinotoolkit.github.io/training_extensions/
Apache License 2.0

Problems in multi-card distributed training #3635

Closed: nowbug closed this issue 1 week ago

nowbug commented 2 months ago

Distributed training blocks (hangs) during initialization.

Steps to Reproduce

1. Minimal code block:

```python
from otx.engine import Engine

engine = Engine(model="yolox_s", data_root="pwd")
engine.train(num_nodes=2)
```
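
A side note, and my own assumption rather than something stated in this thread: in Lightning-based APIs, `num_nodes` counts machines while `devices` counts GPUs per machine. If `engine.train` forwards these arguments to Lightning's `Trainer`, then `num_nodes=2` on a single two-GPU machine would leave rank 0 waiting for a second machine that never joins. A sketch of the equivalent plain-Lightning call for one machine with two GPUs:

```python
# Sketch (assumption: one machine, two GPUs). `devices` selects GPUs per
# machine; `num_nodes` counts machines, so `num_nodes=2` would make rank 0
# wait for a second machine to join the process group.
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

trainer = L.Trainer(max_epochs=10, devices=2, num_nodes=1, strategy="ddp")
trainer.fit(BoringModel())
```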

2. I ran a plain Lightning example to rule out a problem in my environment:

```python
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

ngpus = 2
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=ngpus)

trainer.fit(model)
```

Log:

```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
```

Environment:

nowbug commented 2 months ago

When I run it, it gets stuck and never proceeds past the first rank's initialization:

```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
```
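
One way to see where the initialization stalls is to turn on the standard PyTorch/NCCL debug logging before the `Trainer` is created; these environment variables belong to PyTorch and NCCL, not OTX, and setting them in the parent process propagates to the spawned ranks. A diagnostic sketch:

```python
# Diagnostic sketch: enable verbose distributed logging before training starts.
# These are standard PyTorch/NCCL environment variables, not OTX-specific.
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL transport-level logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra c10d collective checks
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"        # surface c10d C++ log lines
```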

harimkang commented 2 months ago

@eunwoosh could you take a look at this issue?

harimkang commented 2 months ago

I found some open issues related to this and commented on them.

eunwoosh commented 2 months ago

Hi @nowbug, thanks for reporting the issue. First of all, OTX 2.0 does not currently validate distributed training, so it can be somewhat unstable. Nevertheless, since OTX is built on PyTorch Lightning, distributed training should work in most cases, and OTX plans to officially support it in the near future, so it should stabilize soon.

I tested your second code snippet and found the bug @harimkang mentioned, so I opened a PR to fix it. I also found that distributed training gets stuck in some cases, and I suspect the dataset size is the cause. I'll fix that bug after investigating further.
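
To probe the dataset-size hypothesis, here is a minimal sketch (my own construction, not eunwoosh's test code) that varies the number of samples fed to the BoringModel setup from the second snippet; if the hang only appears at certain dataset sizes, that would support the suspicion:

```python
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel, RandomDataset
from torch.utils.data import DataLoader

# Try small vs. large values (e.g. 4 vs. 1024) to see whether the hang
# depends on how many samples each rank receives.
N_SAMPLES = 4

model = BoringModel()
loader = DataLoader(RandomDataset(32, N_SAMPLES), batch_size=2)

trainer = L.Trainer(max_epochs=1, devices=2, strategy="ddp")
trainer.fit(model, train_dataloaders=loader)
```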

nowbug commented 2 months ago

@eunwoosh Thank you for your response. I'm looking forward to the upcoming versions of OTX.