mindspore-lab / mindocr

A toolbox of OCR models, algorithms, and pipelines based on MindSpore
https://mindspore-lab.github.io/mindocr/
Apache License 2.0

Issue on running distributed training #687

Open ThomasLimWZ opened 2 months ago

ThomasLimWZ commented 2 months ago

Hi, I am unable to run distributed training on GPU with this command: `mpirun --allow-run-as-root -n 2 python tools/train.py --config configs/det/dbnet/db_r50_icdar15.yaml`. I know the issue lies with OpenMPI, but my PC runs Windows, and to my understanding OpenMPI is no longer supported on Windows. Do you have any advice on how to solve this?

panshaowu commented 2 months ago

@ThomasLimWZ Hello, thanks for your feedback. As far as I know, MindSpore's support for the Windows OS is incomplete. Please consider switching to Linux.

As for running distributed training tasks, you can try the dynamic cluster startup method. MindSpore provides three distributed parallel startup methods (refer to Distributed Parallel Startup Methods), two of which support GPU.
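For reference, a dynamic-cluster launch on two GPUs might look like the sketch below. The environment variable names follow MindSpore's dynamic networking documentation (`MS_WORKER_NUM`, `MS_SCHED_HOST`, `MS_SCHED_PORT`, `MS_ROLE`); the actual launch commands are shown as comments because the exact invocation depends on your environment, so treat this as a starting point rather than a tested recipe:

```shell
#!/bin/sh
# Sketch of MindSpore dynamic cluster startup for 2 GPU workers.
# Variable names follow the MindSpore dynamic networking docs; values are examples.

export MS_WORKER_NUM=2          # total number of worker processes (one per GPU here)
export MS_SCHED_HOST=127.0.0.1  # address of the scheduler process
export MS_SCHED_PORT=8118       # port the scheduler listens on

# Start the scheduler process, then one worker per GPU, e.g.:
#
#   MS_ROLE=MS_SCHED  python tools/train.py --config configs/det/dbnet/db_r50_icdar15.yaml &
#   MS_ROLE=MS_WORKER python tools/train.py --config configs/det/dbnet/db_r50_icdar15.yaml &
#   MS_ROLE=MS_WORKER python tools/train.py --config configs/det/dbnet/db_r50_icdar15.yaml &
#   wait
#
# No mpirun is involved, which is why this method works where OpenMPI is unavailable.
echo "cluster env configured: ${MS_WORKER_NUM} workers via ${MS_SCHED_HOST}:${MS_SCHED_PORT}"
```

The key point is that each process discovers its peers through the scheduler at `MS_SCHED_HOST:MS_SCHED_PORT` instead of being spawned by `mpirun`.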

ThomasLimWZ commented 2 months ago

Hi, I tried using Windows Subsystem for Linux to run this repository, which resolved the OpenMPI issue. But I'm still facing issues with both standalone training and distributed training. It returns this error message: `[mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:303] CalMemBlockAllocSize] Memory not enough: current free memory size[0] is smaller than required size[262144000]`.

Can I know the minimum hardware requirements for MindOCR? FYI, I have 24GB of RAM and only an NVIDIA 3050 Ti GPU.

panshaowu commented 2 months ago

@ThomasLimWZ As far as I know, there is currently no MindSpore API to query the required RAM or graphics memory. But I am afraid that the 4GB of graphics memory on the 3050 Ti may be insufficient for training DBNet ResNet-50 with the default configuration. You can try reducing the values of train.loader.batch_size and train.loader.num_workers in configs/det/dbnet/db_r50_icdar15.yaml. Also, you can try switching to DBNet ResNet-18.
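To illustrate, a reduced-memory edit to the loader section of `db_r50_icdar15.yaml` might look like the fragment below. The key names come from the advice above (`train.loader.batch_size`, `train.loader.num_workers`); the concrete values are illustrative guesses for a 4GB GPU, not tested settings, and the surrounding structure should be checked against the actual shipped config:

```yaml
# Illustrative fragment of configs/det/dbnet/db_r50_icdar15.yaml
train:
  loader:
    batch_size: 4    # reduce from the default to lower peak graphics-memory use
    num_workers: 2   # fewer dataloader workers also reduces host RAM pressure
```

Halving the batch size roughly halves the activation memory per step, at the cost of slower and possibly noisier training; if out-of-memory errors persist, the DBNet ResNet-18 config is the lighter fallback.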