sovit-123 / fasterrcnn-pytorch-training-pipeline

PyTorch Faster R-CNN Object Detection on Custom Dataset
MIT License

Multinode run problem #106

Closed unrue closed 10 months ago

unrue commented 1 year ago

Hi,

I'm using this tool on an HPC machine with 4 GPUs per node. This is the launch command for 2 nodes with 4 GPUs each:

```bash
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "MASTER_PORT"=$MASTER_PORT
echo "WORLD_SIZE="$WORLD_SIZE

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$(getent hosts $master_addr | awk '{ print $1 }')
echo "MASTER_ADDR="$MASTER_ADDR
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --use_env fasterrcnn-pytorch-training-pipeline/train.py --data ***/Deep_Learning/MIC_DL/mic_env_ai/tools/fiftyone_venv/conversions/for_detectron2/beni_culturali/beni_culturali_pvoc_stratified_negative_coords_translated/beni_culturali.yaml --epochs 1 --model fasterrcnn_resnet50_fpn --name beni_culturali_check --batch 1 --disable-wandb --workers 0
```

The code seems to be stuck and doing nothing. Am I doing something wrong?

```
  warnings.warn(
  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : fasterrcnn-pytorch-training-pipeline/train.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/slurm_job.1745194/torchelastic_s78szpdv/none_e_rmw4ws
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group

    agent_data = get_all(store, rank, key_prefix, world_size)
  File "****/Deep_Learning/MIC_DL/test_detectron_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout

```
sovit-123 commented 1 year ago

I see that you have passed a `--nnodes=2` argument. I am afraid that multi-node training is not supported at the moment; the code supports single-node multi-GPU training. Can you please start with single-node, 2-GPU training? That's what I have tested to date. I am trying to scale the code further, but it will take some time.

unrue commented 1 year ago

Ok, understood. The code is not multi-node, but it is multi-GPU. Why don't I see `dist.all_reduce` in the training loop? Is it somewhere else? How are gradients synchronized among the GPUs?

sovit-123 commented 1 year ago

You are right, there is a mistake. I have used `SyncBatchNorm` but forgot the `all_reduce`. I will push the corrected and updated code as soon as possible.
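For context, here is a minimal sketch (not this repository's code) of how synchronization usually fits together in a PyTorch DDP training step: `DistributedDataParallel` averages the gradients across ranks during `backward()`, so an explicit `dist.all_reduce` in the loop is typically only needed to average values that are logged, such as the losses. The tiny `Linear` model and random tensors below are placeholders; launching via `torchrun --nproc_per_node=2 demo.py` is assumed.

```python
# Sketch only: shows where gradient sync happens with DDP and where an
# explicit all_reduce is typically used (logging), assuming a torchrun launch.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by the launcher
torch.cuda.set_device(local_rank)
device = f"cuda:{local_rank}"

# Placeholder model/data; a real run would use the detection model and loader.
model = DDP(torch.nn.Linear(10, 1).to(device), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 10, device=device)
y = torch.randn(8, 1, device=device)

loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()        # DDP all-reduces (averages) gradients across GPUs here
optimizer.step()

# An explicit all_reduce is usually only for logging, e.g. averaging the loss
# (or a detection loss dict) across ranks before printing it.
logged = loss.detach()
dist.all_reduce(logged)                 # sum across ranks
logged /= dist.get_world_size()         # then average
if dist.get_rank() == 0:
    print("avg loss across ranks:", logged.item())

dist.destroy_process_group()
```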

sovit-123 commented 10 months ago

Hi, the `all_reduce` has been included in the `reduce_dict` function in `utils.py`. I hope this resolves the issue. I am closing the issue for now. Please reopen if necessary.
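For reference, a `reduce_dict` helper of this kind usually follows the torchvision detection reference utilities; the sketch below illustrates the pattern and may differ from the exact code now in `utils.py`.

```python
# Sketch of a typical reduce_dict helper (patterned after the torchvision
# detection reference); the repository's utils.py may differ in details.
import torch
import torch.distributed as dist

def reduce_dict(input_dict, average=True):
    """All-reduce the values of a dict of scalar tensors so every rank sees
    the same summed (or averaged) values, e.g. for logging the loss dict."""
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size < 2:
        return input_dict
    with torch.no_grad():
        names = sorted(input_dict.keys())   # identical key order on all ranks
        values = torch.stack([input_dict[k] for k in names])
        dist.all_reduce(values)             # sum across ranks
        if average:
            values /= world_size
        return {k: v for k, v in zip(names, values)}
```

In a training loop this is typically called as `loss_dict_reduced = reduce_dict(loss_dict)` purely for logging; the gradients themselves are still synchronized by DDP during `backward()`.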