mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0

torchpack multi-gpu error #277

Closed · YoushaaMurhij closed this issue 1 year ago

YoushaaMurhij commented 1 year ago

Hi, I am trying to train the BEVFusion detection model on a Slurm-based cluster, and I get this error when setting np > 1:

+ torchpack dist-run -np 6 python3 tools/train.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/lidar-only-det.pth
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 6
slots that were requested by the application:

  /usr/bin/python3

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.

Do you have any idea what could cause such a problem? I tried to use torchrun and torch.distributed.launch without luck. PS: I am sure that there are 6 GPUs allocated for this task.

Thanks!
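For reference, the Open MPI message above already points at two knobs (--oversubscribe and --use-hwthread-cpus). Since the error comes from Open MPI, torchpack dist-run is evidently launching its workers through mpirun, so oversubscription can also be requested through Open MPI's MCA environment variable without changing the command itself. This is only an untested sketch of that option, not a verified fix for this cluster:

```bash
# Untested sketch: allow Open MPI to oversubscribe slots, mirroring the
# --oversubscribe option suggested in the error message above.
export OMPI_MCA_rmaps_base_oversubscribe=1

torchpack dist-run -np 6 python3 tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
    --load_from pretrained/lidar-only-det.pth
```

Because the job runs under Slurm, the slot count most likely comes from the resource manager (item 3 in the list above), so requesting enough tasks in the job allocation (for example --ntasks=6 or --ntasks-per-node=6 in the sbatch script) may be a cleaner fix than oversubscribing.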

kentang-mit commented 1 year ago

Hi @YoushaaMurhij,

Thanks for your interest, but sorry, I am not an expert in Slurm. Let me also involve @zhijian-liu in the discussion to see whether he has any ideas about this.

Best, Haotian

YoushaaMurhij commented 1 year ago

Thanks for your reply, and happy holidays 🎆 I looked into torchpack for this problem and I think this issue is related. I am looking forward to your ideas! Thanks.

YoushaaMurhij commented 1 year ago

I could not solve this problem, so I used torch.distributed.launch instead of torchpack!

bbzh commented 1 year ago

> I could not solve this problem, so I used torch.distributed.launch instead of torchpack!

Hi @YoushaaMurhij, would you please share the detailed steps for changing from torchpack to torch.distributed.launch? Thanks!

Estrellama commented 1 year ago

Hi @YoushaaMurhij, I ran into the same problem as you. Would you please share the detailed steps for changing from torchpack to torch.distributed.launch? Thanks!

EpicGilgamesh commented 11 months ago

Hi! Did you guys manage to change from torchpack to torch.distributed.launch?
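For anyone still looking for the torch.distributed.launch route, here is a rough, hypothetical sketch of what the launch command could look like. It assumes tools/train.py has been adapted to call torch.distributed.init_process_group() and to read its local rank from the LOCAL_RANK environment variable instead of relying on torchpack's initialization; those code changes are not shown anywhere in this thread, so treat this as a starting point rather than a verified recipe.

```bash
# Hypothetical single-node launch on 6 GPUs, assuming tools/train.py has been
# modified to initialize torch.distributed from the environment variables the
# launcher sets (RANK, WORLD_SIZE, LOCAL_RANK; hence --use_env below).
python3 -m torch.distributed.launch --nproc_per_node=6 --use_env \
    tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
    --load_from pretrained/lidar-only-det.pth

# On recent PyTorch versions, torchrun is the recommended replacement and sets
# the same environment variables:
#   torchrun --nproc_per_node=6 tools/train.py <same arguments as above>
```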