libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

When training ImageNet, how do I train on multiple machines with the same number of GPUs? #124

Closed jeannotes closed 2 years ago

jeannotes commented 2 years ago

When training ImageNet, if we have 4 machines with the same number of GPUs each, how should we set up the training? Are there any instructions?

thanks!

GuillaumeLeclerc commented 2 years ago

FFCV replaces PyTorch's DistributedSampler. Otherwise the usage is the same. Please refer to the PyTorch documentation for more information.
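To illustrate the point about the sampler, here is a minimal sketch of how the loader might be created in a distributed run. It assumes the `distributed` keyword of `ffcv.loader.Loader`, as used in the FFCV ImageNet example; the helper names `loader_kwargs` and `make_train_loader` are hypothetical, and the default pipelines are assumed to be good enough for illustration. As far as I understand, `torch.distributed` should already be initialized before the loader is built.

```python
def loader_kwargs(batch_size, num_workers, world_size):
    """Keyword arguments for ffcv.loader.Loader (hypothetical helper).

    `distributed` is the only multi-process-specific setting: it asks FFCV
    to shard the samples across ranks, which is the job DistributedSampler
    does with a regular PyTorch DataLoader.
    """
    return {
        "batch_size": batch_size,   # per-process batch size
        "num_workers": num_workers,
        "drop_last": True,
        "distributed": world_size > 1,
    }


def make_train_loader(path, batch_size, num_workers, world_size):
    # Imported lazily so the helper above can be used without FFCV installed.
    from ffcv.loader import Loader, OrderOption

    return Loader(
        path,
        order=OrderOption.RANDOM,
        **loader_kwargs(batch_size, num_workers, world_size),
    )
```

No `DistributedSampler` is constructed by hand here; the rest of the training loop (model wrapped in `DistributedDataParallel`, one process per GPU) stays as in plain PyTorch.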

PS: We rarely find it useful to train ImageNet on more than one machine. Without a high-speed network interface you get a significant slowdown, because every gradient synchronization becomes more expensive.
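As a rough back-of-the-envelope illustration of that synchronization cost, the sketch below estimates a bandwidth-only lower bound for one gradient all-reduce between machines. The assumptions are mine, not from the thread: a ring all-reduce in which each participant sends about 2(N-1)/N times the gradient size, fp32 gradients, roughly 25.6 million parameters for ResNet-50, each machine treated as one ring participant, and no latency or overlap with the backward pass. The link speeds are arbitrary examples.

```python
def ring_allreduce_seconds(num_params, num_nodes, link_gbit_per_s, bytes_per_param=4):
    """Bandwidth-only lower bound for one ring all-reduce of the gradients."""
    payload_bytes = num_params * bytes_per_param
    # In a ring all-reduce each participant sends 2 * (N - 1) / N payloads.
    traffic_bytes = 2 * (num_nodes - 1) / num_nodes * payload_bytes
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return traffic_bytes / link_bytes_per_s


RESNET50_PARAMS = 25_600_000  # approximate parameter count (assumption)

if __name__ == "__main__":
    for gbit in (1, 10, 100):
        t = ring_allreduce_seconds(RESNET50_PARAMS, num_nodes=4, link_gbit_per_s=gbit)
        print(f"{gbit:>3} Gbit/s link: about {t * 1000:.0f} ms per synchronization")
```

Under these assumptions, four machines on a 1 Gbit/s link would spend over a second per step on communication alone, which is why the interconnect matters so much. Real numbers depend on the collective implementation and on how much communication overlaps with computation.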

jeannotes commented 2 years ago

@GuillaumeLeclerc could you give a hint? Would something like this work?

    python -m torch.distributed.launch \
        --nproc_per_node=2 \
        --nnodes=2 \
        --node_rank=1 \
        train_imagenet.py --config-file rn50_configs/rn50_88_epochs.yaml \
        --data.train_dataset=/home/star/Documents/jh/data/imagenet/imagenet_generate/train_500_0.50_90.ffcv \
        --data.val_dataset=/home/star/Documents/jh/data/imagenet/imagenet_generate/val_500_0.50_90.ffcv \
        --data.num_workers=12 --data.in_memory=1 \
        --logging.folder=/home/star/Documents/jh/logs/imagenet \
        --address = 30.1.100.134 --port = 1234 \
        --launch pytorch

thanks!

GuillaumeLeclerc commented 2 years ago

To train in a distributed fashion you need to spawn multiple processes. You can either

  1. spawn them from inside the main script
  2. use torch.distributed.launch (which acts as the main script).

Because it's easier for most people, our example uses the first option. What I suspect is happening in your case is that you are doing both 1. and 2. To make it work with option 2, you have to remove the section of our example code that spawns the sub-processes; otherwise you will end up with too many processes and they will fight for the GPUs. Moreover, a machine won't even be aware that it is supposed to collaborate with another one.
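A minimal sketch of what option 2 could look like inside the training script, assuming the launcher (`torch.distributed.launch` or `torchrun`) exports `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT` and, in recent PyTorch versions, `LOCAL_RANK`. The names `DistEnv`, `read_dist_env`, `expected_rank` and `init_distributed` are hypothetical helpers, not part of FFCV or PyTorch. The script does no spawning of its own: each launched process simply reads its identity from the environment.

```python
import os
from dataclasses import dataclass


@dataclass
class DistEnv:
    rank: int          # global rank across all machines
    local_rank: int    # index of the GPU on this machine
    world_size: int    # total number of processes
    master_addr: str
    master_port: int


def read_dist_env(env=os.environ):
    """Read the variables the launcher is assumed to have exported."""
    return DistEnv(
        rank=int(env["RANK"]),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )


def expected_rank(node_rank, nproc_per_node, local_rank):
    """Global rank the launcher assigns when every node runs the same number of processes."""
    return node_rank * nproc_per_node + local_rank


def init_distributed():
    """Join the process group created by the launcher; no mp.spawn here."""
    # Imported lazily so the helpers above work without a GPU stack.
    import torch
    import torch.distributed as dist

    cfg = read_dist_env()
    torch.cuda.set_device(cfg.local_rank)
    # "env://" makes PyTorch read MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
    dist.init_process_group(backend="nccl", init_method="env://")
    return cfg
```

With `--nnodes=2 --nproc_per_node=2`, the machine started with `--node_rank=1` would host global ranks 2 and 3, and the world size would be 4. The same launch command, with a different `--node_rank` and the same master address and port, has to be run on every machine.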

I suggest you start with a piece of code that works on your cluster without FFCV and then integrate FFCV into it. We can't make it work out of the box, as there are many cluster configurations and different ways of spawning the jobs on the individual machines.