fantasysee opened 2 years ago
You should try using `OrderOption.RANDOM`; `QUASI_RANDOM` isn't implemented for distributed training yet. Let me know if that fixes it for you, and if not we can try something else!
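For reference, a minimal sketch of a `Loader` configured that way (the dataset path, batch size, and worker count below are placeholders, and the `pipelines` argument is omitted — this is not the exact configuration from `train_imagenet.py`):

```python
from ffcv.loader import Loader, OrderOption

# RANDOM shuffles samples globally, which works with distributed=True;
# QUASI_RANDOM is not implemented for distributed loading yet.
loader = Loader(
    "/path/to/imagenet_train.beton",  # placeholder dataset path
    batch_size=512,                   # placeholder batch size
    num_workers=8,                    # placeholder worker count
    order=OrderOption.RANDOM,         # instead of OrderOption.QUASI_RANDOM
    os_cache=False,                   # dataset larger than RAM
    distributed=True,
    seed=0,                           # match the PyTorch distributed sampler
)
```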
Thank you very much! It works! Multi-GPU training is enabled when I set `dist.world_size=4`, `data.in_memory=0`, and `training.distributed=1`, and use `OrderOption.RANDOM`.
However, I found that training is unexpectedly slow when distributed training is enabled: ResNet-18 on ImageNet runs at about 31.73 s/it with batch size 512, so the estimated training time is almost the same as with the single-GPU strategy.
Could you please tell me how to speed up the training process?
Same issue! https://github.com/libffcv/ffcv/issues/268
Description
Hi @lengstrom, thanks for your wonderful work!
My goal is to train a ResNet-18 on ImageNet on my server with a multi-GPU strategy to speed up training. The server has 4 RTX 2080 Ti GPUs and 46 GB of RAM, which is not enough to load ImageNet into memory.
I have read the instructions at https://docs.ffcv.io/parameter_tuning.html (Scenario: Large scale datasets, and Scenario: Multi-GPU training (1 model, multiple GPUs)).
Right now, I can run ResNet-18 on a single card by using `os_cache=False`. However, if I use `in_memory=0` and `distributed=1` to run the provided `train_imagenet.py` as follows, the errors listed at the bottom are reported. Would you please tell me how to solve this issue?
Command
Message
```
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
=> Logging in ...
Not enough memory; try setting quasi-random ordering (OrderOption.QUASI_RANDOM) in the dataloader constructor's order argument. Full error below:
  0%|          | 0/1251 [00:01<?, ?it/s]
Exception ignored in: <function EpochIterator.__del__ at 0x7f528d4f04c0>
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 161, in __del__
    self.close()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 158, in close
    self.memory_context.__exit__(None, None, None)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 59, in __exit__
    self.executor.__exit__(*args)
AttributeError: 'ProcessCacheContext' object has no attribute 'executor'
Traceback (most recent call last):
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 510, in <module>
    ImageNetTrainer.launch_from_args()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 461, in launch_from_args
    ch.multiprocessing.spawn(cls._exec_wrapper, nprocs=world_size, join=True)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 468, in _exec_wrapper
    cls.exec(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 478, in exec
    trainer.train()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 300, in train
    train_loss = self.train_loop(epoch)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 361, in train_loop
    for ix, (images, target) in enumerate(iterator):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/loader.py", line 214, in __iter__
    return EpochIterator(self, selected_order)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 43, in __init__
    raise e
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 37, in __init__
    self.memory_context.__enter__()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 32, in __enter__
    self.memory = np.zeros((self.schedule.num_slots, self.page_size),
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 229. GiB for an array with shape (29251, 8388608) and data type uint8
```
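The numbers in the final error line match the process cache's accounting: the traceback shows it allocating `np.zeros((num_slots, page_size))`, i.e. one byte per element, and 29251 slots of 8 MiB pages works out to about 229 GiB — far beyond 46 GB of RAM:

```python
# Reproduce the allocation size from the error message:
# shape (29251, 8388608), dtype uint8 -> one byte per element.
num_slots = 29251
page_size = 8 * 1024 * 1024  # 8388608 bytes = 8 MiB per page

total_bytes = num_slots * page_size
gib = total_bytes / 2**30
print(f"{gib:.0f} GiB")  # -> 229 GiB
```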