fantasysee opened 2 years ago
You should try using `OrderOption.RANDOM`; `QUASI_RANDOM` isn't implemented for distributed training yet. Let me know if that fixes it for you, and if not we can try something else!
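For reference, a minimal sketch of a `Loader` configured that way (the dataset path, batch size, and worker count below are placeholders, and the `pipelines` argument is omitted — this is not the exact configuration from `train_imagenet.py`):

```python
from ffcv.loader import Loader, OrderOption

# RANDOM shuffles samples globally, which works with distributed=True;
# QUASI_RANDOM is not implemented for distributed loading yet.
loader = Loader(
    "/path/to/imagenet_train.beton",  # placeholder dataset path
    batch_size=512,                   # placeholder batch size
    num_workers=8,                    # placeholder worker count
    order=OrderOption.RANDOM,         # instead of OrderOption.QUASI_RANDOM
    os_cache=False,                   # dataset larger than RAM
    distributed=True,
    seed=0,                           # match the PyTorch distributed sampler
)
```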
Thank you very much! It works! Multi-GPU training is enabled when I set `dist.world_size=4`, `data.in_memory=0`, and `training.distributed=1`, and use `OrderOption.RANDOM`.
However, I found that training is unexpectedly slow when distributed training is enabled: ResNet-18 on ImageNet runs at about 31.73 s/it with batch size 512, so the estimated training time is almost the same as with the single-GPU strategy.
Could you please tell me how to speed up the training process?
Same issue! https://github.com/libffcv/ffcv/issues/268
Description
Hi @lengstrom, thanks for your wonderful work!
My goal is to train a ResNet-18 on ImageNet on my server with a multi-GPU strategy to speed up training. The server has 4 RTX 2080 Ti GPUs and 46 GB of RAM, which is not enough to load ImageNet into memory.
I have read the instructions at https://docs.ffcv.io/parameter_tuning.html (Scenario: Large scale datasets, and Scenario: Multi-GPU training (1 model, multiple GPUs)).
Right now, I can run ResNet-18 on a single card by using `os_cache=False`. However, if I use `in_memory=0` and `distributed=1` to run the provided `train_imagenet.py` as follows, the errors listed at the bottom are reported. Would you please tell me how to solve this issue?
Command
Message
```
Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.
=> Logging in ...
Not enough memory; try setting quasi-random ordering (OrderOption.QUASI_RANDOM) in the dataloader constructor's order argument. Full error below:
  0%|          | 0/1251 [00:01<?, ?it/s]
Exception ignored in: <function EpochIterator.__del__ at 0x7f528d4f04c0>
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 161, in __del__
    self.close()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 158, in close
    self.memory_context.__exit__(None, None, None)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 59, in __exit__
    self.executor.__exit__(*args)
AttributeError: 'ProcessCacheContext' object has no attribute 'executor'
Traceback (most recent call last):
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 510, in <module>
    ImageNetTrainer.launch_from_args()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 461, in launch_from_args
    ch.multiprocessing.spawn(cls._exec_wrapper, nprocs=world_size, join=True)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 468, in _exec_wrapper
    cls.exec(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 478, in exec
    trainer.train()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 300, in train
    train_loss = self.train_loop(epoch)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 361, in train_loop
    for ix, (images, target) in enumerate(iterator):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/loader.py", line 214, in __iter__
    return EpochIterator(self, selected_order)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 43, in __init__
    raise e
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 37, in __init__
    self.memory_context.__enter__()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 32, in __enter__
    self.memory = np.zeros((self.schedule.num_slots, self.page_size),
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 229. GiB for an array with shape (29251, 8388608) and data type uint8
```
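The numbers in the final error line match the process cache's accounting: the traceback shows it allocating `np.zeros((num_slots, page_size))`, i.e. one byte per element, and 29251 slots of 8 MiB pages works out to about 229 GiB — far beyond 46 GB of RAM:

```python
# Reproduce the allocation size from the error message:
# shape (29251, 8388608), dtype uint8 -> one byte per element.
num_slots = 29251
page_size = 8 * 1024 * 1024  # 8388608 bytes = 8 MiB per page

total_bytes = num_slots * page_size
gib = total_bytes / 2**30
print(f"{gib:.0f} GiB")  # -> 229 GiB
```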