Closed mymusise closed 8 months ago
For DDP training, we need to use model.module to obtain the unwrapped model.
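A minimal sketch of that unwrapping pattern, usable when saving a checkpoint. Note that `unwrap_model` is a hypothetical helper name, not something from the repo:

```python
def unwrap_model(model):
    """Return the underlying model whether or not it is wrapped.

    DistributedDataParallel (and similar wrappers) expose the original
    model as `.module`; a bare model has no such attribute, so checking
    with hasattr handles both cases when saving a checkpoint.
    """
    return model.module if hasattr(model, "module") else model
```

With this, something like `torch.save(unwrap_model(unet).state_dict(), path)` works the same whether DDP is enabled or commented out, avoiding the `has no attribute 'module'` error.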
@guoqincode You're correct. I had DDP commented out while running, which led to the error above. I tried running on a single 48GB GPU, but adding the DDP wrapper invariably causes an OOM error. I'm relatively new to distributed training, so I'd appreciate any insight into why this happens.
Wrapping a model in DistributedDataParallel (DDP) increases memory usage even on a single GPU: DDP allocates extra gradient buckets used to synchronize gradients across processes, and its setup (parameter broadcasts, reducer state) adds further overhead on top of the bare model. This is a trade-off of a design optimized for multi-GPU and distributed environments; in a single-GPU run the wrapper brings the cost without the benefit.
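As a rough sketch of what the wrapper looks like (and one knob that trims the extra gradient-buffer cost), here is a single-process setup using the `gloo` backend. The address/port values are placeholders, and `gradient_as_bucket_view=True` is an optional PyTorch flag, not something the original post used:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_model():
    # Single-process "distributed" setup, just to illustrate the wrapper.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(8, 8)
    # gradient_as_bucket_view=True lets .grad tensors alias the
    # communication buckets instead of keeping separate copies,
    # reducing part of DDP's extra gradient memory.
    ddp = DDP(model, gradient_as_bucket_view=True)
    # The original model stays reachable for checkpointing:
    assert ddp.module is model
    dist.destroy_process_group()
    return ddp

if __name__ == "__main__":
    build_ddp_model()
```

Even with such options, DDP on a single GPU still carries reducer overhead, which may be enough to tip a 48GB card into OOM on a large UNet.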
Fix `UNet2DConditionModel object has no attribute 'module'` error when saving checkpoint