Closed mymusise closed 8 months ago
For DDP training, we need to use model.module to obtain the unwrapped model.
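A minimal sketch of that unwrapping pattern, usable when saving a checkpoint. Note that `unwrap_model` is a hypothetical helper name, not something from the repo:

```python
def unwrap_model(model):
    """Return the underlying model whether or not it is wrapped.

    DistributedDataParallel (and similar wrappers) expose the original
    model as `.module`; a bare model has no such attribute, so checking
    with hasattr handles both cases when saving a checkpoint.
    """
    return model.module if hasattr(model, "module") else model
```

With this, something like `torch.save(unwrap_model(unet).state_dict(), path)` works the same whether DDP is enabled or commented out, avoiding the `has no attribute 'module'` error.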
@guoqincode You're correct. I had DDP commented out while running, which led to the error above. I tried running on a single 48GB GPU, but adding the DDP wrapper invariably causes an OOM error. I'm relatively new to distributed training, so I'd appreciate any insight into why this happens.
Wrapping a model in DistributedDataParallel (DDP) increases memory usage even on a single GPU: DDP allocates extra gradient buckets used to synchronize gradients across processes, and its setup (parameter broadcasts, reducer state) adds further overhead on top of the bare model. This is a trade-off of a design optimized for multi-GPU and distributed environments; in a single-GPU run the wrapper brings the cost without the benefit.
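As a rough sketch of what the wrapper looks like (and one knob that trims the extra gradient-buffer cost), here is a single-process setup using the `gloo` backend. The address/port values are placeholders, and `gradient_as_bucket_view=True` is an optional PyTorch flag, not something the original post used:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_model():
    # Single-process "distributed" setup, just to illustrate the wrapper.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(8, 8)
    # gradient_as_bucket_view=True lets .grad tensors alias the
    # communication buckets instead of keeping separate copies,
    # reducing part of DDP's extra gradient memory.
    ddp = DDP(model, gradient_as_bucket_view=True)
    # The original model stays reachable for checkpointing:
    assert ddp.module is model
    dist.destroy_process_group()
    return ddp

if __name__ == "__main__":
    build_ddp_model()
```

Even with such options, DDP on a single GPU still carries reducer overhead, which may be enough to tip a 48GB card into OOM on a large UNet.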
Fix `UNet2DConditionModel object has no attribute 'module'` error when saving checkpoint