huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Big Models, move model to CPU after dispatching to multiple devices #2840

Closed · balaabhijit closed this 3 weeks ago

balaabhijit commented 3 weeks ago

System Info

- `Accelerate` version: 0.27.2
- Platform: Linux-5.15.0-76-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.0+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 503.74 GB
- GPU type: NVIDIA A100 80GB PCIe
- `Accelerate` default config:
    Not found

Reproduction

PR #1790 seems to add a warning when trying to move a model that has been dispatched across multiple devices. Is there a way to move the model back to CPU once it has been dispatched? I even tried saving the model, deleting it, and reloading it, but deleting the model in Python with `del model` (even with GC and a CUDA cache clear) doesn't seem to free GPU memory. Is there any way to achieve this?
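A minimal sketch of the attempt, for reference (the checkpoint name is illustrative, not the one from the original pipeline):

```python
import gc

import torch
from transformers import AutoModelForCausalLM

# Loading with a device map makes accelerate dispatch the model across GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", device_map="auto"
)

# model.to("cpu")  # only emits the warning from #1790 for dispatched models

# Deleting the model and clearing caches still leaves GPU memory allocated
# in the scenario described above.
del model
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())  # expected 0, but stays non-zero here
```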

Expected behavior

Move model to CPU after dispatching to multiple GPUs

SunMarc commented 3 weeks ago

Hi @balaabhijit, thanks for reporting! What would be the use case for this? Are you modifying the model? To answer your question, you can remove the hooks by calling `remove_hook_from_module(model, recurse=True)`; after that you should be able to move the model to the device of your choice. As for the warning, I will create a PR so that it is removed when `remove_hook_from_module` is called. LMK if this works!
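For reference, a minimal sketch of the suggested sequence, assuming `model` was dispatched across GPUs (e.g. loaded with `device_map="auto"`):

```python
from accelerate.hooks import remove_hook_from_module

# Strip accelerate's AlignDevicesHook (and any other hooks) from every
# submodule so that .to() behaves like on a regular nn.Module again.
remove_hook_from_module(model, recurse=True)

# The model can now be moved back to a single device.
model = model.to("cpu")
```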

balaabhijit commented 3 weeks ago

@SunMarc Thanks for the prompt reply. This works!

The use case is quite specific: we are quantizing the model as part of a pipeline. The previous step requires the model to be distributed, and the quantization step (AWQ) runs layer by layer to save GPU memory, so I had to move the model back and forth between devices, roughly as in the sketch below.
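Here `run_distributed_step` and `quantize_awq_layer_by_layer` are hypothetical stand-ins for the pipeline steps, and `device_map` is assumed to have been computed earlier:

```python
from accelerate import dispatch_model
from accelerate.hooks import remove_hook_from_module

# Step 1: shard the model across the available GPUs for the distributed step.
model = dispatch_model(model, device_map=device_map)
run_distributed_step(model)  # hypothetical pipeline step

# Step 2: strip the dispatch hooks and gather the model back on CPU so AWQ
# can quantize it layer by layer, moving one layer at a time onto the GPU.
remove_hook_from_module(model, recurse=True)
model = model.to("cpu")
quantize_awq_layer_by_layer(model)  # hypothetical AWQ step
```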

SunMarc commented 3 weeks ago

Sounds good! I'm closing this issue since it's solved!