huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Move to cpu takes extra memory usage after .gather() #2898

Closed: xinghaow99 closed this issue 3 days ago

xinghaow99 commented 5 days ago

System Info

- `Accelerate` version: 0.30.1
- Platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.31
- `accelerate` bash location: /home/jovyan/conda-env/envs/gloq/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1878.24 GB
- GPU type: NVIDIA H800
- `Accelerate` default config:
        Not found

Reproduction

Hi! I'm running multi-GPU inference on a specific layer of a model and trying to save its inputs/outputs. To save GPU memory, I only move the tensors to the GPUs when computing. Here is my code:

layer = layer.to(accelerator.device)
layer_outputs = []
for i in tqdm(range(0, len(layer_inputs), batch_size * accelerator.num_processes)):
    with accelerator.split_between_processes(layer_inputs[i: i + batch_size * accelerator.num_processes]) as sharded_inputs:
        batch_inputs = torch.stack(sharded_inputs).to(accelerator.device)
        batch_outputs = layer(batch_inputs)
    batch_outputs = accelerator.gather(batch_outputs)
    batch_outputs = batch_outputs.to('cpu')
    for j in batch_outputs:
        layer_outputs.append(j)
layer = layer.to('cpu')

The issue is that when batch_outputs = batch_outputs.to('cpu') runs, it takes accelerator.num_processes times more memory than expected.
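For reference, annotating the last two lines of the loop with what (I believe) each rank sees:

# Assuming N = accelerator.num_processes; shapes are illustrative.
batch_outputs = accelerator.gather(batch_outputs)  # full batch [batch_size * N, ...], identical on every rank
batch_outputs = batch_outputs.to('cpu')            # each of the N ranks copies that same full batch into host memory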

Expected behavior

I want to save the gathered inference outputs into the layer_outputs list only once. The memory usage should be the same as for layer_inputs.

BenjaminBossan commented 5 days ago

I'm not sure I fully understand your issue, but would it help to run this only on one process? I.e.:

if accelerator.is_main_process:
    # your code
xinghaow99 commented 5 days ago

@BenjaminBossan Thank you for your reply.

I tried:

if accelerator.is_main_process:
    batch_outputs = batch_outputs.to('cpu')
    for j in batch_outputs:
        layer_outputs.append(j)

But the whole process hangs, probably because the layer_outputs list (which should end up as a list of CPU tensors) stays empty on every process except the main one.

BenjaminBossan commented 4 days ago

Hmm, not sure why it would hang in that case. Hopefully someone else has a good idea here.

xinghaow99 commented 3 days ago

I managed to solve my problem with native pytorch.distributed operations. Thanks for the help!

muellerzr commented 1 day ago

It hangs because we need to declare the catch-up point, i.e.:

if accelerator.is_main_process:
    batch_outputs = batch_outputs.to('cpu')
    for j in batch_outputs:
        layer_outputs.append(j)
accelerator.wait_for_everyone()
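
Putting that together with the loop from the issue, it would look roughly like this (untested sketch, reusing the names from the original snippet):

for i in tqdm(range(0, len(layer_inputs), batch_size * accelerator.num_processes)):
    with accelerator.split_between_processes(layer_inputs[i: i + batch_size * accelerator.num_processes]) as sharded_inputs:
        batch_inputs = torch.stack(sharded_inputs).to(accelerator.device)
        batch_outputs = layer(batch_inputs)
    # gather() is a collective op, so every rank has to reach it each iteration
    batch_outputs = accelerator.gather(batch_outputs)
    if accelerator.is_main_process:
        # only rank 0 materializes the full batch in CPU memory
        for j in batch_outputs.to('cpu'):
            layer_outputs.append(j)
    # catch-up point: keep the other ranks in step with the main process
    accelerator.wait_for_everyone()
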
muellerzr commented 1 day ago

Would love to know your solution though, @xinghaow99, and I can translate it over to what should work in Accelerate :)

xinghaow99 commented 1 day ago

Hi @muellerzr .

I used torch.distributed.scatter_object_list() instead of accelerator.split_between_processes(), and I refactored my script to avoid using gather(). I'm new to Accelerate and not sure if the problem is with Accelerate.
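
Roughly, the scatter step looks something like this (a simplified sketch, not my exact script; scatter_batch is just an illustrative name):

import torch.distributed as dist

def scatter_batch(layer_inputs, start, batch_size):
    """Return this rank's shard of layer_inputs[start : start + batch_size * world_size]."""
    world_size = dist.get_world_size()
    if dist.get_rank() == 0:
        chunk = layer_inputs[start: start + batch_size * world_size]
        # one sub-list of CPU tensors per process
        shards = [chunk[r * batch_size:(r + 1) * batch_size] for r in range(world_size)]
    else:
        shards = None  # only the src rank provides inputs
    received = [None]  # filled in place by scatter_object_list
    dist.scatter_object_list(received, shards, src=0)
    return received[0]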

I suspect the problem arises from gather(): since it syncs the object across all devices, the memory consumption on each device is increased by 8x (say there are 8 processes). After calling .cpu(), those copies on the different devices each land in separate CPU memory, so they take up 8x the space they should. Maybe this can be avoided with torch.distributed.gather_object()? (I'm not sure.)
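
For example, something along these lines would only materialize the full batch on rank 0 (again a sketch, I have not benchmarked it):

import torch
import torch.distributed as dist

def gather_outputs_to_rank0(batch_outputs: torch.Tensor):
    """Collect every rank's outputs as a list of CPU tensors on rank 0; returns [] elsewhere."""
    world_size = dist.get_world_size()
    local = batch_outputs.to('cpu')  # move the local shard off the GPU first
    gathered = [None] * world_size if dist.get_rank() == 0 else None
    dist.gather_object(local, gathered, dst=0)
    if dist.get_rank() == 0:
        # each element of `gathered` is one rank's [local_batch, ...] tensor
        return [row for shard in gathered for row in shard]
    return []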

I hope this is informative to you.