huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Strange GPU MEM Occupation on GPU0 when using torchrun #2193

Open ma787639046 opened 4 hours ago

ma787639046 commented 4 hours ago

System Info

python==3.11.8 peft==0.13.2 accelerate==0.34.2 transformers==4.44.2

Who can help?

@BenjaminBossan @sayakpaul

Reproduction

When I load a PEFT model with torchrun, ranks 1~7 also occupy GPU memory on GPU0.

Reproduction Script test_peft.py:

import os
from peft import AutoPeftModelForCausalLM

if __name__ == '__main__':
    # torchrun sets the RANK environment variable for each spawned process
    rank: int = int(os.getenv("RANK"))
    # Place the LoRA model (base model + adapter) entirely on this rank's GPU
    model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora", device_map=rank)

    print(f"{rank} Finished loading.")
    end = input()   # Pause so GPU memory can be inspected with nvidia-smi

Execute torchrun --nproc_per_node 8 test_peft.py. The GPU memory usage is as follows:

[Screenshot 2024-11-01 16:49:39: GPU memory usage across GPUs]

Expected behavior

It seems that ranks 1~7 also create CUDA contexts on GPU0, which adds extra GPU memory usage there. This is not the expected behavior.

[Screenshot 2024-11-01 16:51:22: GPU memory usage across GPUs]
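
For reference, a small diagnostic sketch (assuming the pynvml package is installed; this is not part of the original report) that lists which processes hold memory on GPU 0:

import pynvml

# List the processes that currently hold memory on GPU 0
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # usedGpuMemory is reported in bytes and may be None on some driver setups
    print(f"pid={proc.pid} usedGpuMemory={proc.usedGpuMemory}")
pynvml.nvmlShutdown()

If all eight torchrun PIDs show up here, that would confirm that every rank has created a context on GPU 0.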
BenjaminBossan commented 3 hours ago

I'm not exactly sure what the expectation would be here: Do you expect all ranks to use the same amount of memory or that only rank 0 uses memory? I tried your script on 2 GPUs with and without PEFT, i.e. just using model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map=rank), and for me the end result is very similar.
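
For reference, the non-PEFT variant is just the same script with the loading line swapped, roughly:

import os
from transformers import AutoModelForCausalLM

if __name__ == '__main__':
    rank: int = int(os.getenv("RANK"))
    # Load the plain base model (no LoRA adapter) onto this rank's GPU
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map=rank)

    print(f"{rank} Finished loading.")
    end = input()   # Pause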

ma787639046 commented 2 hours ago

Hi @BenjaminBossan, thanks for your quick response.

1) I expect each rank to load its own copy of the model and end up using the same amount of memory.
2) facebook/opt-350m is not a LoRA model. The issue only occurs when loading a LoRA model, i.e. one with a LoRA weight file adapter_model.safetensors and a LoRA config adapter_config.json, whose base model is loaded automatically from "base_model_name_or_path" in adapter_config.json (see the sketch below).
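
For context, a minimal sketch of how the base model reference can be inspected without loading any weights:

from peft import PeftConfig

# The adapter repo only ships adapter_model.safetensors + adapter_config.json;
# the base model is resolved from the "base_model_name_or_path" field.
config = PeftConfig.from_pretrained("ybelkada/opt-350m-lora")
print(config.base_model_name_or_path)  # "facebook/opt-350m"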

BenjaminBossan commented 2 hours ago

For me, it's the case that each process acquires approximately the same amount of memory.

The reason I mentioned facebook/opt-350m is that it is the base model of the model you're using and, as you stated, it does not use LoRA. Still, the memory distribution is the same for me. Please check whether that's different for you. If you also find that the behavior is the same without LoRA, that tells us that LoRA is not the issue here.

You're probably encountering a different problem, and the snippet you showed is an attempt to simplify it. But maybe it is simplified too much. Can you report your initial problem? Are you trying to run training or inference with a LoRA model and DDP?

ma787639046 commented 1 hour ago

@BenjaminBossan Hi, thanks for the reply.

Yes, I was trying to provide a minimal script that reproduces the issue I encountered. I'm developing a distributed inference pipeline that spawns processes with torchrun; each rank loads the LoRA model on its own and runs inference.

I have a simpler script that reproduces the issue, test_lora.py:

from peft import AutoPeftModelForCausalLM

if __name__ == '__main__':
    # Hard-code the target device to GPU 1; GPU 0 should not be touched at all
    rank: int = 1
    model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora", device_map=rank)

    print(f"{rank} Finished loading.")
    end = input()   # Pause so GPU memory can be inspected with nvidia-smi

This script loads a LoRA model directly onto GPU 1; GPU 0 is not used for anything. However, when I try it in a 2-GPU environment, GPU 0 shows 1100 MiB of GPU memory usage.

[Screenshot 2024-11-01 20:15:10: GPU memory usage across GPUs]

However, when I switch to loading a base model (not a LoRA model), e.g. facebook/opt-350m, with transformers' AutoModel.from_pretrained(), this issue does not happen. It only appears when loading a LoRA model.
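
In case it helps, one workaround worth trying (untested here, just a sketch) is to restrict each process to its own GPU via CUDA_VISIBLE_DEVICES before anything initializes CUDA, so that an implicit cuda:0 context cannot end up on physical GPU 0:

import os

# Workaround sketch (not verified to fix this issue): expose only this rank's GPU
# before torch/peft can initialize CUDA, so any implicit use of device 0 maps to
# the intended physical GPU instead of physical GPU 0.
rank: int = int(os.getenv("RANK", "0"))
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

from peft import AutoPeftModelForCausalLM

if __name__ == '__main__':
    # With a single visible device, device index 0 now refers to this rank's GPU
    model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora", device_map=0)
    print(f"{rank} Finished loading.")
    end = input()   # Pause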