Open ma787639046 opened 4 hours ago
I'm not exactly sure what the expectation would be here: Do you expect all ranks to use the same amount of memory, or that only rank 0 uses memory? I tried your script on 2 GPUs with and without PEFT, i.e. just using
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map=rank)
and for me the end result is very similar.
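For reference, the no-PEFT comparison described here could look roughly like this. This is only a minimal sketch: the file name test_no_peft.py and the use of the LOCAL_RANK environment variable are assumptions, and the script would be launched with torchrun --nproc_per_node 2.

# Hypothetical test_no_peft.py, a sketch of the comparison without PEFT.
# Launch with: torchrun --nproc_per_node 2 test_no_peft.py
import os

from transformers import AutoModelForCausalLM

if __name__ == "__main__":
    # torchrun sets LOCAL_RANK for each spawned process
    rank = int(os.environ["LOCAL_RANK"])
    # Place the whole base model on this rank's GPU; no PEFT involved
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map=rank)
    print(f"{rank} Finished loading.")
    input()  # pause so GPU memory can be inspected, e.g. with nvidia-smi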
Hi @BenjaminBossan, thanks for your quick response.
1) I'm expecting that all ranks load their own model and end up using the same amount of memory.
2) facebook/opt-350m does not seem to be a LoRA model. The issue only exists when loading a LoRA model (with a LoRA weight adapter_model.safetensors and a LoRA config adapter_config.json, where the base model is loaded automatically from "base_model_name_or_path" in adapter_config.json).
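For illustration, which base model an adapter points to can be checked like this (a minimal sketch using PeftConfig):

# Minimal sketch: inspect which base model a LoRA adapter refers to.
from peft import PeftConfig

config = PeftConfig.from_pretrained("ybelkada/opt-350m-lora")
# AutoPeftModelForCausalLM uses this field to load the base model automatically.
print(config.base_model_name_or_path)  # facebook/opt-350m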
For me, it's the case that each process acquires approximately the same amount of memory.
The reason why I mentioned facebook/opt-350m is that it is the base model of the model you're using and, as you stated, it does not use LoRA. Still, the memory distribution is the same for me. Please check if that's different for you. If you also find that the behavior is the same without LoRA, that tells us that LoRA is not the issue here.
You're probably encountering a different problem, and the snippet you showed is an attempt to simplify it. But maybe it is simplified too much. Can you report your initial problem? Are you trying to run training or inference with a LoRA model and DDP?
@BenjaminBossan Hi, thanks for the reply.
Yes, here I'm trying to provide a minimal script to reproduce the issue I ran into. I'm developing a distributed inference pipeline. It spawns processes with torchrun, and each rank loads the LoRA model on its own and runs inference.
I have a simpler script to reproduce it:
In file test_lora.py:
import os
from peft import AutoPeftModelForCausalLM

if __name__ == '__main__':
    rank: int = 1
    model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora", device_map=rank)
    print(f"{rank} Finished loading.")
    end = input()  # Pause
This script tries to load a LoRA model directly onto GPU 1 (rank 1); rank 0 does not load anything. However, when I run it in a 2-GPU environment, GPU 0 shows 1100 MiB of GPU memory usage.
However, when I switch to loading a base model (not a LoRA model), e.g. facebook/opt-350m, with AutoModel.from_pretrained() from transformers, this issue does not happen. The issue only appears when loading a LoRA model.
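For comparison, the base-model-only variant that, according to this report, does not show the extra memory on GPU 0 would look roughly like this. This is a minimal sketch; AutoModelForCausalLM is used here in place of AutoModel, which should behave the same for device placement.

# Minimal comparison sketch: load only the base model (no LoRA) onto GPU 1.
from transformers import AutoModelForCausalLM

if __name__ == "__main__":
    rank: int = 1
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map=rank)
    print(f"{rank} Finished loading.")
    input()  # pause so GPU memory can be inspected, e.g. with nvidia-smi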
System Info
python==3.11.8 peft==0.13.2 accelerate==0.34.2 transformers==4.44.2
Who can help?
@BenjaminBossan @sayakpaul
Reproduction
When I load a PEFT model with torchrun, ranks 1~7 also occupy GPU memory on GPU 0.
Reproduction script test_peft.py: (script not reproduced here)
Execute torchrun --nproc_per_node 8 test_peft.py. The GPU memory usage is as follows: (memory readout not reproduced here)
Expected behavior
It seems that ranks 1~7 also create CUDA contexts on GPU 0, with additional GPU memory usage there. However, this is not the expected behavior.
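The actual test_peft.py is not shown in this thread. A hypothetical sketch of the per-rank LoRA load described in the Reproduction section, assuming it mirrors test_lora.py but reads the rank from torchrun, might look like this (the model id and the use of LOCAL_RANK are assumptions):

# Hypothetical sketch, not the original test_peft.py.
# Launch with: torchrun --nproc_per_node 8 test_peft.py
import os

from peft import AutoPeftModelForCausalLM

if __name__ == "__main__":
    # torchrun sets LOCAL_RANK for each spawned process
    rank = int(os.environ["LOCAL_RANK"])
    # Each rank loads the LoRA adapter plus its base model onto its own GPU
    model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora", device_map=rank)
    print(f"{rank} Finished loading.")
    input()  # pause so per-rank GPU memory can be inspected, e.g. with nvidia-smi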