Hello @chenmingjiongjiong
What is the VRAM of your GPU?
Can you alternatively try device_map={'':torch.cuda.current_device()}?
can you alternatively try device_map={'':torch.cuda.current_device()}?
This solved my problem. Thanks!
Then I got another error about bitsandbytes; I have submitted an issue in their repo.
Wow, this is interesting! Could you explain why this trick works?
Sure @beyondguo
Per my understanding, and if I got it right, it should be very simple. device_map={"":0} simply means "try to fit the entire model on device 0" - device 0 in this case would be GPU-0.
In a distributed setting, torch.cuda.current_device() should return the current device the process is working on. If you have 4 GPUs and are running DDP with 4 processes, each process should be working on an independent GPU, meaning that if each process loads the model with device_map={"":i}, process i will try to fit the entire model on GPU i. This leads to properly having n working processes that each hold a replica of the model.
I remember I had some issues while using torch.cuda.current_device(), so now I advise users to use accelerate instead and retrieve the current process index with the following trick:
from accelerate import Accelerator
dummy_accelerator = Accelerator()
current_device = dummy_accelerator.process_index  # one process per GPU under DDP, so this is the GPU to use
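To make the trick concrete, here is a minimal sketch of passing that index to from_pretrained, continuing from the snippet above (the checkpoint name is only a placeholder, not something from this thread):

from transformers import AutoModelForCausalLM

# Each process loads its own full replica of the model onto its own GPU,
# which is exactly what DDP expects.
model = AutoModelForCausalLM.from_pretrained(
    "your-model-checkpoint",            # placeholder checkpoint id
    device_map={"": current_device},    # "" means "the whole model" -> this process's GPU
)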
Let me know if anything is unclear
Thanks @younesbelkada! Now I'm using LoRA to tune an LLM (ChatGLM-6B) on 2 * A800 80G GPUs. I've got some findings that really confuse me.
The first problem:
device_map="auto", to my understanding, means enabling model parallelism (MP), which puts the model layers onto different devices; thus, during training, only one GPU is computing at a time. model.is_parallelizable=False means I don't want MP. However, if I set both device_map="auto" and model.is_parallelizable=False, model parallelism is still activated. I think model.is_parallelizable=False should block model parallelism.
The second problem:
With device_map={'':torch.cuda.current_device()}, the model is copied to both GPUs. However, I found this method consumes nearly the same GPU memory per GPU as the first method. Why? I thought the first method, which splits the model across the GPUs, should only consume about half the memory per GPU compared with this one.
One more thing: when using device_map="auto", the batch size is halved compared with device_map={'':torch.cuda.current_device()}; however, it is even 1.5x faster! Could you please explain why this happens? Many thanks!
Hi @beyondguo, thanks for looping back.
1- Yes, setting device_map="auto" means that you want Model Parallelism, i.e. putting the model layers onto different GPUs, so only one GPU at a time will be used.
2- I think in the latest versions of transformers this argument (model.is_parallelizable) is not needed anymore.
Regarding the second problem, I think this is expected: if you run things correctly and you have a copy of the model on 2 GPUs, you will also have 2 copies of the optimizer states, and the input data will also be split across both processes.
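As a rough way to check this yourself (a sketch only, not part of the original scripts), each process can print its own allocated memory after loading the model or after a training step:

import torch
from accelerate import Accelerator

accelerator = Accelerator()
# Each DDP process reports the memory on its own GPU, so you can compare the replicas.
gb = torch.cuda.memory_allocated(accelerator.device) / 1024**3
print(f"process {accelerator.process_index}: {gb:.1f} GB allocated")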
Thanks for your detailed reply! @younesbelkada
To my understanding, when using device_map="auto", only a subset of all layers is allocated to each GPU, which should lead to lower GPU memory consumption. However, it consumes nearly the same GPU memory as setting device_map={'':torch.cuda.current_device()}.
I see, thanks for your reply! Can you provide more details (how many GBs allocated, which model, etc.)? Thanks!
Sure.
Model: ChatGLM-6B
Device: 4 * A800-80G
70 GBs allocated for each GPU.
The code I'm using is https://github.com/beyondguo/LLM-Tuning/blob/796384e837b3b6d70564d50ef5bb46f9175cb700/chatglm_lora_tuning.py#L87
Thanks for sharing those
Model: ChatGLM-6B
I see the model is running in full precision; a 6B model would require about 24GB of VRAM just to be loaded on the GPU (6 billion parameters x 4 bytes per fp32 parameter ≈ 24 GB).
70 GBs allocated for each GPU.
Do you run your script using torch.distributed.run or just python yourscript.py?
Simply python yourscript.py. I'm using Trainer, which I think should automatically manage the GPU allocation.
I see better now. If you want to benefit from data parallelism as mentioned here: https://github.com/huggingface/transformers/issues/21736#issuecomment-1595699638 or in the original message from the author, you need 2 things:
1- Run accelerate config --> select multi-GPU, then run your script with accelerate launch yourscript.py.
2- To make sure that only the main process saves the model, add a simple check around model.save_pretrained and do something like this instead:
if trainer.accelerator.is_main_process:
    model.save_pretrained(training_args.output_dir)
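Put together, the end of yourscript.py could then look roughly like this (a sketch only; trainer, model and training_args are whatever your script already defines):

trainer.train()

# Only the main process writes the final checkpoint; otherwise every DDP
# process launched by accelerate would try to save at the same time.
if trainer.accelerator.is_main_process:
    model.save_pretrained(training_args.output_dir)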
Thanks! I will try these later.
Hi @younesbelkada
Sorry to bother you again. I'm still working on the "device_map" thing... I'm curious how transformers automatically allocates the layers to different GPUs.
When I load the ChatGLM-6B model using device_map="auto", I see the layers are allocated as:
{'transformer.word_embeddings': 0,
'lm_head': 0, <-----
'transformer.layers.0': 0,
'transformer.layers.1': 0,
'transformer.layers.2': 0,
'transformer.layers.3': 0,
'transformer.layers.4': 0,
'transformer.layers.5': 1,
'transformer.layers.6': 1,
'transformer.layers.7': 1,
'transformer.layers.8': 1,
'transformer.layers.9': 1,
'transformer.layers.10': 1,
'transformer.layers.11': 1,
'transformer.layers.12': 1,
'transformer.layers.13': 1,
'transformer.layers.14': 2,
'transformer.layers.15': 2,
'transformer.layers.16': 2,
'transformer.layers.17': 2,
'transformer.layers.18': 2,
'transformer.layers.19': 2,
'transformer.layers.20': 2,
'transformer.layers.21': 2,
'transformer.layers.22': 2,
...
'transformer.layers.24': 3,
'transformer.layers.25': 3,
'transformer.layers.26': 3,
'transformer.layers.27': 3,
'transformer.final_layernorm': 3}
And when I change the model to ChatGLM2-6B, the allocation is:
{'transformer.embedding': 0,
'transformer.rotary_pos_emb': 0,
'transformer.encoder.layers.0': 0,
'transformer.encoder.layers.1': 0,
'transformer.encoder.layers.2': 0,
'transformer.encoder.layers.3': 0,
'transformer.encoder.layers.4': 0,
'transformer.encoder.layers.5': 0,
'transformer.encoder.layers.6': 1,
'transformer.encoder.layers.7': 1,
'transformer.encoder.layers.8': 1,
'transformer.encoder.layers.9': 1,
'transformer.encoder.layers.10': 1,
'transformer.encoder.layers.11': 1,
'transformer.encoder.layers.12': 1,
'transformer.encoder.layers.13': 1,
'transformer.encoder.layers.14': 2,
'transformer.encoder.layers.15': 2,
'transformer.encoder.layers.16': 2,
'transformer.encoder.layers.17': 2,
'transformer.encoder.layers.18': 2,
'transformer.encoder.layers.19': 2,
'transformer.encoder.layers.20': 2,
'transformer.encoder.layers.21': 2,
'transformer.encoder.layers.22': 3,
...
'transformer.encoder.layers.25': 3,
'transformer.encoder.layers.26': 3,
'transformer.encoder.layers.27': 3,
'transformer.encoder.final_layernorm': 3,
'transformer.output_layer': 3} <-----
My question is: the lm_head layer in ChatGLM-6B and the output_layer in ChatGLM2-6B are both the last layer of their models, so why is lm_head placed on cuda:0 (same as the input layer) while output_layer is put on cuda:3 (different from the input layer)?
Because of this, when I train ChatGLM-6B everything is fine; but when I train ChatGLM2-6B, an error occurs during the loss computation in the forward pass:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
Do you know what the problem is? How can I fix this? Many thanks!
Update:
I have a workaround (which I think is too ugly, lol):
# Pin output_layer to the same device as the embedding, then reload the model with the edited map
model.hf_device_map['transformer.output_layer'] = model.hf_device_map['transformer.embedding']
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device_map=model.hf_device_map)
which is to manually change the output_layer's device and reload the model.
Hi @beyondguo
Thanks for the ping, and no problem at all
device_map='auto' will dispatch the model evenly across all available GPUs.
I think the issue you are facing is related to the fact that for the first model the lm_head weight is probably tied with the embedding layer (i.e. they are the same tensor), hence that layer ending up on the first GPU. For the second model, maybe the output layer is not tied to the embedding layer. Regarding your solution, I think it looks fine; you could probably load the first model on the meta device using the init_empty_weights() context manager from accelerate and make it slightly more efficient.
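For illustration, a rough sketch of that idea applied to the workaround above (the no_split_module_classes value and the exact device-map keys are assumptions based on the map printed earlier, not something verified against ChatGLM2-6B):

from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModel

# Build the model skeleton on the meta device: no weights are materialized,
# so this first "load" costs essentially no memory.
config = AutoConfig.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
with init_empty_weights():
    empty_model = AutoModel.from_config(config, trust_remote_code=True)

# Let accelerate propose a balanced device map, then pin the output layer to the
# same device as the embedding so the final projection and the labels match.
device_map = infer_auto_device_map(empty_model, no_split_module_classes=["GLMBlock"])  # assumed block class name
device_map["transformer.output_layer"] = device_map["transformer.embedding"]

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device_map=device_map)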
Thanks!
Hey, I've tried "everything" now, but can't get 8-bit LoRA multi-GPU training to work. I have a minimal example here:
https://gist.github.com/simeneide/80aa37108474aa32b82cb7258778287b
I also tried the device_map={'':torch.cuda.current_device()} trick above without success. Not really sure what you are doing, @beyondguo?
Anyone? I'm getting desperate 😂
transformers==4.31
bitsandbytes==0.41.1
accelerate==0.21.0
torch==2.0.1
Hi @simeneide
Thanks for the ping, can you try out the solution proposed in this comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994 ?
I hope the ping wasn't during sleeping hours 😬
Yes, that worked. Thank you very much!
Hahah no worries, it wasn't! Great that the solution worked! :D
System Info
transformers version: 4.27.0.dev0
Who can help?
@pacman100
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
I got this error when finetuning "EleutherAI/gpt-j-6B" using LoRA on 8×2080ti:
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Reproduce steps: clone this repo: https://github.com/CarperAI/trlx, modify the script examples/summarize_rlhf/sft/train_gptj_summarize.py, and run:
accelerate launch --num_processes 8 examples/summarize_rlhf/sft/train_gptj_summarize.py
Full error logs:
Expected behavior
I'm using 8×2080ti. When training on 1×2080ti and running python examples/summarize_rlhf/sft/train_gptj_summarize.py, the above code runs normally, which means the model and data can fit on a single GPU. I then want to use data parallelism rather than model parallelism, just like DDP. The load_in_8bit option in .from_pretrained() requires setting the device_map option. With device_map='auto', it seems that the model is spread across several GPUs, as in naive model parallelism, which results in this error while training:
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Maybe setting device_map correctly would solve this problem, but I can't find how to do this in the documentation.
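Based on the discussion above, a minimal sketch of setting device_map per process for this case, so that each of the 8 DDP workers loads its own full 8-bit replica (the device_map line is the essential part; the rest mirrors the issue):

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

process_index = Accelerator().process_index  # 0..7 when launched with --num_processes 8

# One full 8-bit copy of the model per process/GPU gives plain data parallelism (DDP),
# instead of device_map='auto' spreading layers across GPUs (naive model parallelism).
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    load_in_8bit=True,
    device_map={"": process_index},
)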