Using distributed or parallel set-up in script?: Yes
Using GPU in script?: Yes
GPU type: NVIDIA A10G
Who can help?
@ArthurZucker @SunMarc
Information
[ ] The official example scripts
[X] My own modified scripts
Tasks
[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
I am spawning a g5.12xlarge GPU instance on AWS SageMaker and loading a locally saved model with this script:
import os
# make all four GPUs visible before importing torch
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "<local_path>"
# this call is the slow step: it shards the checkpoint across the 4 GPUs
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
This happens with almost all the models I am trying; rhymes-ai/Aria can be used to reproduce it.
Expected behavior
The last line takes 40-50+ minutes to load the model. I have observed the same behaviour with multiple other models as well.
Things I have tried/observed:
The slow load happens on the first load after an instance restart. Once the model has been loaded (after waiting 40-50 minutes) and I restart the notebook kernel, all subsequent model loads are very fast (almost instant).
However, if I restart the instance, the problem reappears on the first load.
I suspected it was taking time to decide which layer goes on which GPU, since I am using a cluster of 4 GPUs. To rule this out, I saved the device_map of an already loaded model and passed it to the loading constructor as device_map instead of using "auto" (see the first sketch after this list), but it did not solve the issue.
I also suspected slow memory/disk read speeds, so I benchmarked that by loading the model on CPU only (see the second sketch below): it loaded almost instantly, so memory and I/O are not the bottleneck.
The individual shards show the same behaviour. For example, with a model that has 10 shards, if I restart the notebook after the first 4 shards have loaded, loading the model again is very fast for those first 4 shards. I have verified that GPU memory usage is 0 before the second load, so no leftover layers remain from the first attempt.
The time taken to load the shards is also not uniform: some shards take about 3 minutes, while others take more than 10-15 minutes.
The model is already saved in BF16, so dtype casting does not seem to be the issue either.
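
A minimal sketch of the device_map experiment mentioned above (the path and the JSON file name are placeholders; hf_device_map is the placement that accelerate records after an "auto" load):

import json
import torch
from transformers import AutoModelForCausalLM

model_id_or_path = "<local_path>"

# first (slow) load: let device_map="auto" compute the placement, then save it
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
with open("device_map.json", "w") as f:
    json.dump(model.hf_device_map, f)

# later load: reuse the saved placement instead of "auto"
# (this did not make the first load after an instance restart any faster)
with open("device_map.json") as f:
    device_map = json.load(f)
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True)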
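
And the CPU-only load I used to rule out disk I/O, roughly (the timing code is illustrative):

import time
import torch
from transformers import AutoModelForCausalLM

model_id_or_path = "<local_path>"

start = time.perf_counter()
# load entirely on CPU: reads the same shards from disk, but finishes almost instantly
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)
print(f"CPU load took {time.perf_counter() - start:.1f}s")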
System Info
transformers version: 4.45.0