Using distributed or parallel set-up in script?: Yes
Using GPU in script?: Yes
GPU type: NVIDIA A10G
Who can help?
@ArthurZucker @SunMarc
Information
[ ] The official example scripts
[X] My own modified scripts
Tasks
[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
I am spawning a g5.12xlarge GPU instance on AWS SageMaker and loading a locally saved model with this script:
import os
# make all four GPUs visible before importing torch
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "<local_path>"
# this call is the slow step: it shards the checkpoint across the 4 GPUs
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
This happens with almost all the models I am trying; rhymes-ai/Aria can be used to reproduce it.
Expected behavior
The last line takes 40-50+ minutes to load the model. I have observed the same behaviour with multiple other models as well.
Things I have tried/observed:
The slow load happens on the first load after an instance restart. Once the model has been loaded (after waiting 40-50 minutes) and I restart the notebook kernel, all subsequent model loads are very fast (almost instant).
However, if I restart the instance, the problem reappears on the first load.
I suspected it was taking time to decide which layer goes on which GPU, since I am using a cluster of 4 GPUs. To rule this out, I saved the device_map of an already loaded model and passed it to the loading constructor as device_map instead of using "auto" (see the first sketch after this list), but it did not solve the issue.
I also suspected slow memory/disk read speeds, so I benchmarked that by loading the model on CPU only (see the second sketch below): it loaded almost instantly, so memory and I/O are not the bottleneck.
The individual shards show the same behaviour. For example, with a model that has 10 shards, if I restart the notebook after the first 4 shards have loaded, loading the model again is very fast for those first 4 shards. I have verified that GPU memory usage is 0 before the second load, so no leftover layers remain from the first attempt.
The time taken to load the shards is also not uniform: some shards take about 3 minutes, while others take more than 10-15 minutes.
The model is already saved in BF16, so dtype casting does not seem to be the issue either.
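
A minimal sketch of the device_map experiment mentioned above (the path and the JSON file name are placeholders; hf_device_map is the placement that accelerate records after an "auto" load):

import json
import torch
from transformers import AutoModelForCausalLM

model_id_or_path = "<local_path>"

# first (slow) load: let device_map="auto" compute the placement, then save it
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
with open("device_map.json", "w") as f:
    json.dump(model.hf_device_map, f)

# later load: reuse the saved placement instead of "auto"
# (this did not make the first load after an instance restart any faster)
with open("device_map.json") as f:
    device_map = json.load(f)
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True)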
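
And the CPU-only load I used to rule out disk I/O, roughly (the timing code is illustrative):

import time
import torch
from transformers import AutoModelForCausalLM

model_id_or_path = "<local_path>"

start = time.perf_counter()
# load entirely on CPU: reads the same shards from disk, but finishes almost instantly
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)
print(f"CPU load took {time.perf_counter() - start:.1f}s")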
System Info
transformers version: 4.45.0