huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Puzzling behavior with loading time for large models - Llama 2 70B #2259

Closed: chiragjn closed this issue 8 months ago

chiragjn commented 10 months ago

System Info

- `Accelerate` version: 0.25.0
- Platform: Linux-5.15.0-1053-azure-x86_64-with-glibc2.31
- Python version: 3.10.13
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.1.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 432.95 GB
- GPU type: NVIDIA A100 80GB PCIe
- `Accelerate` default config:
        Not found

Reproduction

I am noticing some weird, not-so-easy-to-reproduce behavior. I am working with Llama 2 70B for QLoRA fine-tuning on 2 x A100 80 GB GPUs in DDP mode.

The safetensors weights are already present on disk. When I load the model with the config below, it takes about 45 minutes (~2700 seconds)!

import os

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map={"": os.getenv("LOCAL_RANK")},  # place the whole model on this DDP rank's GPU
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
)

So I tried loading the model with plain transformers, without DDP and without quantization:

pipeline("text-generation", "NousResearch/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto")

And it took about 730 seconds.

But once that had fully finished and I quit and relaunched my fine-tuning script, the model managed to load within 90 seconds almost every time, which was very puzzling to me.

I know there are factors like active memory and GPU memory consumption that go into Accelerate's dispatch calculations, so I waited several minutes between runs and made sure GPU memory was clear before starting.
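
For context, a quick way to sanity-check that the GPUs are actually idle before a run (just a rough sketch, not taken from my launcher; nvidia-smi shows the same thing):

import torch

# mem_get_info reports device-wide free/total memory, so it also reflects
# allocations made by other processes still holding the GPU
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")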

What can explain such a dramatic speedup?

Expected behavior

Ideally, it would be amazing if such a large model could consistently load within ~90 seconds every time.


EDIT: I saw the same behavior with Mixtral:

In [2]: from transformers import AutoConfig, pipeline

In [3]: import torch

In [4]: p = pipeline("text-generation", "mistralai/Mixtral-8x7B-Instruct-v0.1", device_map="auto", torch_dtype=torch.bfloat16)
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [29:15<00:00, 92.41s/it]

In [5]: exit
(ft) ubuntu@a10080gbx2-eastus-spot:/data$ ipython
Python 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.17.2 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: from transformers import AutoConfig, pipeline

In [3]: p = pipeline("text-generation", "mistralai/Mixtral-8x7B-Instruct-v0.1", device_map="auto", torch_dtype=torch.bfloat16)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 19/19 [00:12<00:00,  1.54it/s]

BenjaminBossan commented 10 months ago

This is unlikely to explain your observations, but if the model is always slow to load the first time around and fast the second time, it could theoretically be disk caching at work.
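
A rough way to test that hypothesis (just a sketch; the glob below assumes the default Hugging Face cache layout, so adjust the path to wherever your shards actually live) is to time a plain sequential read of the safetensors shards. If they are already in the OS page cache the read finishes in seconds; after clearing the cache with sync && echo 3 | sudo tee /proc/sys/vm/drop_caches it should take roughly as long as a cold model load:

import glob
import os
import time

# hypothetical location under the default HF cache; point this at your model's snapshot dir
pattern = "~/.cache/huggingface/hub/models--NousResearch--Llama-2-7b-chat-hf/snapshots/*/*.safetensors"
shards = glob.glob(os.path.expanduser(pattern))

start = time.perf_counter()
total = 0
for path in shards:
    with open(path, "rb") as f:
        # stream the file in chunks so we measure disk / page-cache throughput, not parsing
        while chunk := f.read(64 * 1024 * 1024):
            total += len(chunk)
elapsed = time.perf_counter() - start
print(f"Read {total / 1e9:.1f} GB from {len(shards)} shards in {elapsed:.1f} s")

If that read is fast exactly on the runs where the model also loads fast, the speedup is almost certainly the page cache, and pre-reading the shards (or a tool like vmtouch) before launching the fine-tuning job should make the ~90 second loads reproducible.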

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.