huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

High cpu memory usage as bf16 model is auto loaded as fp32 #34743

Open Qubitium opened 8 hours ago

Qubitium commented 8 hours ago

System Info

Ubuntu 24.04, Transformers 4.46.2, Accelerate 1.1.1, Safetensors 0.4.5

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

Unexpected 2x CPU memory usage due to a bf16 safetensors checkpoint being loaded as float32 on device=cpu.

Manually passing torch_dtype=torch.bfloat16 avoids the issue, but this should not be necessary since both model.config and the safetensors files already specify bfloat16.

Sample reproducing code:

import torch
from transformers import AutoModelForCausalLM
import psutil

# model is stored as bf16 safetensor
model_file = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_file)

process = psutil.Process()
memory_info = process.memory_info()
print(f"RSS (Resident Set Size): {memory_info.rss / 1024 / 1024:.2f} MB")
print(f"VMS (Virtual Memory Size): {memory_info.vms / 1024 / 1024:.2f} MB")

print(f"model config dtype is {model.config.torch_dtype}")
assert model.config.torch_dtype == torch.bfloat16

p = next(model.parameters())
print(f"model first parameter dtype: {p.dtype}, device: {p.device}")
assert p.device == torch.device("cpu")
assert p.dtype == torch.bfloat16

Code output:

Traceback (most recent call last):
  File "/GPTQModel/test.py", line 20, in <module>
    assert p.dtype == torch.bfloat16
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
RSS (Resident Set Size): 5189.39 MB <----- High memory usage
VMS (Virtual Memory Size): 41335.09 MB
model config dtype is torch.bfloat16
model first parameter dtype: torch.float32, device: cpu <----- Wrong dtype

Expected behavior

Modifying the above code to pass torch_dtype=torch.bfloat16 to from_pretrained gives the normal/expected memory usage.
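The only change is the from_pretrained call:

model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype=torch.bfloat16)

Output with this change: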

RSS (Resident Set Size): 603.85 MB <----- Expected memory usage
VMS (Virtual Memory Size): 40607.80 MB
model config dtype is torch.bfloat16
model first parameter dtype: torch.bfloat16, device: cpu

There are two related issues here:

  1. bfloat16 weights are wrongly inflated to float32, causing very high memory usage
  2. safetensors weights should be lazy-loaded, so only around 600MB of weights should end up in memory (see the sketch below)

Manually passing torch_dtype=torch.bfloat16 to from_pretrained fixes this issue.
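On point 2, a minimal sketch of what I mean by lazy loading, using the safetensors API directly (model_path here is a hypothetical local path to the downloaded checkpoint, not something from_pretrained exposes):

from safetensors import safe_open

model_path = "model.safetensors"  # hypothetical local path to the bf16 checkpoint

# safe_open memory-maps the file; tensors are only materialized when requested,
# and they come back in their stored dtype (bf16 here), not float32.
with safe_open(model_path, framework="pt", device="cpu") as f:
    first_name = next(iter(f.keys()))
    t = f.get_tensor(first_name)
    print(first_name, t.dtype)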

LysandreJik commented 6 hours ago

Hey @Qubitium, the model was indeed serialized as bf16, but here you're not specifying in which dtype you would like to load it.

We follow torch's default loading mechanism, which is to automatically load it in the default torch.dtype (here, fp32) so as to be compatible with all hardware and setups.
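You can confirm that default yourself (a quick sanity check, assuming nothing in your environment has changed torch's default dtype):

import torch

print(torch.get_default_dtype())  # torch.float32 on a stock setup; changeable via torch.set_default_dtype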

In order to update the dtype in which it should be loaded, please change this line:

- model = AutoModelForCausalLM.from_pretrained(model_file)
+ model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype=torch.bfloat16)

You can also use 'auto' so as to respect the dtype of the weights themselves:

- model = AutoModelForCausalLM.from_pretrained(model_file)
+ model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype='auto')
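Putting it together, a minimal sketch of the 'auto' variant of your reproduction script; it should report bfloat16 parameters and memory usage comparable to the explicit torch_dtype=torch.bfloat16 case:

import torch
from transformers import AutoModelForCausalLM

# 'auto' resolves the dtype from config.json (falling back to the dtype of the
# first floating-point weight in the checkpoint), so the bf16 weights stay bf16.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype="auto"
)

p = next(model.parameters())
print(p.dtype, p.device)  # expected: torch.bfloat16 cpu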

You can read more about this in the from_pretrained documentation which I am pasting below:

[screenshot of the torch_dtype section of the from_pretrained docstring]

Qubitium commented 4 hours ago

@LysandreJik It's 2024 and I would like to propose that the float32 default be changed. Please read the following with a light heart.

Reasons:

Overall, accept the config.json dtype as the source of truth unless there is an explicit override, or unless that dtype is genuinely incompatible with the GPU/CPU, i.e. the device does not physically support the model-specified dtype. A rough sketch of what I am proposing follows the docstring quote below.

torch_dtype (`str` or `torch.dtype`, *optional*):
     Override the default `torch.dtype` and load the model under a specific `dtype`. The different options
     are:

     1. `torch.float16` or `torch.bfloat16` or `torch.float`: load in a specified
      `dtype`, ignoring the model's `config.torch_dtype` if one exists. If not specified
      - the model will get loaded in `torch.float` (fp32).

      2. `"auto"` - A `torch_dtype` entry in the `config.json` file of the model will be
      attempted to be used. If this entry isn't found then next check the `dtype` of the first weight in
      the checkpoint that's of a floating point type and use that as `dtype`. This will load the model
      using the `dtype` it was saved in at the end of the training. It can't be used as an indicator of how
      the model was trained. Since it could be trained in one of half precision dtypes, but saved in fp32.

      3. A string that is a valid `torch.dtype`. E.g. "float32" loads the model in `torch.float32`, "float16" loads in `torch.float16` etc.

      <Tip>

      For some models the `dtype` they were trained in is unknown - you may try to check the model's paper or
      reach out to the authors and ask them to add this information to the model's card and to insert the
      `torch_dtype` entry in `config.json` on the hub.

      </Tip>
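For illustration only, a rough sketch of the dtype resolution I am proposing (resolve_load_dtype is a hypothetical helper, not an existing transformers API):

import torch

def resolve_load_dtype(config, override=None, device="cpu"):
    # Hypothetical helper: an explicit override wins; otherwise trust
    # config.torch_dtype; only fall back to float32 when the config carries
    # no dtype or the target device cannot handle the configured one.
    if override is not None:
        return override
    cfg_dtype = getattr(config, "torch_dtype", None)
    if cfg_dtype is None:
        return torch.float32
    if (
        cfg_dtype == torch.bfloat16
        and device == "cuda"
        and not torch.cuda.is_bf16_supported()
    ):
        return torch.float32
    return cfg_dtype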