Open Qubitium opened 8 hours ago
Hey @Qubitium, the model was indeed serialized as bf16, but here you're not specifying in which dtype you would like to load it. We follow torch's default loading mechanism, which is to automatically load it in the default torch.dtype (here, fp32) so as to be compatible with all hardware and setups.
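For reference, torch's own default dtype is what governs that behavior; a quick sketch (no model-specific assumptions here):

```python
import torch

# Newly allocated parameters use torch's process-wide default dtype,
# which is float32 unless it has been changed.
print(torch.get_default_dtype())            # torch.float32
print(torch.nn.Linear(2, 2).weight.dtype)   # torch.float32
```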
In order to update the dtype in which it should be loaded, please change this line:
```diff
- model = AutoModelForCausalLM.from_pretrained(model_file)
+ model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype=torch.bfloat16)
```
You can also use 'auto' so as to respect the dtype of the weights themselves:
```diff
- model = AutoModelForCausalLM.from_pretrained(model_file)
+ model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype='auto')
```
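Put together as a small runnable sketch, assuming a checkpoint that was serialized in bfloat16 (the model id below is only an illustrative placeholder):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_file = "Qwen/Qwen2.5-0.5B"  # placeholder: any checkpoint serialized in bfloat16

# config.json carries the serialized dtype...
print(AutoConfig.from_pretrained(model_file).torch_dtype)  # torch.bfloat16

# ...but the default load ignores it and materializes float32 weights:
model = AutoModelForCausalLM.from_pretrained(model_file)
print(model.dtype)  # torch.float32

# Explicit override:
model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype=torch.bfloat16)
print(model.dtype)  # torch.bfloat16

# Or follow config.json / the checkpoint weights:
model = AutoModelForCausalLM.from_pretrained(model_file, torch_dtype="auto")
print(model.dtype)  # torch.bfloat16
```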
You can read more about this in the from_pretrained documentation, which I am pasting below:
@LysandreJik It's 2024 and I would like to propose that the float32 default be changed. Please read the below with a light heart.
Reasons:

- AutoModelForCausalLM has "auto" in its name, but it is only auto sometimes. When? We don't know.
- from_pretrained honors and reads model properties from config.json by default, but not the dtype in that same json.
- AutoModelForCausalLM and the api return fp32 by default. Assuming the cpu/gpu device is compatible with config.dtype, why?
- dtype=auto reads from config.json first, then does auto. What does "auto" mean in this context if it reads from config?

Overall: accept the config.json default as truth unless there is an override, or the default is really incompatible with the gpu/cpu, i.e. when the device does not physically support the model-specified dtype.
torch_dtype (`str` or `torch.dtype`, *optional*):
Override the default `torch.dtype` and load the model under a specific `dtype`. The different options
are:
1. `torch.float16` or `torch.bfloat16` or `torch.float`: load in a specified
`dtype`, ignoring the model's `config.torch_dtype` if one exists. If not specified
- the model will get loaded in `torch.float` (fp32).
2. `"auto"` - A `torch_dtype` entry in the `config.json` file of the model will be
attempted to be used. If this entry isn't found then next check the `dtype` of the first weight in
the checkpoint that's of a floating point type and use that as `dtype`. This will load the model
using the `dtype` it was saved in at the end of the training. It can't be used as an indicator of how
the model was trained. Since it could be trained in one of half precision dtypes, but saved in fp32.
3. A string that is a valid `torch.dtype`. E.g. "float32" loads the model in `torch.float32`, "float16" loads in `torch.float16` etc.
<Tip>
For some models the `dtype` they were trained in is unknown - you may try to check the model's paper or
reach out to the authors and ask them to add this information to the model's card and to insert the
`torch_dtype` entry in `config.json` on the hub.
</Tip>
System Info
Ubuntu 24.04
Transformers 4.46.2
Accelerate 1.1.1
Safetensors 0.4.5
Who can help?
@ArthurZucker
Reproduction
Unexpected 2x cpu memory usage due to a bf16 safetensors checkpoint being loaded as float32 on device=cpu. Manually passing torch_dtype=torch.bfloat16 avoids the issue, but this should not be necessary since both model.config and the safetensors files have the proper bfloat16 dtype.
Sample reproducing code:
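A minimal sketch of such a reproduction, assuming a bf16-serialized checkpoint (the model id and the psutil-based memory readout are illustrative choices):

```python
import gc

import psutil
import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-0.5B"  # placeholder: config.json and safetensors are bfloat16

def rss_gb() -> float:
    # Resident memory of this process, in GiB.
    return psutil.Process().memory_info().rss / 1024**3

print(f"RSS before load: {rss_gb():.2f} GiB")

# Default load on cpu: the bf16 weights are materialized as float32,
# so resident memory grows by roughly 2x the checkpoint size.
model = AutoModelForCausalLM.from_pretrained(model_id)
print(f"dtype={model.dtype}, RSS: {rss_gb():.2f} GiB")

# Free the fp32 copy before the second load (running each load in a separate
# process gives cleaner numbers, since the allocator may hold on to memory).
del model
gc.collect()

# Respecting the serialized dtype keeps the footprint in line with the bf16 checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
print(f"dtype={model.dtype}, RSS: {rss_gb():.2f} GiB")
```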
Code output:
Expected behavior
Modify the above code to pass torch_dtype=torch.bfloat16 to from_pretrained and memory usage is normal/expected.

There are two related issues here. Manually passing dtype=bfloat16 to from_pretrained fixes this issue.