Open · mvafin opened 1 month ago
cc @DN6
@DN6 @a-r-r-o-w Any recommendations on how we could determine what precision a model is saved in?
There is no guaranteed way to know what dtype one should run inference in, and you can only make assumptions, since a state dict can hold different parameters in different dtypes. For Diffusers models, the precision the transformer/unet is saved in is typically the dtype to be used for inference.
You can find this dtype with something like:
```python
import torch

state_dict = torch.load("path/to/state_dict.pt", map_location="cpu")  # or load using safetensors
dtype = next(iter(state_dict.values())).dtype  # dtype of the first tensor in the state dict
```
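If the checkpoint is a `.safetensors` file, the same check can be done without loading the whole state dict; a minimal sketch (the path below is just a placeholder):

```python
from safetensors import safe_open

# Read a single tensor to see which dtype the checkpoint was saved in
with safe_open("path/to/diffusion_pytorch_model.safetensors", framework="pt") as f:
    first_key = next(iter(f.keys()))
    dtype = f.get_tensor(first_key).dtype
```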
@a-r-r-o-w We do not have access to the safetensors file if we load models by calling `from_pretrained` or `from_config` for a diffusers pipeline. How can we check the precision before loading the model into memory?
Can `torch_dtype="auto"` be implemented on the diffusers side? It could check the precision of the weights internally and load the model in the right precision.
@a-r-r-o-w, we are trying to export diffusers models to OpenVINO in optimum in a way that is economical with allocated memory. For that, it is important to avoid on-the-fly weight conversion and to load a model in its original precision, which also makes it possible to mmap the weights, for example, and keeps the RAM requirements low. Do you have any plans to unlock that?
Gentle ping @DN6 @yiyixuxu
Looks like we can use `DiffusionPipeline.download` for such a use case. Any reason this is not a good idea? Instead of `from_pretrained`, we call `DiffusionPipeline.download`, check the precision of the safetensors in the `unet` or `transformer` directory, and then call `from_pretrained` with the correct `torch_dtype`.
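A minimal sketch of that workaround, assuming a repo with a `unet` (or `transformer`) subfolder holding safetensors weights; the model id is just a placeholder:

```python
import glob
import os

from diffusers import DiffusionPipeline
from safetensors import safe_open

# Download the repo contents without instantiating the pipeline
local_dir = DiffusionPipeline.download("stabilityai/stable-diffusion-2-1")

# Find safetensors weights in the unet (or transformer) subfolder
weight_files = glob.glob(os.path.join(local_dir, "unet", "*.safetensors")) or glob.glob(
    os.path.join(local_dir, "transformer", "*.safetensors")
)

# Inspect one tensor to learn the dtype the weights were saved in
with safe_open(weight_files[0], framework="pt") as f:
    key = next(iter(f.keys()))
    saved_dtype = f.get_tensor(key).dtype

# Load the pipeline in that precision instead of the float32 default
pipe = DiffusionPipeline.from_pretrained(local_dir, torch_dtype=saved_dtype)
```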
I think we can implement `auto` dtype, but it is a little bit low priority right now
> I think we can implement `auto` dtype, but it is a little bit low priority right now
It would be a good alignment between transformers and diffusers and would let us avoid this workaround: https://github.com/huggingface/optimum-intel/pull/1033. Since @yiyixuxu mentioned it was a low priority, I don't see a chance to avoid this workaround in the short term.
@slyalin really sorry - 100% agreed it would be really good to have this feature!
it was designed in a way that the checkpoint name needs to reflect the precision, e.g. float16 should be saved as `diffusion_pytorch_model.fp16.safetensors`
https://github.com/huggingface/diffusers/blob/c96bfa5c80eca798d555a79a491043c311d0f608/src/diffusers/models/attention.py#L190 - obviously (we should have known better, too), we would not be able to enforce this rule, and with lower-precision checkpoints getting more and more popular, no one actually saves them with the expected name pattern; we do need something like `torch_dtype="auto"`.
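For context, with that naming scheme the lower-precision checkpoint is selected explicitly through the `variant` argument; a minimal example (the model id is just a placeholder):

```python
import torch
from diffusers import DiffusionPipeline

# Picks up weights saved as diffusion_pytorch_model.fp16.safetensors
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    variant="fp16",
    torch_dtype=torch.float16,
)
```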
with that being said, the team is really overwhelmed right now and we will work on this as soon as we have more bandwidth
**Is your feature request related to a problem? Please describe.**
Currently, if `torch_dtype` is not specified, the pipeline defaults to loading in `float32`. This behavior causes `float16` or `bfloat16` weights to be upcast to `float32` when the model is saved in lower precision, leading to increased memory usage. In scenarios where memory efficiency is critical (e.g., when exporting the model to another format), it is important to load the model in the original precision specified in the safetensors file. Additionally, there is currently no way to determine the dtype the model was saved in.

**Describe the solution you'd like.**
A feature similar to `torch_dtype="auto"` in the transformers library would be helpful. This option allows models to be loaded with the dtype defined in their configuration. However, diffusers pipeline models generally lack a dtype specification in their configs. It is sometimes possible to use `torch_dtype` from the `text_encoder` config, but not all pipelines have it, and it is not clear whether that is a reliable place to check the precision of the model.
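For reference, in transformers the option lets `from_pretrained` pick the dtype recorded in the model config (or derived from the checkpoint) instead of defaulting to `float32`; the model name below is just a placeholder:

```python
from transformers import AutoModel

# "auto" uses config.torch_dtype if present, otherwise the dtype of the saved weights
model = AutoModel.from_pretrained("bert-base-uncased", torch_dtype="auto")
print(model.dtype)
```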
**Describe alternatives you've considered.**
A possible solution could be implementing a method to identify the model's precision prior to calling `from_pretrained`, as the weights are accessible only after the model is downloaded inside `from_pretrained` and remain hidden from external access. This approach would allow users to set the appropriate `torch_dtype` for loading the model.

**Additional context.**
This feature is relevant to `optimum-cli` use cases where model conversion or export to other formats must work within memory constraints. If there is already a way to achieve this, guidance would be appreciated.