Open · mvafin opened 1 month ago
cc @DN6
@DN6 @a-r-r-o-w Any recommendations on how we could determine what precision a model is saved in?
There is no guaranteed way to know what dtype one should run inference in, and you can only make assumptions, since a state dict can hold different parameters in different dtypes. For Diffusers models, the precision the transformer/unet is saved in is typically the dtype to be used for inference.
You can find this dtype with something like:
```python
import torch

state_dict = torch.load("path/to/state_dict.pt", map_location="cpu")  # or load using safetensors
dtype = next(iter(state_dict.values())).dtype  # dtype of the first tensor in the state dict
```
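If the checkpoint is a `.safetensors` file, the same check can be done without loading the whole state dict; a minimal sketch (the path below is just a placeholder):

```python
from safetensors import safe_open

# Read a single tensor to see which dtype the checkpoint was saved in
with safe_open("path/to/diffusion_pytorch_model.safetensors", framework="pt") as f:
    first_key = next(iter(f.keys()))
    dtype = f.get_tensor(first_key).dtype
```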
@a-r-r-o-w We do not have access to the safetensors file if we load models by calling `from_pretrained` or `from_config` for a diffusers pipeline. How can we check the precision before loading the model into memory?
Can `torch_dtype="auto"` be implemented on the diffusers side? It could check the precision of the weights internally and load the model in the right precision.
@a-r-r-o-w, we are trying to export diffusers models to OpenVINO in optimum in a way that is economical with allocated memory. For that, it is important to avoid on-the-fly weight conversion and to load a model in its original precision, which also makes it possible to mmap the weights, for example, and keeps the RAM requirements low. Do you have any plans to unlock that?
Gentle ping @DN6 @yiyixuxu
Looks like we can use `DiffusionPipeline.download` for such a use case. Any reason this is not a good idea? Instead of `from_pretrained`, we call `DiffusionPipeline.download`, check the precision of the safetensors in the `unet` or `transformer` directory, and then call `from_pretrained` with the correct `torch_dtype`.
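A minimal sketch of that workaround, assuming a repo with a `unet` (or `transformer`) subfolder holding safetensors weights; the model id is just a placeholder:

```python
import glob
import os

from diffusers import DiffusionPipeline
from safetensors import safe_open

# Download the repo contents without instantiating the pipeline
local_dir = DiffusionPipeline.download("stabilityai/stable-diffusion-2-1")

# Find safetensors weights in the unet (or transformer) subfolder
weight_files = glob.glob(os.path.join(local_dir, "unet", "*.safetensors")) or glob.glob(
    os.path.join(local_dir, "transformer", "*.safetensors")
)

# Inspect one tensor to learn the dtype the weights were saved in
with safe_open(weight_files[0], framework="pt") as f:
    key = next(iter(f.keys()))
    saved_dtype = f.get_tensor(key).dtype

# Load the pipeline in that precision instead of the float32 default
pipe = DiffusionPipeline.from_pretrained(local_dir, torch_dtype=saved_dtype)
```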
I think we can implement `auto` dtype, but it is a little bit low priority right now
> I think we can implement `auto` dtype, but it is a little bit low priority right now
It would be a good alignment between transformers and diffusers and would let us avoid this workaround: https://github.com/huggingface/optimum-intel/pull/1033. Since @yiyixuxu mentioned it was a low priority, I don't see a chance to avoid this workaround in the short term.
@slyalin really sorry - 100% agreed it would be really good to have this feature!
it was designed in a way that the checkpoint name needs to reflect the precision, e.g. float16 should be saved as `diffusion_pytorch_model.fp16.safetensors`
https://github.com/huggingface/diffusers/blob/c96bfa5c80eca798d555a79a491043c311d0f608/src/diffusers/models/attention.py#L190 - obviously (we should have known better, too), we would not be able to enforce this rule, and with lower-precision checkpoints getting more and more popular, no one actually saves them with the expected name pattern; we do need something like `torch_dtype="auto"`.
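For context, with that naming scheme the lower-precision checkpoint is selected explicitly through the `variant` argument; a minimal example (the model id is just a placeholder):

```python
import torch
from diffusers import DiffusionPipeline

# Picks up weights saved as diffusion_pytorch_model.fp16.safetensors
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    variant="fp16",
    torch_dtype=torch.float16,
)
```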
with that being said, the team is really overwhelmed right now and we will work on this as soon as we have more bandwidth
**Is your feature request related to a problem? Please describe.**
Currently, if `torch_dtype` is not specified, the pipeline defaults to loading in `float32`. This behavior causes `float16` or `bfloat16` weights to be upcast to `float32` when the model is saved in lower precision, leading to increased memory usage. In scenarios where memory efficiency is critical (e.g., when exporting the model to another format), it is important to load the model in the original precision specified in the safetensors file. Additionally, there is currently no way to determine the dtype the model was saved in.

**Describe the solution you'd like.**
A feature similar to `torch_dtype="auto"` in the transformers library would be helpful. This option allows models to be loaded with the dtype defined in their configuration. However, diffusers pipeline models generally lack a dtype specification in their configs. It is sometimes possible to use `torch_dtype` from the `text_encoder` config, but not all pipelines have it, and it is not clear whether that is a reliable place to check the precision of the model.
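For reference, in transformers the option lets `from_pretrained` pick the dtype recorded in the model config (or derived from the checkpoint) instead of defaulting to `float32`; the model name below is just a placeholder:

```python
from transformers import AutoModel

# "auto" uses config.torch_dtype if present, otherwise the dtype of the saved weights
model = AutoModel.from_pretrained("bert-base-uncased", torch_dtype="auto")
print(model.dtype)
```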
**Describe alternatives you've considered.**
A possible solution could be implementing a method to identify the model's precision prior to calling `from_pretrained`, as the weights are accessible only after the model is downloaded inside `from_pretrained` and remain hidden from external access. This approach would allow users to set the appropriate `torch_dtype` for loading the model.

**Additional context.**
This feature is relevant to `optimum-cli` use cases where model conversion or export to other formats must work within memory constraints. If there is already a way to achieve this, guidance would be appreciated.