
Model saving (via `.save_pretrained` or `.push_to_hub`) produces inconsistent shard sizes when some weights are offloaded #33209

Closed. xenova closed this issue 1 month ago.

xenova commented 2 months ago

Who can help?

@SunMarc

Reproduction

Note: tested on an A100 GPU with 40GB of VRAM.

After loading a large model (e.g., via `.from_pretrained`) with `device_map="auto"` such that some weights must be offloaded to CPU, any subsequent call to serialize the model (e.g., `.save_pretrained` or `.push_to_hub`) produces n-1 correctly sized shards, followed by a single shard containing all the remaining weights.

Take https://huggingface.co/google/gemma-2-27b-it, for example: running

# pip install accelerate
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.save_pretrained('output')

produces 8 shards instead of the expected 12: the first 7 are ~5GB each and the last is ~20GB. Also note that 7 × 5GB = 35GB < 40GB of VRAM, meaning the first shards correspond to the weights that were still on the GPU when the model was serialized.


Expected behavior

All shards should be < MAX_SHARD_SIZE (defaults to 5GB)
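
For reference, a quick way to verify the shard sizes on disk (a minimal sketch; `output` is the directory from the reproduction above):

import os

# Every .safetensors shard should come in under the 5GB default limit.
for name in sorted(os.listdir("output")):
    if name.endswith(".safetensors"):
        size_gb = os.path.getsize(os.path.join("output", name)) / 1e9
        print(f"{name}: {size_gb:.2f} GB")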

xenova commented 2 months ago

Additionally, the reported total size only takes into account the first n-1 shards.

SunMarc commented 2 months ago

Thanks for the report @xenova! The easiest solution would be to update the `get_tensor_size` function in the huggingface_hub library, as it doesn't "work" with meta tensors:

def get_tensor_size(tensor: "torch.Tensor") -> int:
    return tensor.numel() * tensor.element_size()
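
For context, meta tensors retain their shape and dtype, so the `numel() * element_size()` computation itself still runs on them; a minimal sketch of the size arithmetic:

import torch

# A meta tensor carries no data, but its shape and dtype are known,
# so the byte count can still be computed.
t = torch.empty(1024, 1024, dtype=torch.bfloat16, device="meta")
print(t.numel() * t.element_size())  # 2097152 bytes (2 MiB)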

For example, in accelerate we have the following:

from typing import Tuple

import torch


def id_tensor_storage(tensor: torch.Tensor) -> Tuple[torch.device, int, int]:
    """
    Unique identifier to a tensor storage. Multiple different tensors can share the same underlying storage. For
    example, "meta" tensors all share the same storage, and thus their identifier will all be equal. This identifier is
    guaranteed to be unique and constant for this tensor's storage during its lifetime. Two tensor storages with
    non-overlapping lifetimes may have the same id.
    """
    _SIZE = {
        torch.int64: 8,
        torch.float32: 4,
        torch.int32: 4,
        torch.bfloat16: 2,
        torch.float16: 2,
        torch.int16: 2,
        torch.uint8: 1,
        torch.int8: 1,
        torch.bool: 1,
        torch.float64: 8,
    }
    try:
        storage_ptr = tensor.untyped_storage().data_ptr()
        storage_size = tensor.untyped_storage().nbytes()
    except Exception:
        # Fallback for torch==1.10
        try:
            storage_ptr = tensor.storage().data_ptr()
            storage_size = tensor.storage().size() * _SIZE[tensor.dtype]
        except NotImplementedError:
            # Fallback for meta storage
            storage_ptr = 0
            # On torch >=2.0 this is the tensor size
            storage_size = tensor.nelement() * _SIZE[tensor.dtype]

    return tensor.device, storage_ptr, storage_size
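
For illustration (assuming `id_tensor_storage` as defined above), two tensors that share the same underlying storage resolve to the same identifier, which is what lets the saving code detect shared weights:

import torch

a = torch.zeros(4)
b = a.view(2, 2)  # a view shares the underlying storage with `a`

# Same device, same storage pointer, same storage size.
assert id_tensor_storage(a) == id_tensor_storage(b)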

This way, the state dict will be properly split with the right tensor sizes. Note that the state_dict will contain meta tensors, but we update it afterwards using `get_state_dict_from_offload` (we can't do that beforehand, as we might not have enough storage on GPU + CPU, because some layers are stored on disk). LMK if this works for you @Wauplin!
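
A rough sketch of that flow, under the assumption that the shard layout has already been computed over the meta-containing state dict (`load_offloaded_weight` is a hypothetical stand-in for the actual offload-retrieval logic such as `get_state_dict_from_offload`):

def save_shards(state_dict, shard_layout, load_offloaded_weight, save_file):
    # `shard_layout` maps shard filename -> tensor names; it was computed
    # while some entries of `state_dict` were still meta tensors.
    for filename, names in shard_layout.items():
        shard = {}
        for name in names:
            tensor = state_dict[name]
            if tensor.device.type == "meta":
                # Materialize only the weights needed for this shard, so the
                # full model never has to fit in GPU + CPU memory at once.
                tensor = load_offloaded_weight(name)
            shard[name] = tensor
        save_file(shard, filename)
        del shard  # free the materialized weights before the next shard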

Wauplin commented 2 months ago

Hi @SunMarc, thanks for the explanation. Could you open a PR to update https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/serialization/_torch.py to work with offloaded tensors? Updating `get_torch_storage_size` with your suggestion shouldn't be too complex, but I'm not sure I understand how `save_torch_model` can be updated to store all tensors. Is this something that lives only in accelerate?

SunMarc commented 2 months ago

> but I'm not sure I understand how `save_torch_model` can be updated to store all tensors. Is this something that lives only in accelerate?

This is something that lives in transformers; no changes are required in huggingface_hub. I just wanted to explain how we are going to fill in these meta tensors after sharding them.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

xenova commented 1 month ago

bump :)

SunMarc commented 1 month ago

Sorry for the delay @xenova! This should be fixed by the PR above.