Additionally, the total size only takes into account the first n-1
shards.
Thanks for the report @xenova! The easiest solution would be to update the get_tensor_size function in the huggingface_hub library, as it doesn't "work" with meta tensors:
```python
def get_tensor_size(tensor: "torch.Tensor") -> int:
    return tensor.numel() * tensor.element_size()
```
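For reference (assuming the get_tensor_size definition above is in scope), the value it returns is simply the element count times the per-element byte size:

```python
import torch

# Logical size from shape and dtype only; no storage is inspected.
t = torch.empty(1000, 1000, dtype=torch.float16)
assert get_tensor_size(t) == 1000 * 1000 * 2  # 2 bytes per float16 element
```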
In accelerate, for example, we have the following:
```python
import torch
from typing import Tuple


def id_tensor_storage(tensor: torch.Tensor) -> Tuple[torch.device, int, int]:
    """
    Unique identifier to a tensor storage. Multiple different tensors can share the same underlying storage. For
    example, "meta" tensors all share the same storage, and thus their identifier will all be equal. This identifier is
    guaranteed to be unique and constant for this tensor's storage during its lifetime. Two tensor storages with
    non-overlapping lifetimes may have the same id.
    """
    _SIZE = {
        torch.int64: 8,
        torch.float32: 4,
        torch.int32: 4,
        torch.bfloat16: 2,
        torch.float16: 2,
        torch.int16: 2,
        torch.uint8: 1,
        torch.int8: 1,
        torch.bool: 1,
        torch.float64: 8,
    }
    try:
        storage_ptr = tensor.untyped_storage().data_ptr()
        storage_size = tensor.untyped_storage().nbytes()
    except Exception:
        # Fallback for torch==1.10
        try:
            storage_ptr = tensor.storage().data_ptr()
            storage_size = tensor.storage().size() * _SIZE[tensor.dtype]
        except NotImplementedError:
            # Fallback for meta storage
            storage_ptr = 0
            # On torch >=2.0 this is the tensor size
            storage_size = tensor.nelement() * _SIZE[tensor.dtype]
    return tensor.device, storage_ptr, storage_size
```
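As a small illustration of the identifier part (a sketch, assuming the id_tensor_storage definition above is in scope), two views of the same storage map to the same (device, ptr, size) triple, which is how tied or shared weights are detected:

```python
import torch

# Two views of the same underlying storage get the same identifier.
base = torch.zeros(4, 4, dtype=torch.float32)
view = base[:2]
print(id_tensor_storage(base) == id_tensor_storage(view))  # True
```

For this issue, the relevant part is the size component: for offloaded tensors it should reflect the tensor's logical byte size rather than 0, so that the shard planning adds up correctly.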
This way, the state dict will be properly split with the right tensor sizes. Note that the state_dict will still contain meta tensors, but we update the state dict afterwards using get_state_dict_from_offload (we can't do that before sharding, as we might not have enough storage on GPUs + CPU because some layers are stored on disk). LMK if this works for you @Wauplin!
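To make the sizing point concrete, here is a minimal, self-contained sketch; the helper names are illustrative, not the real transformers/huggingface_hub API:

```python
import torch

# Sketch only: illustrative helpers (tensor_size, plan_shards), not the actual
# transformers / huggingface_hub API. Phase 1 plans shards using sizes that are
# valid even for meta tensors; phase 2 (filling the meta tensors with
# get_state_dict_from_offload before writing each shard) is not shown here.

def tensor_size(t: torch.Tensor) -> int:
    # Element-based size: defined from shape and dtype, so it also works for
    # offloaded ("meta") tensors that have no real storage behind them.
    return t.numel() * t.element_size()

def plan_shards(state_dict: dict, max_shard_size: int) -> list:
    # Greedy split of the state dict into shards of at most max_shard_size bytes.
    shards, current, current_size = [], {}, 0
    for name, t in state_dict.items():
        size = tensor_size(t)
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = {}, 0
        current[name] = t
        current_size += size
    if current:
        shards.append(current)
    return shards

# Tiny demo with one real and two "offloaded" (meta) parameters of ~4MB each.
state_dict = {
    "layer1.weight": torch.zeros(1024, 1024),                 # on CPU
    "layer2.weight": torch.zeros(1024, 1024, device="meta"),  # offloaded
    "layer3.weight": torch.zeros(1024, 1024, device="meta"),  # offloaded
}
shards = plan_shards(state_dict, max_shard_size=5 * 1024 * 1024)
print(len(shards))  # 3; if meta tensors counted as 0 bytes, everything would fit in 1
```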
Hi @SunMarc, thanks for the explanation. Could you open a PR to update https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/serialization/_torch.py to work with offloaded tensors? Updating get_torch_storage_size with your suggestion shouldn't be too complex, but I'm not sure I understand how save_torch_model can be updated to store all tensors. Is this something that lives only in accelerate?
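For what it's worth, a minimal sketch of what a meta-aware size helper could look like, following the accelerate fallback quoted above (illustrative only; the name and the exact branching are assumptions, not the actual huggingface_hub implementation):

```python
import torch

def get_torch_storage_size_meta_safe(tensor: torch.Tensor) -> int:
    # Sketch only: mirror the accelerate fallback so offloaded ("meta") tensors
    # report their logical byte size instead of an empty or failing storage size.
    if tensor.device.type == "meta":
        return tensor.nelement() * tensor.element_size()
    try:
        return tensor.untyped_storage().nbytes()
    except Exception:
        # Fallback for older torch versions without untyped_storage()
        return tensor.storage().size() * tensor.element_size()
```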
> but I'm not sure I understand how save_torch_model can be updated to store all tensors. Is this something that lives only in accelerate?
This is something that lives in transformers. No changes required in huggingface_hub. I just wanted to explain how we were going to fill these meta tensors after sharding them.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
bump :)
Sorry for the delay @xenova! This should be fixed in the PR above
System Info
transformers version: 4.44.2

Who can help?
@SunMarc

Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Note: Tested on an A100 GPU w/ 40GB VRAM

After loading a large model (e.g., via .from_pretrained) with device_map='auto' such that certain parts need to be offloaded to CPU, any subsequent call to serialize the model (e.g., .save_pretrained or .push_to_hub) results in n-1 correct shards, followed by 1 shard containing all the remaining weights.

Take https://huggingface.co/google/gemma-2-27b-it for example: loading it this way and then re-saving it produces 8 shards instead of the expected 12. The first 7 are of size ~5GB and the last is ~20GB. Also note that 7 * 5 = 35 < 40 (VRAM), meaning the first few shards were on the GPU when the model was serialized.
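A reconstruction of the reproduction (the original snippet isn't included above, so the model id is taken from the link and the dtype is an assumption):

```python
import torch
from transformers import AutoModelForCausalLM

# Load a model too large for a single 40GB GPU so part of it is offloaded,
# then re-save it; the resulting shard sizes show the issue.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",           # offloads layers that don't fit on the GPU
    torch_dtype=torch.bfloat16,  # assumption: not stated in the report
)
model.save_pretrained("gemma-2-27b-it-resaved")  # default max_shard_size is "5GB"
```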
Expected behavior
All shards should be < MAX_SHARD_SIZE (defaults to 5GB)