Closed JoanFM closed 9 months ago
Hey 🤗 thanks for opening an issue! I am not sure you can prevent CPU usage (the transfer goes from SSD to CPU to GPU); I am not aware of anything that supports skipping it. However, device_map = "auto" should always allow you to load the model without going over your available RAM.
The peak can come from torch.set_default_dtype(torch.float16) and the fact that you are not specifying a dtype: the model might be loaded in float32, then cast, then transferred.
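For reference, a rough back-of-the-envelope sketch of why the numbers line up, assuming the ~4Gi GPU footprint is the weights stored in float16:

```python
# Back-of-the-envelope check (assumption: the 4 GiB GPU footprint is the
# weights in float16, i.e. 2 bytes per parameter).
gpu_fp16_bytes = 4 * 1024**3
num_params = gpu_fp16_bytes / 2       # ~2.1e9 parameters implied
cpu_fp32_bytes = num_params * 4       # same weights materialized in float32
print(cpu_fp32_bytes / 1024**3)       # ~8.0 GiB, consistent with the >7 GiB RAM peak
```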
So what would you actually suggest doing? What dtype parameter should I pass?
float16 or something like that. Or use TEI: https://huggingface.co/docs/text-embeddings-inference/index
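A minimal sketch of what that suggestion looks like in code (the checkpoint id is a placeholder and AutoModel is assumed; adapt to whatever class you actually load):

```python
import torch
from transformers import AutoModel

# Specifying torch_dtype avoids first materializing the weights in float32 on
# the CPU and casting them afterwards; device_map="auto" (requires accelerate)
# then places the half-precision weights on the GPU.
model = AutoModel.from_pretrained(
    "your-org/your-model",      # placeholder checkpoint id
    torch_dtype=torch.float16,
    device_map="auto",
)
```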
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Thanks for helping, it was indeed an issue with the dtype.
System Info
transformers version: 4.36.2

Who can help?
I am using transformers to load a model into GPU, and I observed that before moving the model to the GPU there is a peak of RAM usage that later goes unused. I assume the model is loaded into CPU memory before being moved to the GPU.
On the GPU the model takes around 4Gi, yet loading it needs more than 7Gi of RAM, which seems weird.
Is there a way to load it directly to the GPU without spending so much RAM?
I have tried with the low_cpu_mem_usage parameter and with device_map set to cuda and to auto, but no luck.

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
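(No snippet was attached; the following is a hypothetical sketch of the loading pattern described above, with a placeholder checkpoint id.)

```python
from transformers import AutoModel

# Loading as described: even with low_cpu_mem_usage and device_map set,
# RAM usage peaks well above the ~4 GiB the model occupies on the GPU.
model = AutoModel.from_pretrained(
    "your-org/your-model",      # placeholder checkpoint id
    low_cpu_mem_usage=True,
    device_map="auto",          # "cuda" was also tried, with the same peak
)
```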
Expected behavior
Not having such a memory peak