Closed JoanFM closed 9 months ago
Hey 🤗 thanks for opening an issue! I am not sure you can prevent CPU usage (the transfer goes from SSD to CPU to GPU); I am not aware of anything that supports skipping it. However, device_map = "auto" should always allow you to load the model without going over your available RAM.
The peak can come from torch.set_default_dtype(torch.float16) and the fact that you are not specifying a dtype: the model might be loaded in float32, then cast, then transferred.
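For reference, a rough back-of-the-envelope sketch of why the numbers line up, assuming the ~4Gi GPU footprint is the weights stored in float16:

```python
# Back-of-the-envelope check (assumption: the 4 GiB GPU footprint is the
# weights in float16, i.e. 2 bytes per parameter).
gpu_fp16_bytes = 4 * 1024**3
num_params = gpu_fp16_bytes / 2       # ~2.1e9 parameters implied
cpu_fp32_bytes = num_params * 4       # same weights materialized in float32
print(cpu_fp32_bytes / 1024**3)       # ~8.0 GiB, consistent with the >7 GiB RAM peak
```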
So what would you actually suggest doing? What dtype parameter should I pass?
float16 or something like that. Or use TEI: https://huggingface.co/docs/text-embeddings-inference/index
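A minimal sketch of what that suggestion looks like in code (the checkpoint id is a placeholder and AutoModel is assumed; adapt to whatever class you actually load):

```python
import torch
from transformers import AutoModel

# Specifying torch_dtype avoids first materializing the weights in float32 on
# the CPU and casting them afterwards; device_map="auto" (requires accelerate)
# then places the half-precision weights on the GPU.
model = AutoModel.from_pretrained(
    "your-org/your-model",      # placeholder checkpoint id
    torch_dtype=torch.float16,
    device_map="auto",
)
```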
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Thanks for helping, it was indeed an issue with the dtype.
System Info
transformers version: 4.36.2

Who can help?
I am using transformers to load a model into GPU, and I observed that before moving the model to the GPU there is a peak of RAM usage that later goes unused. I assume the model is loaded into CPU memory before being moved to the GPU.
On the GPU the model takes around 4Gi, yet loading it needs more than 7Gi of RAM, which seems weird.
Is there a way to load it directly to the GPU without spending so much RAM?
I have tried with the low_cpu_mem_usage parameter and with device_map set to cuda and to auto, but no luck.

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
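(No snippet was attached; the following is a hypothetical sketch of the loading pattern described above, with a placeholder checkpoint id.)

```python
from transformers import AutoModel

# Loading as described: even with low_cpu_mem_usage and device_map set,
# RAM usage peaks well above the ~4 GiB the model occupies on the GPU.
model = AutoModel.from_pretrained(
    "your-org/your-model",      # placeholder checkpoint id
    low_cpu_mem_usage=True,
    device_map="auto",          # "cuda" was also tried, with the same peak
)
```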
Expected behavior
Not having such a memory peak