turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Possible to load model with low system ram? #245

Open gros87 opened 1 year ago

gros87 commented 1 year ago

Hi,

I'm curious whether it's possible to load a model if you don't have enough system RAM but do have enough VRAM. I have 32 GB of system RAM and 48 GB of VRAM, but unfortunately I'm not able to load a 65B model. I get an error like: RuntimeError: unable to mmap 33484977464 bytes

Is there a way to avoid loading into system RAM first? If not, where would I need to look?

Empor-co commented 1 year ago

Adding another swap worked for me.

turboderp commented 1 year ago

This is a limitation of the safetensors library. It insists on memory-mapping the input tensor file, which means that even though it isn't actually reading more than a little bit at once, it expects to be able to read the whole thing. So it looks like loading a .safetensors file larger than your system RAM just isn't possible right now. I'm going to look into options for allowing sharded model files.
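For illustration, here is a rough sketch of what pre-sharding a checkpoint could look like with the safetensors Python API, assuming it is run once somewhere that can hold the original file (a machine with more RAM, or with extra swap as suggested above). The helper name, file names, and shard size limit are made up for this example; this is not an existing exllama feature.

```python
# Hypothetical helper: split one large .safetensors checkpoint into shards
# small enough to fit in system RAM. The source/destination names and the
# 4 GiB shard limit are illustrative assumptions, not exllama conventions.
from safetensors import safe_open
from safetensors.torch import save_file

def shard_checkpoint(src_path, dst_prefix, max_shard_bytes=4 * 1024**3):
    shard, shard_bytes, shard_idx = {}, 0, 0
    with safe_open(src_path, framework="pt", device="cpu") as f:
        for key in f.keys():
            tensor = f.get_tensor(key)
            size = tensor.numel() * tensor.element_size()
            # Start a new shard once the current one would exceed the limit.
            if shard and shard_bytes + size > max_shard_bytes:
                save_file(shard, f"{dst_prefix}-{shard_idx:05d}.safetensors")
                shard, shard_bytes, shard_idx = {}, 0, shard_idx + 1
            shard[key] = tensor
            shard_bytes += size
    if shard:
        save_file(shard, f"{dst_prefix}-{shard_idx:05d}.safetensors")

# Example invocation (hypothetical file name):
shard_checkpoint("llama-65b-4bit.safetensors", "llama-65b-4bit-shard")
```

Each resulting shard can then be opened and read independently, so no single file has to be mapped whole at load time.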

Narsil commented 10 months ago

This isn't really a limitation of the safetensors format; it's more about internals of torch limiting how mmapping is done.

More info here if you want to bypass those limitations: https://github.com/huggingface/safetensors/issues/373#issuecomment-1829513862
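As a sketch of the general idea only: safe_open can copy tensors to the GPU one at a time instead of materializing the whole checkpoint in CPU RAM first. Whether this actually sidesteps the mmap error above depends on the torch/safetensors details discussed in the linked issue, and the file name below is a placeholder.

```python
# Illustration: stream tensors from a .safetensors file directly to VRAM,
# one tensor at a time, rather than building a full CPU-side state dict.
# This is not the workaround from the linked issue, just the general pattern.
from safetensors import safe_open

def load_to_gpu(path, device="cuda:0"):
    tensors = {}
    with safe_open(path, framework="pt", device=device) as f:
        for key in f.keys():
            # Each tensor is read from the file and placed directly on `device`.
            tensors[key] = f.get_tensor(key)
    return tensors

# Hypothetical file name for the example:
state_dict = load_to_gpu("llama-65b-4bit.safetensors")
```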

erikschul commented 10 months ago

@Narsil Thanks for the technical details. If this is something you or @turboderp would like to support, maybe you could create an issue on the pytorch repo? I would do it myself, but they already have over 12k open issues, so it would probably be overlooked without a precise technical description of what the problem is. It might also be useful for other libraries to have this mechanism, if it's missing in pytorch.