turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[ERROR] Worker (pid:25134) was sent SIGKILL! Perhaps out of memory? #556

Open UTSAV-44 opened 1 month ago

UTSAV-44 commented 1 month ago

Hi turboderp!

I am using an A10 GPU with 24 GB of VRAM to run inference on Llama 3. I am running gunicorn with a worker count of 2, but it fails with "Perhaps out of memory?". Only 13 GB of the 24 GB is in use, yet it still reports running out of VRAM.

remichu-ai commented 1 month ago

I think a worker count of 2 will require double the GPU memory: 13 * 2 > 24.
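
As an illustrative sketch, a gunicorn.conf.py that keeps a single copy of the model in VRAM might look like this (the file name and the timeout value are assumptions, not taken from this thread):

    # gunicorn.conf.py (hypothetical): each gunicorn worker is a separate process
    # and loads its own copy of the model, so VRAM usage scales with `workers`.
    workers = 1      # a single worker keeps one copy of the model on the GPU
    timeout = 300    # allow time for slow model loading (illustrative value)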

UTSAV-44 commented 1 month ago

> I think a worker count of 2 will require double the GPU memory: 13 * 2 > 24.

I have observed that only 13 GB is used even when the worker count is 2.

turboderp commented 1 month ago

There is a known issue with safetensors that only shows up on some systems. Windows especially suffers from it, but I've seen it reported on some Linux systems as well. I think it has to do with memory mapping not working properly when you have too many files open at once, or something like that.

There is an option to bypass safetensors when loading models, which can be enabled with either -fst on the command line, setting the EXLLAMA_FASTTENSORS env variable, or setting config.fasttensors = True in Python.
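
For reference, a minimal sketch of the Python route (the model path is a placeholder, and the cache/autosplit lines are only there to make the snippet self-contained):

    # Sketch: bypass safetensors by enabling config.fasttensors before loading.
    # (Alternatively, per the comment above, set the EXLLAMA_FASTTENSORS env
    # variable or pass -fst on the command line.)
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

    config = ExLlamaV2Config()
    config.model_dir = "/path/to/model"   # placeholder path
    config.prepare()                      # reads the model's config.json etc.
    config.fasttensors = True             # must be set before the weights are loaded

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)           # or model.load()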

UTSAV-44 commented 1 month ago

Does it depend on the NVIDIA driver version and CUDA version? At present, Driver Version: 535.183.01 and CUDA Version: 12.2. We are running it on Ubuntu 22.04.

turboderp commented 1 month ago

No, it's an issue with safetensors and/or possibly the OS kernel. Try using one of the options above to see if it helps.

UTSAV-44 commented 1 month ago

I tried setting config.fasttensors = True, but it did not work out. I also tried this on a g4dn.xlarge instance, but the model does not load.

turboderp commented 1 month ago

Can you share the code that fails? The config option has to be set after config.prepare() is called but before model.load().

UTSAV-44 commented 1 month ago

    config = ExLlamaV2Config(model_dir)
    config.fasttensors = True
    self.model = ExLlamaV2(config)

    self.cache = ExLlamaV2Cache_Q4(self.model, max_seq_len=256*96, lazy=True)  
    self.model.load_autosplit(self.cache, progress=True)

    print("Loading tokenizer...")
    self.tokenizer = ExLlamaV2Tokenizer(config)

    self.generator = ExLlamaV2DynamicGenerator(
        model=self.model,
        cache=self.cache,
        tokenizer=self.tokenizer,
    )

    self.generator.warmup()

I am running it on Kubernetes with a g5.xlarge GPU instance.

turboderp commented 1 month ago

I'm not sure there's any way to prevent PyTorch from using a lot of virtual memory. But just out of interest, what do you get from the following?

cat /proc/sys/vm/overcommit_memory
ulimit -v

UTSAV-44 commented 1 month ago

For cat /proc/sys/vm/overcommit_memory I got 1, and for ulimit -v I got unlimited.

turboderp commented 1 month ago

I'm not sure about the implications actually, but I think you might want to try changing the overcommit mode.

sudo sysctl vm.overcommit_memory=0

or

sudo sysctl vm.overcommit_memory=2

:shrug:

brthor commented 2 weeks ago

@turboderp EDIT: The SIGKILL issue I detailed here was caused by serialization of exllama state by the Hugging Face datasets.map() function when the exllama model is pre-initialized, and is unrelated to exllama.

Reducing the cache size appeared to help because the cache state was being serialized.

If anyone else hits this issue, passing new_fingerprint='some_rnd_str' to datasets.map() will prevent the serialization.
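
For anyone who lands here later, a minimal sketch of that workaround (the dataset, the process function, and the fingerprint string are placeholders, not from this thread):

    # Sketch: datasets.map() normally fingerprints the mapped function by pickling it
    # together with everything it closes over; if that closure holds the pre-initialized
    # exllama model/cache, the pickling step can balloon memory. Passing new_fingerprint
    # explicitly skips that hashing.
    from datasets import Dataset

    ds = Dataset.from_dict({"text": ["hello", "world"]})    # placeholder data

    def process(example):
        # ... call into the pre-initialized exllama model here ...
        return {"text": example["text"].upper()}            # placeholder transform

    ds = ds.map(process, new_fingerprint="some_rnd_str")    # bypasses function hashing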