turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Reserved ram not freed? #347

Closed Celppu closed 2 weeks ago

Celppu commented 4 months ago

I have been trying to play with the examples and maybe use this instead of llama.cpp.

Problem:

After closing, for example, minimal_chat.py, the model stays in RAM? The space won't get freed without a restart. GPU memory is freed.

System: Windows 10, WSL 2, Ubuntu 22

turboderp commented 4 months ago

Model should definitely not stay in RAM. You can call model.unload() to make sure any unmanaged resources are freed up, although those are limited to some really small structures of maybe a few kB in total.

Generally speaking, Python will hold on to any object in memory as long as it's referenced somewhere, so you have to make sure to clear any references to the model, cache, generator etc. if you want to fully unload the model. Even then, the garbage collector won't free the memory immediately (unless it needs to).

You can force garbage collection by calling gc.collect(), and torch.cuda.empty_cache() will clear PyTorch's tensor cache, freeing up some VRAM for other processes. In principle, though, this won't give you more RAM or VRAM to work with within the PyTorch process itself.
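
Put together, the full teardown looks roughly like this (a sketch, assuming model, cache and generator were created as in the examples):

import gc

import torch

model.unload()            # free exllamav2's small unmanaged allocations

del generator             # drop every reference so the objects can be collected
del cache                 # the cache holds most of the VRAM
del model

gc.collect()              # collect now instead of waiting for the GC
torch.cuda.empty_cache()  # hand PyTorch's cached VRAM back to the driver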

As for VRAM, PyTorch will allocate a bunch of it for its own use when it's first imported and used, and you can't get this back without ending the process. It's a one-time allocation, though, so if the point is to unload a model and load another one, it shouldn't be an issue. If you want to load a model, do some work and then get back all the RAM and VRAM that PyTorch grabbed, you'll need to do all that in an isolated process due to how Torch is designed.
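
If you do need all of that memory back, one pattern is to isolate the work in a child process; a rough sketch, where the worker function is hypothetical and stands in for whatever loading and generation the script does:

import multiprocessing as mp

def worker(prompt, queue):
    # Hypothetical worker: import torch / exllamav2 and load the model here,
    # inside the child process, so PyTorch's one-time allocations belong to
    # this process and die with it.
    result = f"(generated text for: {prompt})"   # placeholder for real inference
    queue.put(result)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")                # spawn avoids inheriting CUDA state
    queue = ctx.Queue()
    proc = ctx.Process(target=worker, args=("Hello", queue))
    proc.start()
    print(queue.get())
    proc.join()                                  # RAM and VRAM are released when the child exits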

Celppu commented 4 months ago

For me, closing minimal_chat doesn't free the WSL memory; usage stays high and I need to run wsl --shutdown to free it. I tried adding this to minimal_chat, but the RAM is still not freed. Might be a WSL issue?



import atexit
import gc

import torch

def exit_handler():
    global model
    print('Closing!')
    model.unload()            # free exllamav2's unmanaged resources
    gc.collect()              # collect anything no longer referenced
    torch.cuda.empty_cache()  # return cached VRAM to the driver

atexit.register(exit_handler)

turboderp commented 4 months ago

You have to make sure you get rid of all references to model. Both cache and generator will keep references (and cache will reserve a lot of VRAM on its own), so you have to clear those references as well before the gc.collect().
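
Applied to the exit handler above, that means dropping the other globals too before collecting (a sketch, assuming cache and generator are module-level globals in the script):

def exit_handler():
    global model, cache, generator
    print('Closing!')
    model.unload()
    del generator, cache, model   # the cache and generator keep the model alive
    gc.collect()
    torch.cuda.empty_cache()

atexit.register(exit_handler)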

You can also try this function to list any tensors currently allocated by Torch. Sadly there's no way to identify the tensors except by their dimensions, but it should give you some idea anyway:

import gc

import torch

def list_live_tensors():
    # Collect first so only genuinely live tensors are counted
    tensors = {}
    gc.collect()
    torch.cuda.empty_cache()

    # Group live tensors by (shape, dtype, device), since that's all there
    # is to identify them by
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
                d = str(obj.size()) + ", " + str(obj.dtype) + ", " + str(obj.device)
                if d in tensors:
                    tensors[d] += 1
                else:
                    tensors[d] = 1
        except Exception:
            pass

    print("-----------")
    for k, v in tensors.items():
        print(f"{v} : {k}")

tau0-deltav commented 4 months ago

@Celppu What makes you think you're running out of memory? What are you measuring your system's memory with? That you're using WSL is kind of a hint that you've maybe got the wrong end of the stick here, no offense.

Can you describe some of the issues this has caused you?

turboderp commented 2 weeks ago

Closing this as stale. Note that the fasttensors config option is now supported on Windows; it isn't any faster there, but it does fix some issues with safetensors and memory-mapping that are specific to Windows.
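
For reference, enabling it is just a config flag set before loading; a sketch, assuming the option is still exposed as ExLlamaV2Config.fasttensors and with the model path as a placeholder:

from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"   # placeholder path
config.prepare()
config.fasttensors = True             # load safetensors without memory-mapping

model = ExLlamaV2(config)
model.load()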