turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Very slow mystery memory leak #187

Closed: Midaychi closed this issue 1 year ago

Midaychi commented 1 year ago

Using 0cc4m's KoboldAI branch with exllama to host a 7B v2 worker. Over the span of thousands of generations, VRAM usage gradually creeps up by a percent or two at a time until it OOMs (or, on newer drivers, bloats into shared memory).

I have to kill the Python process and restart to empty the VRAM; a chunk of mystery VRAM usage is left over even after the model is unloaded. (I know this second part also happens in text-generation-webui.)
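For reference, a rough sketch of the cleanup that should release VRAM without killing Python, assuming a PyTorch-backed loader; `model_holder` and its attributes are hypothetical stand-ins for whatever object actually owns the model and cache:

```python
import gc
import torch

def unload_model(model_holder):
    # Drop all references to the model and its KV cache before collecting.
    # The real attribute names depend on the loader being used.
    model_holder.model = None
    model_holder.cache = None

    gc.collect()              # free Python-side references
    torch.cuda.empty_cache()  # return cached blocks to the driver

    # If memory_allocated() is near zero but nvidia-smi still shows usage,
    # the remainder is CUDA context overhead or memory held elsewhere.
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```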

It was suggested this was context growth, but I made sure to test it a few times by loading with the context buffer at its maximum, and the memory is fairly stable (at least when self-inferencing at a small scale).
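A minimal sketch of how the growth could be logged per generation to pin this down, again assuming a PyTorch-backed worker; `generate_once` is a hypothetical stand-in for whatever generation call the worker makes:

```python
import torch

def log_vram(step, log_path="vram_log.csv"):
    # Append allocated vs. reserved bytes so gradual growth over
    # thousands of generations shows up clearly in the log.
    alloc = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    with open(log_path, "a") as f:
        f.write(f"{step},{alloc},{reserved}\n")

# inside the worker loop, after each request:
# output = generate_once(prompt)   # hypothetical generation call
# log_vram(step)
```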

Shrug? Not a very helpful error report, sorry.

turboderp commented 1 year ago

Has this behavior changed at all with any of the updates in the past two weeks? Context shouldn't matter, but there's a possibility it's down to memory fragmentation. Very hard thing to debug, though.
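One way to tell a true leak apart from fragmentation, sketched under the assumption of the standard PyTorch caching allocator: if `allocated` stays flat while `reserved` keeps climbing, the growth is allocator cache/fragmentation rather than leaked tensors.

```python
import torch

def report_fragmentation(device=0):
    stats = torch.cuda.memory_stats(device)
    allocated = stats["allocated_bytes.all.current"]  # bytes in live tensors
    reserved = stats["reserved_bytes.all.current"]    # bytes held by the caching allocator
    slack = reserved - allocated                      # cached but unused: fragmentation candidate
    print(f"allocated {allocated / 2**20:8.1f} MiB | "
          f"reserved {reserved / 2**20:8.1f} MiB | "
          f"slack {slack / 2**20:8.1f} MiB")
    # torch.cuda.memory_summary(device) gives a more detailed breakdown.
```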

Midaychi commented 1 year ago

The only way I could use exllama on Horde was with 0cc4m's KoboldAI branch, but he's been busy with other projects, and Henky decided to drop plans to officially support exllama in the united branch. Exllama is also banned on Kobold Horde now, and workers spotted running it get put into maintenance. Shrug.

So I suppose this issue is no longer relevant.

I'll close this issue and test exllama in text-generation-webui; I'll reopen it if I can reproduce this behavior in any usefully repeatable fashion.