turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License
2.67k stars · 214 forks

Hangs after reboot caused by Triple Fault. #225

Closed · SolsticeProjekt closed 11 months ago

SolsticeProjekt commented 11 months ago

So, I had a BSOD trying to load open_llama_3b_v2-8k-GPTQ, but I'm not convinced that has anything to do with it. There was no error code in the BSOD that I recall, but the reason given was MEMORY_MANAGEMENT.

Now, when I start example_basic, regardless of the model I try to run, it does nothing. When I interrupt using CTRL+C, I get this:

Traceback (most recent call last):
  File "###\exllama\example_basic.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "###\exllama\model.py", line 12, in <module>
    import cuda_ext
  File "###\exllama\cuda_ext.py", line 43, in <module>
    exllama_ext = load(
                  ^^^^^
  File "###\Lib\site-packages\torch\utils\cpp_extension.py", line 1301, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "###\Lib\site-packages\torch\utils\cpp_extension.py", line 1538, in _jit_compile
    baton.wait()
  File "###\Lib\site-packages\torch\utils\file_baton.py", line 42, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt

In Task Manager, it's stuck at 0% CPU and uses 250 MB of RAM.

That, sadly, is all I have. Google wasn't helpful at all, and my assumptions are most likely useless.

Any help is appreciated.

Edit: It appears to be stuck at the imports already. Specifically, it gets stuck at this line: "from model import ExLlama".

EyeDeck commented 11 months ago

Try deleting your torch extension cache; its location varies by OS. I think the defaults are:

  %localappdata%\torch_extensions\torch_extensions\Cache\py3[x]_cu[y]\exllama_ext\ (Windows)
  ~/.cache/torch_extensions/py3[x]_cu[y]/exllama_ext/ (Linux)

The py3[x]_cu[y] part varies with the Python and CUDA versions, e.g. Python 3.11 and CUDA 12.1 = py311_cu121. I've had that happen too, but it was from Ctrl+C-ing as I launched ExLlama, which messed up something in the cache.
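
If it helps, here's a rough Python sketch of that cleanup (my own, untested; the paths are just the defaults quoted above, and setting TORCH_EXTENSIONS_DIR would move the cache elsewhere):

```python
import os
import shutil
from pathlib import Path

def candidate_cache_roots():
    """Yield the default torch extension cache roots (assumed from the paths above)."""
    local = os.environ.get("LOCALAPPDATA")
    if local:  # Windows default
        yield Path(local) / "torch_extensions" / "torch_extensions" / "Cache"
    yield Path.home() / ".cache" / "torch_extensions"  # Linux default

for root in candidate_cache_roots():
    # py3[x]_cu[y] depends on the Python/CUDA versions, so glob over it
    for ext_dir in root.glob("py3*_cu*/exllama_ext"):
        print(f"Removing stale extension cache: {ext_dir}")
        shutil.rmtree(ext_dir)
```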

turboderp commented 11 months ago

I've experienced this as well, on Linux. It can happen if the process fails while Torch is building the extension, which can randomly leave the extension cache in an invalid state that Torch is unable to recover from. I'm not sure where the cache is stored on Windows, and it probably depends on whether it's native or WSL, but search for a folder named exllama_ext containing files like q4_matrix.cuda.o and build.ninja, then delete the whole folder. On the next run ExLlama will take a little while to start as it rebuilds the extension.
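
For what it's worth, the traceback above ends in torch.utils.file_baton.FileBaton.wait(), which keeps sleeping as long as a build lock file left behind by an interrupted build still exists in the build directory. Here's a small diagnostic sketch (my own, not from the repo; the lock file name is an assumption about Torch's internals) that checks a build folder for that situation:

```python
from pathlib import Path

def check_extension_cache(ext_dir: Path) -> None:
    """Report signs of an interrupted torch extension build (hypothetical helper)."""
    lock = ext_dir / "lock"  # assumed name of the file FileBaton waits on
    if lock.exists():
        print(f"Stale build lock: {lock} -- a previous build was interrupted; delete the folder.")
    if not (ext_dir / "build.ninja").exists():
        print(f"No build.ninja in {ext_dir} -- the cached build looks incomplete.")

# Example (path assumed from the defaults mentioned above):
# check_extension_cache(Path.home() / ".cache/torch_extensions/py311_cu121/exllama_ext")
```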

You can also change verbose = False to verbose = True at the top of cuda_ext.py, which will give you a bunch of output from that build process. If startup still hangs, this might help pin down where it's getting stuck.
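
For reference, that toggle amounts to something like the following (paraphrased sketch; the real cuda_ext.py passes a longer source list and more arguments to load()):

```python
from torch.utils.cpp_extension import load

verbose = True  # was: verbose = False -- makes Torch print the full ninja/compiler log

exllama_ext = load(
    name="exllama_ext",
    sources=["exllama_ext/exllama_ext.cpp"],  # placeholder; not the actual file list
    verbose=verbose,
)
```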

SolsticeProjekt commented 11 months ago

Thanks to both of you. Deleting the respective cache folder did the trick. That also means that CUDA, in this case 12.1, caused the triple fault.

On the next run ExLlama will take a little while to start as it rebuilds the extension.

Indeed. After deleting the folder, the first run took 24 seconds; the next run was around 8 seconds.

Thanks! (also you're all awesome for being smarter than me, I wish I could catch up. Holy shit, it's so much!)