turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Running humaneval against llama-3-8b-instruct exl2 quant results in a silent OOM when samples per task > 7 #496

Open LlamaEnjoyer opened 3 weeks ago

LlamaEnjoyer commented 3 weeks ago

Evaluation runs to completion when -spt is set to 7 or less, though (Windows 11, 64GB RAM, RTX 4070 Ti Super 16GB VRAM, ExLlamaV2 0.1.4 from the dev branch). This happens with the https://huggingface.co/turboderp/Llama-3-8B-Instruct-exl2/tree/6.0bpw quant.

Windows Event Viewer shows this:

Faulting application name: python.exe, version: 3.11.9150.1013, time stamp: 0x660bda91
Faulting module name: c10.dll, version: 0.0.0.0, time stamp: 0x66145ad7
Exception code: 0xc0000005
Fault offset: 0x0000000000063064
Faulting process id: 0x0x4FC
Faulting application start time: 0x0x1DAB8C46E059709
Faulting application path: C:\Users\xxx\AppData\Local\Programs\Python\Python311\python.exe
Faulting module path: C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\lib\c10.dll
Report Id: aee8a1bf-1ff5-46e8-aadd-120bb5d65edb
Faulting package full name: 
Faulting package-relative application ID:

However, evaluating Mistral works just fine (tested against this quant: https://huggingface.co/turboderp/Mistral-7B-v0.2-exl2/tree/6.0bpw).

Here's what I tried:

List of installed packages in my system:

Package            Version
------------------ ------------
aiohttp            3.9.5
aiosignal          1.3.1
attrs              23.2.0
blinker            1.8.2
certifi            2024.2.2
charset-normalizer 3.3.2
click              8.1.7
colorama           0.4.6
cramjam            2.8.3
datasets           2.19.1
dill               0.3.8
einops             0.8.0
exllamav2          0.1.4
fastparquet        2024.5.0
filelock           3.13.1
fire               0.6.0
flash-attn         2.5.9.post1
Flask              3.0.3
frozenlist         1.4.1
fsspec             2024.2.0
huggingface-hub    0.23.1
human-eval         1.0.3
idna               3.7
intel-openmp       2021.4.0
itsdangerous       2.2.0
Jinja2             3.1.3
markdown-it-py     3.0.0
MarkupSafe         2.1.5
mdurl              0.1.2
mkl                2021.4.0
mpmath             1.3.0
multidict          6.0.5
multiprocess       0.70.16
networkx           3.2.1
ninja              1.11.1.1
numpy              1.26.4
packaging          24.0
pandas             2.2.2
pillow             10.2.0
pip                24.0
pyarrow            16.1.0
pyarrow-hotfix     0.6
Pygments           2.18.0
pynvml             11.5.0
python-dateutil    2.9.0.post0
pytz               2024.1
PyYAML             6.0.1
regex              2024.5.15
requests           2.32.2
rich               13.7.1
safetensors        0.4.3
sentencepiece      0.2.0
setuptools         65.5.0
six                1.16.0
sympy              1.12
tbb                2021.11.0
termcolor          2.4.0
tokenizers         0.19.1
torch              2.3.1+cu121
torchaudio         2.3.1+cu121
torchvision        0.18.1+cu121
tqdm               4.66.4
typing_extensions  4.9.0
tzdata             2024.1
urllib3            2.2.1
waitress           3.0.0
websockets         12.0
Werkzeug           3.0.3
wheel              0.43.0
xxhash             3.4.1
yarl               1.9.4
LlamaEnjoyer commented 3 weeks ago

Also running evaluate_functional_correctness humaneval_output.json gives results such as {'pass@1': 0.0, 'pass@10': 0.0} for both Llama & Mistral.

However, I narrowed it down to humaneval relying on signal.setitimer, which is available on UNIX systems only: {"task_id": "HumanEval/2", "completion": " return number - int(number)", "result": "failed: module 'signal' has no attribute 'setitimer'", "passed": false}
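A quick way to confirm that on any machine (a minimal check, nothing human-eval-specific):

```python
import signal

# human-eval's timeout relies on signal.setitimer, which CPython only exposes
# on UNIX; on Windows this prints False and the check fails as above.
print(hasattr(signal, "setitimer"))
```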

I guess there's no humaneval results grading for us Windows folks unless we go down the WSL2 road XD

turboderp commented 2 weeks ago

That looks like a valid result, so I assume you're right about the evaluation not working under Windows. Perhaps you could run it in WSL.

As for where it fails while running the test, I would suspect it's a bug in the tokenizers library, maybe? At least I can't think of another reason it would fail while creating the jobs, because nothing that happens during that process differs between Mistral and Llama3, except that Mistral uses sentencepiece instead.

You should be able to use Mistral with a tokenizer.json file instead of tokenizer.model, which would cause ExLlama to rely on tokenizers instead of sentencepiece. That may still not fail, though, since Llama3 is a tiktoken model, so it's going to be a different code path regardless. I'll try and see if I can replicate the fault under Windows. Thanks for the detailed feedback.
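If you want to double-check which code path a given model directory will take, just look at which tokenizer files it ships (a minimal sketch; model_dir is a placeholder):

```python
import os

# Per the above: tokenizer.model -> sentencepiece path,
# tokenizer.json only -> HF tokenizers path.
model_dir = "path/to/model"  # placeholder
for name in ("tokenizer.model", "tokenizer.json"):
    present = os.path.exists(os.path.join(model_dir, name))
    print(f"{name}: {'found' if present else 'missing'}")
```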

LlamaEnjoyer commented 2 weeks ago

No problemo, it is I who should be thanking you for taking the time to troubleshoot this :) Some updates:

  1. I managed to fix the evaluation with help from GPT-4o by replacing signal.setitimer with the threading & multiprocessing modules (this is unrelated to your project, just FYI; see the sketch after this list).

  2. I also managed to reproduce this OOM with Mistral; it just took more than 10 samples per task. Actually it seems to crash at more than 25 samples per task (screenshot attached).
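For reference, the timeout replacement from (1) was roughly along these lines; a rough sketch of the idea only, not the exact patch, and the helper names are made up:

```python
# Run each candidate completion in a separate process and kill it on timeout,
# instead of using human-eval's signal.setitimer-based time limit.
import multiprocessing

def _run_check(code: str, result: dict) -> None:
    try:
        exec(code, {})          # the real harness runs the candidate + its tests here
        result["passed"] = True
    except BaseException as exc:
        result["error"] = repr(exc)

def check_with_timeout(code: str, timeout: float = 3.0) -> dict:
    manager = multiprocessing.Manager()
    result = manager.dict()
    proc = multiprocessing.Process(target=_run_check, args=(code, result))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():         # still running -> treat as a timeout
        proc.terminate()
        proc.join()
        return {"passed": False, "error": "timed out"}
    return dict(result)
```

On Windows the calling code also needs the usual if __name__ == "__main__": guard, since multiprocessing uses spawn there.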

turboderp commented 2 weeks ago

You say OOM... can you actually see memory usage go up in Task Manager or whatever as it's creating the jobs? It might be some sort of memory leak in PyTorch, then.
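If it's easier than watching Task Manager, you could log it from inside the script around the job-creation loop; a minimal sketch, assuming psutil is installed (it isn't in your package list above):

```python
import os
import psutil
import torch

_proc = psutil.Process(os.getpid())

def log_mem(tag: str = "") -> None:
    rss_gb = _proc.memory_info().rss / 1024**3           # host RAM used by this process
    vram_gb = torch.cuda.memory_allocated() / 1024**3    # VRAM allocated by PyTorch
    print(f"[{tag}] RSS {rss_gb:.2f} GiB | torch VRAM {vram_gb:.2f} GiB")
```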

turboderp commented 2 weeks ago

I've been experiencing some possibly related errors in an unrelated project, using the HF tokenizer for Sentencepiece models. I get inexplicable segfaults when I run more than about 80,000 encoding operations in a row, and this is without any ExLlama code. I don't seem to be getting the same errors when using non-Sentencepiece models. I think it might be related since most of what happens in the loop that fails for you is just calling the tokenizer thousands of times.

So, could you possibly try with a model that doesn't have a tokenizer.model file? Like maybe Llama3-8B? It could help narrow it down. Sorry, just remembered you tried this already. :( Could be worth trying to downgrade tokenizers, though?
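If you feel like poking at the tokenizers theory in isolation, this is roughly what I'm doing when I hit the segfaults, with no ExLlama code involved (the path is a placeholder):

```python
# Hammer the HF tokenizers backend on its own; if repeated encodes alone
# eventually segfault, the bug is in that library rather than in exllamav2.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("path/to/tokenizer.json")
for i in range(200_000):
    tok.encode("def truncate_number(number: float) -> float:")
    if i % 10_000 == 0:
        print(i, flush=True)
```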

ipechman commented 12 hours ago

Same specs, same issue