LlamaEnjoyer opened 3 weeks ago
Also running evaluate_functional_correctness humaneval_output.json
gives results such as
{'pass@1': 0.0, 'pass@10': 0.0}
for both Llama & Mistral.
However, I narrowed it down to HumanEval relying on signal.setitimer, which is available on UNIX systems only:
{"task_id": "HumanEval/2", "completion": " return number - int(number)", "result": "failed: module 'signal' has no attribute 'setitimer'", "passed": false}
I guess no HumanEval results grading for us Windows folks unless we go down the WSL2 road XD
That looks like a valid result, so I assume you're right about the evaluation not working under Windows. Perhaps you could run it in WSL.
As for where it fails while running the test, I would suspect it's a bug in the tokenizers library, maybe? At least I can't think of another reason it would fail while creating the jobs, because nothing that happens during that process differs between Mistral and Llama3, except that Mistral uses sentencepiece instead.
You should be able to use Mistral with a tokenizer.json file instead of tokenizer.model, which would cause ExLlama to rely on tokenizers instead of sentencepiece. That may still not fail, though, since Llama3 is a tiktoken model, so it's going to be a different code path regardless. I'll try and see if I can replicate the fault under Windows. Thanks for the detailed feedback.
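For what it's worth, a rough way to see which code path a given quant would take is to check which tokenizer files the model directory ships. This is only an illustrative sketch of the rule described above (tokenizer.model → sentencepiece, tokenizer.json alone → tokenizers), not ExLlama's actual selection logic:

```python
import os

def guess_tokenizer_backend(model_dir):
    # Sketch only: mirrors the description above, where a tokenizer.model
    # file means the sentencepiece path is used, and tokenizer.json alone
    # means the HF `tokenizers` path is used.
    if os.path.isfile(os.path.join(model_dir, "tokenizer.model")):
        return "sentencepiece"
    if os.path.isfile(os.path.join(model_dir, "tokenizer.json")):
        return "tokenizers"
    return "unknown"
```

Renaming tokenizer.model out of the Mistral directory should then flip it onto the tokenizers path.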
No problemo, it is I who should be thanking you for taking the time to troubleshoot this :) Some updates:
I managed to fix the evaluation with help from GPT4o by replacing signal.setitimer with the threading & multiprocessing modules (it's unrelated to your project, just FYI)
I also managed to reproduce this OOM with Mistral; it just took more than 10 samples per task. Actually, it seems to crash with more than 25:
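The setitimer replacement mentioned in the first update could be sketched roughly like this: run the check in a subprocess and kill it on timeout, which works on Windows where signal.setitimer doesn't exist. The function names here are mine, not the actual patch:

```python
import multiprocessing
import time


def _worker(queue, func, args):
    # Run the candidate code in a child process and ship the result back.
    try:
        queue.put(("ok", func(*args)))
    except BaseException as e:
        queue.put(("error", repr(e)))


def run_with_timeout(func, args=(), timeout=3.0):
    # Cross-platform stand-in for the setitimer-based guard: if the child
    # outlives the timeout, terminate it instead of raising via a signal.
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(queue, func, args))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return ("timeout", None)
    return queue.get()


def fast(x):
    return x * 2


def slow(x):
    time.sleep(10)
    return x
```

Note that on Windows multiprocessing uses the spawn start method, so the functions passed in have to be picklable (module-level, no lambdas).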
You say OOM... can you actually see memory usage go up in Task Manager or whatever as it's creating the jobs? It might be some sort of memory leak in PyTorch, then.
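To eyeball that without Task Manager, something like this could log process memory around the job-creation step. The resource module is UNIX-only, so on Windows psutil's Process().memory_info().rss would be the stand-in, and the loop body here is just a placeholder:

```python
import resource  # UNIX-only; use psutil for a cross-platform equivalent


def peak_rss_mb():
    # Peak resident set size of this process so far. ru_maxrss is
    # reported in KB on Linux (bytes on macOS), so this is Linux-flavoured.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


before = peak_rss_mb()
jobs = [bytearray(4096) for _ in range(1000)]  # placeholder for job creation
after = peak_rss_mb()
print(f"peak RSS grew by {after - before:.1f} MB")
```

A steady climb during job creation would point at a leak rather than a one-off allocation spike.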
I've been experiencing some possibly related errors in an unrelated project, using the HF tokenizer for Sentencepiece models. I get inexplicable segfaults when I run more than about 80,000 encoding operations in a row, and this is without any ExLlama code. I don't seem to be getting the same errors when using non-Sentencepiece models. I think it might be related since most of what happens in the loop that fails for you is just calling the tokenizer thousands of times.
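A minimal stress loop along these lines could check whether that crash reproduces outside ExLlama entirely. Here `tokenizer` stands in for an HF tokenizer loaded for a sentencepiece model, which is an assumption about the failing setup, and the sample text is arbitrary:

```python
def stress_encode(tokenizer, n=100_000, text="def truncate_number(x): ..."):
    # Hammer encode() the way the job-creation loop does; the segfault
    # described above appeared somewhere past ~80,000 calls.
    for i in range(1, n + 1):
        tokenizer.encode(text)
        if i % 10_000 == 0:
            print(f"{i} encodes OK")
    return n
```

With a real model, `tokenizer` would come from something like transformers' AutoTokenizer.from_pretrained(model_dir).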
So, could you possibly try with a model that doesn't have a tokenizer.model file? Like maybe Llama3-8B? It could help narrow it down. Sorry, just remembered you tried this already. :( Could be worth trying to downgrade tokenizers, though?
Same specs, same issue
Evaluation runs to the finish when -spt is set to 7 or less, though (Windows 11, 64GB RAM, RTX 4070 Ti Super 16GB VRAM, ExLlamaV2 0.1.4 from the dev branch). This happens with the https://huggingface.co/turboderp/Llama-3-8B-Instruct-exl2/tree/6.0bpw quant.
Windows Event Viewer shows this:
However, evaluating Mistral works just fine (tested against this quant: https://huggingface.co/turboderp/Mistral-7B-v0.2-exl2/tree/6.0bpw)
Here's what I tried:
List of installed packages in my system: