turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

[BUG] test_mmlu.py does not support MoE models. #300

Closed ThomasBaruzier closed 3 weeks ago

ThomasBaruzier commented 5 months ago

Hello,

Here is the error I am facing:

python test_mmlu.py
 -- Loading dataset: cais/mmlu/anatomy...
/home/tyra/files/ai/tabby/env/lib/python3.11/site-packages/datasets/load.py:1429: FutureWarning: The repository for cais/mmlu contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/cais/mmlu
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
 -- Loading dataset: cais/mmlu/computer_security...
 -- Loading dataset: cais/mmlu/formal_logic...
 -- Loading dataset: cais/mmlu/logical_fallacies...
 -- Loading dataset: cais/mmlu/philosophy...
 -- Loading dataset: cais/mmlu/nutrition...
 -- Loading model: /home/tyra/storage/gpu-models/yi-34bx2-moe-60b/2.8bpw
Traceback (most recent call last):
  File "/home/tyra/files/ai/exllamav2/tests/test_mmlu.py", line 141, in <module>
    model, cache, tokenizer = get_model(model_base, variant, gpu_split, 1)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tyra/files/ai/exllamav2/tests/test_mmlu.py", line 61, in get_model
    model_.load(gpu_split_)
  File "/home/tyra/files/ai/tabby/env/lib/python3.11/site-packages/exllamav2/model.py", line 244, in load
    for item in f: return item
  File "/home/tyra/files/ai/tabby/env/lib/python3.11/site-packages/exllamav2/model.py", line 263, in load_gen
    module.load()
  File "/home/tyra/files/ai/tabby/env/lib/python3.11/site-packages/exllamav2/moe_mlp.py", line 56, in load
    self.post_attention_layernorm.load()
  File "/home/tyra/files/ai/tabby/env/lib/python3.11/site-packages/exllamav2/rmsnorm.py", line 23, in load
    w = self.load_weight()
        ^^^^^^^^^^^^^^^^^^
  File "/home/tyra/files/ai/tabby/env/lib/python3.11/site-packages/exllamav2/module.py", line 99, in load_weight
    tensor = self.load_multi(["weight"])["weight"]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tyra/files/ai/tabby/env/lib/python3.11/site-packages/exllamav2/module.py", line 75, in load_multi
    tensors[k] = st.get_tensor(self.key + "." + k).to(self.device())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
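
(As an aside, the FutureWarning at the top of the log is unrelated to the crash; it only says the MMLU loader should pass trust_remote_code=True. A minimal sketch of what that would look like, assuming the standard datasets.load_dataset signature rather than whatever helper test_mmlu.py actually uses:

from datasets import load_dataset

# Load one MMLU subset; trust_remote_code=True silences the FutureWarning
# about the dataset repository containing custom loading code.
anatomy = load_dataset("cais/mmlu", "anatomy", split="test", trust_remote_code=True)
)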

Here are the config variables in the script:

model_base = "/home/tyra/storage/gpu-models/yi-34bx2-moe-60b"

variants = [
    "2.8bpw"
]

gpu_split = (20, 21.3, 24)

qa_set = "cais/mmlu"
qa_split = "test"

categories = \
[
    "anatomy",
    "computer_security",
    "formal_logic",
    "logical_fallacies",
    "philosophy",
    "nutrition",
]

examples_per_category = 3
questions_per_category = 97

This setup works perfectly fine with other, non-MoE models. Also, MoE models work with the test_inference.py script for perplexity evaluation.
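
For reference, the failing load boils down to something like the sketch below (assuming the standard ExLlamaV2 Python loading API; the exact helper inside test_mmlu.py may differ). The paths and the three-way split are taken from the config above.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

model_dir = "/home/tyra/storage/gpu-models/yi-34bx2-moe-60b/2.8bpw"
gpu_split = [20, 21.3, 24]  # GiB reserved per visible CUDA device (three GPUs here)

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
model.load(gpu_split)  # the MoE layers hit "invalid device ordinal" at this point

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)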

turboderp commented 3 weeks ago

Closing as stale. The MMLU test has been completely reworked recently with continuous batching.