turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Perplexity Data Format/Testing Data Question #41

Closed lhl closed 1 year ago

lhl commented 1 year ago

I was trying to do an apples-to-apples shootout on GPTQ vs the new llama.cpp k-quants (memory usage, speed, etc.) but ran into a bump with perplexity. It looks like exllama loads a jsonl-formatted version of wikitext-2's wiki.valid.raw (not the wiki.test.raw that is typically used for testing)?

Just wondering if there's a preformatted jsonl of the rest of wikitext-2 already. Is the format just literally chunking every line into a "text" object?
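
If it really is just that, I'd expect the conversion to be trivial - something along these lines (the paths and the blank-line filter here are placeholders/guesses on my part, not exllama's actual files):

```python
import json

# Hypothetical conversion: one {"text": ...} JSON object per line of the raw file.
# Input/output paths are placeholders, not exllama's actual files.
with open("wiki.valid.raw", encoding="utf-8") as fin, \
        open("testdata.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if line:  # skip blank lines
            fout.write(json.dumps({"text": line}) + "\n")
```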

turboderp commented 1 year ago

It's literally just this dataset from HF, and I'm using a small portion of it (first 100 entries longer than 50 chars.) I didn't put much thought into it as the only purpose was to check that the models were working at all, and that they got more accurate with larger models.

So I'm not sure if there's a standard test set people have settled on or a standard sequence length etc. I just never got around to mimicking the exact procedure used in GPTQ-for-Llama etc. But from a quick glance at llama_eval() in llama.py from the main branch, it does seem like it's using the test split from wikitext2 (or ptb or c4), on 2048-token chunks of the whole split, concatenated. If I'm not mistaken.
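
For reference, the gist of that, as I read it, would be something like the following sketch (using the HF datasets library; the dataset/config name is what I believe they use, and the two-linefeed join is an assumption on my part):

```python
from datasets import load_dataset

# Grab the raw wikitext-2 test split from the HF hub and join it into one long text.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = "\n\n".join(test["text"])

# From here you would tokenize `text` with the model's tokenizer and walk over it
# in consecutive 2048-token chunks, computing cross-entropy on each one.
print(f"{len(test)} rows, {len(text)} characters in the concatenated test split")
```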

lhl commented 1 year ago

OK, I've just done a first pass adding -ppl-raw that should chunk text and calculate perplexity in (I believe) a somewhat standard way: https://github.com/turboderp/exllama/compare/master...lhl:exllama:master

I can submit a PR if you'd like, should be a pretty clean addition and I tried to adhere to the style of the rest of the file.

| Model | Params | Quant | Bits | Groupsize | Act-order | -ppl | wikitext-2 | Notes |
|---|---|---|---|---|---|---|---|---|
| Manticore Chat Pyg | 13b | GPTQ | 4 | 128 | no | 5.7170 | 6.1343 | |
| Nous Hermes | 13b | GPTQ | 4 | 128 | no | 6.5777 | 6.8394 | |
| WizardLM Uncensored Supercot Storytelling | 30b | GPTQ | 4 | None | no | 6.9122 | 7.958 | |
| Guanaco SuperCOT | 30b | GPTQ | 4 | 128 | no | 5.0738 | 5.3853 | |
| Guanaco 33B Act Order | 30b | GPTQ | 4 | None | no | 4.9240 | 5.0438 | |
| Guanaco 33B Act Order | 30b | GPTQ | 4 | None | yes | 4.9245 | 5.0436 | This is an act-order quantize, but basically no score difference |

I tried a few models I had lying around. BTW, for act-order, it didn't seem to read the config.json correctly - I had to just force it in code - but surprisingly, there was almost no difference in accuracy.

You can see that using the whole wikitext-2-raw test file, perplexity scores are worse across the board than with the testdata.jsonl, but it's probably more representative (since it's not just predicting the last word of each entry, but predicting at essentially arbitrary positions determined by the token chunking). They do seem to line up relative to each other, though.

turboderp commented 1 year ago

It isn't just looking at the last word normally, it does cross entropy on all tokens in the test sequence. Causal attention ensures that processing the sequence gives the same logits at each position as predicting all the tokens individually.

Of course, the reason I didn't just concatenate all the examples together like that is that you end up asking it to predict a whole sequence that's pieced together from unrelated articles, which I don't think really tells you much about the model. But it does look like this is the standard way to do it, so we just need to make sure it's the same dataset/split everyone else uses. I take it it's this one, specifically? Did you build the raw file yourself from a dataloader, or is there a jsonl file floating around somewhere?

I'm not sure about the overlap. Going by GPTQ-for-LLaMa, it seems like by default it's grabbing 128 samples of 2048 tokens each with no overlap:

```python
for i in range(nsamples):
    batch = testenc[:, (i * model.seqlen):((i + 1) * model.seqlen)].to(dev)
    try:
        model(batch)
```

Unless overlap is more standard than that, idk.
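
If that reading is right, the whole eval boils down to something like this rough sketch (not their literal code: `testenc` stands for the tokenized test split as a [1, n_tokens] tensor, and I'm assuming the forward pass returns raw logits):

```python
import torch
import torch.nn.functional as F

def eval_perplexity(model, testenc, nsamples=128, seqlen=2048, dev="cuda:0"):
    total_nll = 0.0
    for i in range(nsamples):
        # Non-overlapping 2048-token windows over the concatenated test split.
        batch = testenc[:, i * seqlen:(i + 1) * seqlen].to(dev)
        with torch.no_grad():
            logits = model(batch)  # assumed shape: [1, seqlen, vocab]
        # Cross-entropy over every position in the chunk: thanks to causal attention,
        # the logits at position i are the model's prediction for token i+1 given
        # only tokens 0..i, so one forward pass scores the whole chunk.
        nll = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)).float(),
                              batch[:, 1:].reshape(-1))
        total_nll += nll.item()
    # All chunks are the same length, so the mean of the per-chunk means is the overall
    # mean cross-entropy; perplexity is just its exponential.
    return torch.exp(torch.tensor(total_nll / nsamples))
```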

As for the act-order stuff, it's still unclear to me how that Guanaco model is act-order. It doesn't have a group index, which means that even if it is act-order there's no way to use it any differently than a no-act-order model. The value in the ExLlamaConfig isn't being used for evaluation, it's just there to report what the model has been identified as, based on the presence of the group index. Forcing a value has no effect on how the forward pass goes.

Mind you, this is all made by reverse engineering since it isn't exactly a well-documented format, so it's possible I overlooked some subtle detail in how you're supposed to treat an act-order model. But I'm not seeing it in any of the other implementations. They all just use the group index if it's there.

lhl commented 1 year ago

Good to know about how the sequences work. I'm not an ML guy and my eyes just glaze over whenever I start seeing math formulas, so probably missing a lot of the finer details on how things work.

For data, I'm using the original wikitext-2-raw dataset from here: https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/

Here's what llama.cpp is doing: https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/perplexity.cpp#L24 - it looks like they default to 512 for the context and batch size, without overlap. So maybe there's not actually a standard; I thought I saw one in https://huggingface.co/docs/transformers/perplexity but that's not it, so maybe that was a lost tab I still need to hunt down.

turboderp commented 1 year ago

It's weird if there isn't some sort of de facto standard, so researchers wouldn't have to write tests themselves for every model they want to compare their work against. Perplexity/loss is a meaningless figure in isolation. It's also why finetunes do so poorly compared to the base model: it's not that they're worse in general, they're just worse at predicting Wikipedia.

All in all I don't know. The main thing is to identify what's commonly used and compare against that, ideally getting the same results for any given model as GPTQ-for-LLaMa. Cause, even if everything ExLlama does is mathematically equivalent to the other implementations, there are subtle mistakes you can make and not really notice without a comparative benchmark.

For instance in normalizing the hidden state, if I calculate the norm in FP16 instead of FP32, I get significant rounding errors and garbage output because FP16 just doesn't have enough precision for the intermediate result. That's easy to notice, but I fear there could be somewhere else I'm getting slightly incorrect output, which could have a cascading effect through the forward pass and lead to output that's not so much worse that the model starts speaking in tongues, but maybe off by just enough to make it perform worse than it should.
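
For the record, the kind of thing I mean is this (a generic RMSNorm sketch in plain PyTorch, not the actual exllama kernel):

```python
import torch

def rms_norm(hidden: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # hidden and weight are FP16, but the mean of squares is accumulated in FP32:
    # summing thousands of squared FP16 values rounds badly enough to corrupt the
    # normalization, while the final product is fine to store back in FP16.
    variance = hidden.float().pow(2).mean(dim=-1, keepdim=True)
    normed = hidden.float() * torch.rsqrt(variance + eps)
    return weight * normed.to(hidden.dtype)
```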

lhl commented 1 year ago

Yeah, I agree the lack of standardization is a bit maddening, but it seems to reflect the state of the AI world (all the evals published in the various papers aren't very replicable/comparable). My original desire for better perplexity measurements was sparked by the huge number of new quantizations that just dropped in llama.cpp (and there's SpQR and AWQ as well) - it'd be nice to have a baseline to compare how they do, but I agree that having something that matches GPTQ-for-LLaMa's numbers might be a good start, if only from a pure sanity-checking perspective.

I'll do a bit of reworking if you're interested. Maybe a more generalized perplexity function with a -ppl-ds (--perplexity-dataset) option, where -ppl takes an argument (e.g. default, gptq-for-llama, llama.cpp) selecting the chunking and format conventions to replicate whatever we'd want to compare against?
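
Roughly the interface I'm picturing, just as a sketch of the argument handling (names and defaults obviously up for discussion):

```python
import argparse

parser = argparse.ArgumentParser()

# -ppl selects which implementation's chunking/format conventions to emulate;
# -ppl-ds points at the dataset file (.jsonl or raw text).
parser.add_argument("-ppl", "--perplexity", nargs="?", const="default",
                    choices=["default", "gptq-for-llama", "llama.cpp"],
                    help="run perplexity test, optionally mimicking another implementation")
parser.add_argument("-ppl-ds", "--perplexity-dataset", metavar="PATH",
                    help="dataset file to use for the perplexity test")

args = parser.parse_args()
print(args.perplexity, args.perplexity_dataset)
```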

bkutasi commented 1 year ago

> It's weird if there isn't some sort of de facto standard, so researchers wouldn't have to write tests themselves for every model they want to compare their work against.

I always wondered about this, but I think this could help a lot: https://github.com/stanford-crfm/helm. Seems fairly straightforward. Also, the standard stuff like ARC, HellaSwag, MMLU, and TruthfulQA could help a lot, even though they are fairly compute-intensive.

lhl commented 1 year ago

Evals in general are totally not straightforward, even the things that should be. Here's a recent example pointing out how broken the MMLU results on HuggingFace's leaderboard are: https://twitter.com/Francis_YAO_/status/1666833311279517696 - in general, none of the published results from papers ever match when testing, btw, and even things like lm-eval that are supposed to be replicable tend to be broken - eg, I was trying to harmonize w/ Fabrice Bellard's TextSynth server results, but it turns out CoQA has been broken on HF models since 2021: https://github.com/EleutherAI/lm-evaluation-harness/issues/238

The reason for using perplexity is much simpler - just to get an idea of the accuracy loss from different quantizations - but small changes in the chunking actually appear to lead to very different results...

turboderp commented 1 year ago

> I'll do a bit of reworking if you're interested. Maybe a more generalized perplexity function with a -ppl-ds (--perplexity-dataset) option, where -ppl takes an argument (e.g. default, gptq-for-llama, llama.cpp) selecting the chunking and format conventions to replicate whatever we'd want to compare against?

I think comparing against GPTQ and llama.cpp sounds very sensible. I wouldn't mind a PR for that, if you can nail down exactly what they're doing and run an equivalent benchmark for ExLlama.

Of course, one problem right now is that there are a bunch of different code paths that really all need to be tested. But I can add that in as long as the basic test is there for any one of the modes.

bkutasi commented 1 year ago

Hey, thanks for bringing this up. I did some more digging and have to say my previous comment was pretty uninformed. Evaluating LLMs can be tricky, and there are often inconsistencies between published results and actual testing, especially when the methodology is changed a bit or not open-sourced. The HF leaderboard looks pretty suspect, with broken scores all over the place.

lhl commented 1 year ago

> I think comparing against GPTQ and llama.cpp sounds very sensible. I wouldn't mind a PR for that, if you can nail down exactly what they're doing and run an equivalent benchmark for ExLlama.
>
> Of course, one problem right now is that there are a bunch of different code paths that really all need to be tested. But I can add that in as long as the basic test is there for any one of the modes.

OK, I did a refactor today that creates a perplexity module that handles the "default" JSONL method the same as before (validated the results with a bunch of runs), but will also (based on file extension) handle raw data flexibly in the loader. I will continue working on the gptq-for-llama and llama.cpp equivalents over the next few days and then figure out rebasing, since it looks like there are some fairly extensive changes to test_benchmark_inference.py in your recent commit.

Here's what the perplexity module looks like: https://github.com/lhl/exllama/commit/809ddbc744c8064c65f34c19b19e8ce48e414102

(Ideally it'd be nice to have a single object to pass around for the major llm bits but I just ended up passing model/cache/tokenizer like generator does and copying the functions I use as internal methods)

turboderp commented 1 year ago

Those changes aren't as bad as github makes them look. :) I just removed the with torch.no_grad(): scope since there's no point pretending the model supports gradients anyway. They get disabled at the start of the forward pass instead now. But that does shift the indentation for the entire function because Python be like that. I think if you just do the same on your end the diff should end up quite small.
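
In other words, the change is roughly this (a sketch of the idea, not the literal diff):

```python
import torch

class ExLlamaSketch:
    # Not the real class, just illustrating where the gradient switch moved to.
    def forward(self, input_ids, cache):
        # Gradients are never meaningful for this inference-only code, so instead of
        # every caller opening a `with torch.no_grad():` scope (and indenting the whole
        # function body), autograd is switched off once at the top of the forward pass.
        torch.set_grad_enabled(False)
        ...
```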

Passing a single object around isn't really ideal, since the model, cache and tokenizer are intentionally separate. You can run the same model with multiple caches, for instance, and the model has no concept of text strings so the tokenizer is never relevant to it.

Moving the perplexity texts to a separate module is fine, of course. I also plan on adding kernel benchmarks, so it makes sense to split all the tests up at some point.

lhl commented 1 year ago

Just rebased and submitted a WIP PR, mainly to avoid moving-target issues. Forgot to link the issue in the PR comment, so: https://github.com/turboderp/exllama/pull/45

turboderp commented 1 year ago

It looks good. And -ppl still works as is so I just merged it for now.

turboderp commented 1 year ago

Cleaning up a bit, and considering this closed for now, but feel free to reopen if necessary.

lhl commented 1 year ago

k, got a little busy but planning on revisiting this tonight. I'll just note that it looks like GPTQ-for-LLaMa calculates perplexity as part of its benchmark function in a pretty weird way - just wondering if you looked at it or could validate that the ppl numbers match?

The example in their readme tells you to do something like --benchmark 2048 --check, but when you load the wikitext2 dataset, for example (c4, wikitext2, and a few others are built in for testing), it gives you a warning: "Token indices sequence length is longer than the specified maximum sequence length for this model (2874559 > 2048). Running this sequence through the model will result in indexing errors"

If I'm reading the code right, it pulls the first args.benchmark tokens as the input_ids to benchmark, and then it moves the mask around through the range for each token. The loop was a bit dense, so I was going to step through it, double-check exactly what it was doing, and make sure I can get matching perplexity outputs on a few different benchmarks/models (doesn't help that GPTQ-for-LLaMa is dog slow)...

turboderp commented 1 year ago

As far as I can see that warning is kind of meaningless. It treats the test data and training data differently, and the test data is just tokenized and split into 2048-token chunks. Then it computes cross-entropy on 128 chunks and calculates perplexity from that. The mode I added (-ppl gptq-for-llama) seems to emulate that correctly with the chunking code in perplexity.py, or at least it's close enough, maybe give or take the last token in each chunk or something. It does something more with C4, though, so I haven't added that in yet.

But I did some quick tests on Wikitext2 on some base Llama models that I have the logs for, and the results are roughly the same as the eval loss GPTQ-for-LLaMA originally calculated when the models were converted. Perplexity is slightly worse for most models, slightly better in a few cases, and the larger the models get the smaller the difference. It's all pretty much as I would expect.

lhl commented 1 year ago

Ah ok, if it matches, great. I'm testing on a 13b model that I have both GPTQ and GGML versions of, and it doesn't seem to quite match on my end.

GPTQ-for-LLaMA:cuda gives me:

```
CUDA_VISIBLE_DEVICES=0 python llama.py /data/ai/models/llm/manticore/manticore-13b-chat-pyg-GPTQ --wbits 4 --groupsize 128 --load /data/ai/models/llm/manticore/manticore-13b-chat-pyg-GPTQ/Manticore-13B-Chat-Pyg-GPTQ-4bit-128g.no-act-order.safetensors wikitext2 --benchmark 2048 --check
PPL: 6.309726238250732
max memory(MiB): 8680.1416015625
```

and exllama:master (HEAD) gives me:

```
CUDA_VISIBLE_DEVICES=0 python -W ignore::UserWarning: test_benchmark_inference.py -d /data/ai/models/llm/manticore/manticore-13b-chat-pyg-GPTQ -ppl gptq-for-llama -ppl-ds /data/ai/datasets/wikitext-2-raw/wiki.test.raw
 -- Tokenizer: /data/ai/models/llm/manticore/manticore-13b-chat-pyg-GPTQ/tokenizer.model
 -- Model config: /data/ai/models/llm/manticore/manticore-13b-chat-pyg-GPTQ/config.json
 -- Model: /data/ai/models/llm/manticore/manticore-13b-chat-pyg-GPTQ/Manticore-13B-Chat-Pyg-GPTQ-4bit-128g.no-act-order.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['perplexity', 'perplexity_dataset']
 ** Time, Load model: 0.99 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 6,873.52 MB
 -- Loading dataset...
 -- Testing 128 chunks.............
 ** Perplexity: 5.5722
```

6.30 vs 5.57 - does it look like I'm doing anything obviously wrong with my settings?

turboderp commented 1 year ago

I searched around for the exact dataset for a bit, but in the end I decided to just download it from HF using a modified version of the GPTQ-for-LLaMA code. They also stick the examples together with two linefeeds for some reason. But try this:

```
cd datasets
python download_datasets.py
```

And then just run with "-ppl gptq-for-llama". If you don't specify a dataset it will load the downloaded datasets/wikitext2.txt and use that. Also I think the relevant argument for GPTQ-for-LLaMA is -eval, not -benchmark. -benchmark uses the training split if I remember correctly.

Of course it's still possible you get different results just because some models are apparently just extra sensitive to small numerical differences.

lhl commented 1 year ago

I was getting empty results w/ -eval, but I think I see the issue: at least on the cuda branch, it only outputs if you're using the full model, for some reason. I've been going through some conversion shenanigans the past couple of hours. Does exllama support an fp16 format, or only quants? (My conversions always ended up with multiple safetensor files.) I'm now doing tests w/ llama-7b, doing my own conversions from the base FB checkpoint data. I believe I've matched llama.cpp's perplexity settings, but I'm hoping to do fp16 comparisons to make sure the results make sense/are close.