Closed: bdambrosio closed this 2 months ago
Sorry. Amazing what a code review for a post can do. Not sure why this line was there: `settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])`, but removing it seems to fix things...
I was just about to say, you probably don't want to disallow the EOS token for an instruct model.
Also, that repetition penalty seems rather high. I try to either avoid it altogether or use a much lower penalty like 1.01, since it indiscriminately penalizes tokens like punctuation, which can throw some models off.
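For what it's worth, a minimal sketch of sampler settings along the lines of the exllamav2 examples, without the EOS restriction and with a much milder penalty (the concrete values here are placeholders, not tuned recommendations):

```python
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8                 # placeholder value
settings.top_p = 0.9                       # placeholder value
settings.token_repetition_penalty = 1.01   # mild penalty, per the comment above

# Note: no settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id]) here,
# so the instruct model can emit EOS and stop cleanly.
```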
I'm doing my own quant using this:

export CUDA_VISIBLE_DEVICES=0,1,2
python3 convert.py -i ../models/mixtral-8x22b-instruct-oh -o mixtral-8x22b-exl2 -cf mixtral-8x22b-exl2 -l 2048 -b 6.0 -hb 8 -ss 8192
Works fine with a number of smaller models, including Smaug-mixtral-7x22B.
However, with the bigger models, e.g. the one above, the generated output seems to wander off after a couple hundred tokens. The exact same model as a Q6_K_M GGUF stays on track with the same prompt.
I'm sure it's something I'm doing wrong, but I have no idea what. Ideas? Anything obvious I should try? (Yes, I've played with temp and top_p, as well as prompt template variations, ad infinitum. I'm using the prompt template as specified in the mixtral-8x22b-instruct-oh tokenizer_config.json.)
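(For reference, a sketch of how the chat template shipped in tokenizer_config.json would typically be applied via transformers; the message content below is just a placeholder, not the actual prompt used:)

```python
from transformers import AutoTokenizer

# Load the tokenizer (and its chat template) from the model directory
tok = AutoTokenizer.from_pretrained("../models/mixtral-8x22b-instruct-oh")

messages = [{"role": "user", "content": "..."}]  # placeholder message
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```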
runtime: