turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

mixtral-8x22b / command-r generation wanders off. #410

Closed: bdambrosio closed this issue 2 months ago

bdambrosio commented 2 months ago

I'm doing my own quant using this:

export CUDA_VISIBLE_DEVICES=0,1,2
python3 convert.py -i ../models/mixtral-8x22b-instruct-oh -o mixtral-8x22b-exl2 -cf mixtral-8x22b-exl2 -l 2048 -b 6.0 -hb 8 -ss 8192

Works fine with a number of smaller models, including Smaug-mixtral-7x22B.

However, with the bigger models, e.g. the one above, the generated output seems to wander off after a couple hundred tokens. The exact same model as a Q6_K_M GGUF stays on track with the same prompt.

I'm sure it's something I'm doing wrong, but I have no idea what. Ideas? Anything obvious I should try? (Yes, I've played with temp and top_p, as well as prompt template variations, ad infinitum. I'm using the prompt template as specified in the mixtral-8x22b-instruct-oh tokenizer_config.json.)
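For completeness, a minimal sketch of rendering that template straight from tokenizer_config.json via the Hugging Face tokenizer rather than hand-writing it (this assumes the transformers tokenizer loads from the same model directory; the message content is just a placeholder):

from transformers import AutoTokenizer

# Renders the chat template stored in tokenizer_config.json into a prompt string
hf_tokenizer = AutoTokenizer.from_pretrained("../models/mixtral-8x22b-instruct-oh")
messages = [{"role": "user", "content": "placeholder user message"}]
prompt = hf_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)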

runtime:

# Imports for the snippet below (model and config are created earlier, not shown here)
from exllamav2 import ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2Sampler, ExLlamaV2StreamingGenerator
from fastapi import Request
from fastapi.responses import StreamingResponse

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

# Initialize generator
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.1  # default - overridden at the uvicorn endpoint
settings.top_k = 50
settings.top_p = 0.8  # default - overridden at the uvicorn endpoint
settings.token_repetition_penalty = 1.15
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

# Make sure CUDA is initialized so we can measure performance
generator.warmup()

async def get_stream(request: Request):
    global stop_gen
    global generator, settings, tokenizer
    query = await request.json()
    print(f'request: {query}')
    message_j = query
    # ... grab run conditions from query (stop_conditions, max_tokens, stop_on_json, input_ids) ...
    generator.set_stop_conditions(stop_conditions)
    generator.begin_stream(input_ids, settings)
    return StreamingResponse(stream_data(query, max_new_tokens=max_tokens, stop_on_json=stop_on_json))
bdambrosio commented 2 months ago

Sorry. Amazing what a code review for a post can do. Not sure why this line is there: settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id]), but removing it seems to fix things...
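In case it helps anyone else, a minimal sketch of the corrected settings block, with the EOS token left sampleable and (as an assumption about the intended behavior, not exactly what my endpoint does) added to the stop conditions instead:

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.1
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.15
# No settings.disallow_tokens(...) here: the instruct model needs to be able
# to emit EOS to end its turn. Let EOS act as a stop condition instead:
generator.set_stop_conditions([tokenizer.eos_token_id])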

turboderp commented 2 months ago

I was just about to say, you probably don't want to disallow the EOS token for an instruct model.

Also, that repetition penalty seems rather high. I try to either avoid it altogether or use a much lower penalty like 1.01, since it indiscriminately penalizes tokens like punctuation, which can throw some models off.
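For example (a minimal sketch; a penalty of 1.0 leaves the logits untouched):

# 1.01 is only a very light nudge against repetition
settings.token_repetition_penalty = 1.01   # or 1.0 to disable the penalty entirely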