turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Generation speed dropping (40 tokens/s to 10 tokens/s) when using chat.py #292

Open tednas opened 5 months ago

tednas commented 5 months ago

Hi @turboderp, thanks for the great tool.

For a use case of 1000 prompts, I am experimenting with two scripts:

using model: dolphin-2.6-mistral-7B-GPTQ with args.mode = "chatml"

While chat.py produces great-quality responses, the generation speed drops drastically as the number of prompt iterations increases; the per-response timings are shown below.

Observation 1: Resetting the cache with cache.current_seq_len = 0 has no impact.
Observation 2: VRAM usage is stable at ~9 GB (about 60% of capacity).

while True:

    # Hard-coded input instead of reading interactively from the console:
    print_timings = True
    context = "This is my ideal Thanksgiving dinner, a roasted turkey with stuffing and mashed potatoes on the side plus a hearty salad and plenty of cranberry sauce"
    up = """Extract all the edible items from the given context below.\nThen categorize these extracted items using the Food Pyramid as a guide like "Pizza: Carbohydrate" or "Salad: Vegetable"\nOnly list them without any extra explanation.""" + "\nContext=" + context

    # print()
    # up = input(col_user + username + ": " + col_default).strip()
    # print()

    # Add to context
    user_prompts.append(up)
    # Send tokenized context to generator

    active_context = get_tokenized_context(model.config.max_seq_len - min_space_in_context)
    generator.begin_stream(active_context, settings)
      ...
      ...

Response:

(Response: 47 tokens, 40.03 tokens/second)
(Response: 53 tokens, 38.62 tokens/second)
(Response: 53 tokens, 37.12 tokens/second)
(Response: 53 tokens, 35.53 tokens/second)
(Response: 53 tokens, 34.28 tokens/second)
(Response: 53 tokens, 32.81 tokens/second)
(Response: 53 tokens, 31.58 tokens/second)
(Response: 53 tokens, 30.30 tokens/second)
(Response: 53 tokens, 29.43 tokens/second)
(Response: 53 tokens, 28.38 tokens/second)
(Response: 53 tokens, 27.59 tokens/second)
(Response: 53 tokens, 26.52 tokens/second)

Part 2 of the question: While I can get a great response from chat.py, such as:

- Turkey: Protein
- Stuffing: Carbohydrate
- Mashed Potatoes: Carbohydrate
- Salad: Vegetable
- Cranberry Sauce: Fruit 

Using inference.py and the same chatml prompt template, the result is not concise at all:

# Imports used by this excerpt; config and model are created and loaded earlier,
# as in the library's examples/inference.py, and are omitted here.
from exllamav2 import ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler
import time

cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# Initialize generator

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Generate some text

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05
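# Note: disallowing the EOS token on the next line means generation can never stop
# early and always runs to max_new_tokens, which can leave trailing filler text.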
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

#prompt = "Our story begins in the Scottish town of Auchtermuchty, where once"
context = "This is my ideal Thanksgiving dinner, a roasted turkey with stuffing and mashed potatoes on the side plus a hearty salad and plenty of cranberry sauce"
system_message = "You are a very helpful AI assistant!"
user_prompt = """Extract all the edible items from the given context below.\nThen categorize these extracted items using the Food Pyramid as a guide like "Pizza: Carbohydrate" or "Salad: Vegetable"\nOnly list them without any extra explanation.""" + "\nContext=" + context
prompt = f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant"

max_new_tokens = 250

generator.warmup()

time_begin = time.time()
output = generator.generate_simple(prompt, settings, max_new_tokens, seed = 1234, encode_special_tokens = True)

Result

system
You are a very helpful AI assistant!<|im_end|>
 user
Extract all the edible items from the given context below.
Then categorize these extracted items using the Food Pyramid as a guide like "Pizza: Carbohydrate" or "Salad: Vegetable"
Only list them without any extra explanation.
Context=This is my ideal Thanksgiving dinner, a roasted turkey with stuffing and mashed potatoes on the side plus a hearty salad and plenty of cranberry sauce<|im_end|>
 assistant Your request involves text extraction and categorization based on food pyramid. Here's how we can solve this problem:

First, let's extract all the edible items from the given context:

1. Roasted turkey
2. Stuffing
3. Mashed potatoes
4. Hearty salad
5. Cranberry sauce

Next, let's categorize these extracted items using the Food Pyramid as a guide:

1. Roasted turkey: Protein
2. Stuffing: Carbohydrate
3. Mashed potatoes: Carbohydrate
4. Hearty salad: Vegetable
5. Cranberry sauce: Fruit

So, the final output would be:

1. Roasted turkey: Protein
2. Stuffing: Carbohydrate
3. Mashed potatoes: Carbohydrate
4. Hearty salad: Vegetable
5. Cranberry sauce: Fruit

Please note that the actual categorization could vary depending on how strictly you want to follow the Food Pyramid. For example

Response generated in 5.14 seconds, 250 tokens, 48.63 tokens/second

@turboderp any recommendations are greatly appreciated :)

turboderp commented 5 months ago

Are you sure the prompt format is being correctly applied?

system
You are a very helpful AI assistant!<|im_end|>
 user
Extract all the edible items from the given context below.
Then categorize these extracted items using the Food Pyramid as a guide like "Pizza: Carbohydrate" or "Salad: Vegetable"
Only list them without any extra explanation.
Context=This is my ideal Thanksgiving dinner, a roasted turkey with stuffing and mashed potatoes on the side plus a hearty salad and plenty of cranberry sauce<|im_end|>
 assistant

Where did the <|im_start|> tags go? Also you probably want an extra \n at the end of the prompt.
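
For reference, a minimal sketch of that prompt string with the tags and the trailing newline in place (same variable names as the inference.py snippet above):

prompt = (f"<|im_start|>system\n{system_message}<|im_end|>\n"
          f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
          f"<|im_start|>assistant\n")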

It's normal for the speed to decrease as you build up a context. Seems to be dropping a little quicker than I'd expect, though. If you're on Linux, installing flash-attn can help a bunch with that, but it's a little harder to get working on Windows.

tednas commented 5 months ago

Thanks @turboderp for the fast response, your maintenance is awesome

You are right about the prompt template; that's why I started with chat.py, since everything is applied properly within that code. Now I am using the format from one of your previous responses in other issues. Although the <|im_start|> tags are there before calling the generator, I do not see them in the response.

#prompt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant\n"
def format(prompt, response, system_prompt, settings):
    text = ""
    if system_prompt and system_prompt.strip() != "":
        text += "<|im_start|>system\n"
        text += system_prompt
        text += "\n<|im_end|>\n"
    text += "<|im_start|>user\n"
    text += prompt
    text += "<|im_end|>\n"
    text += "<|im_start|>assistant\n"
    if response:
        text += response
        text += "<|im_end|>\n"
    return text

prompt = format(user_prompt, None, system_prompt, settings)

Response

******************************
Log for prompt before calling generator
 <|im_start|>system
You are a very helpful AI assistant!
<|im_end|>
<|im_start|>user
Extract all the edible items from the given context below.
Then categorize these extracted items using the Food Pyramid as a guide like "Pizza: Carbohydrate" or "Salad: Vegetable"
Only list them without any extra explanation.
Context=This is my ideal Thanksgiving dinner, a roasted turkey with stuffing and mashed potatoes on the side plus a hearty salad and plenty of cranberry sauce<|im_end|>
<|im_start|>assistant

******************************
system
You are a very helpful AI assistant!
<|im_end|>
 user
Extract all the edible items from the given context below.
Then categorize these extracted items using the Food Pyramid as a guide like "Pizza: Carbohydrate" or "Salad: Vegetable"
Only list them without any extra explanation.
Context=This is my ideal Thanksgiving dinner, a roasted turkey with stuffing and mashed potatoes on the side plus a hearty salad and plenty of cranberry sauce<|im_end|>
 assistant
 Here's your categorized list:
- Roasted turkey: Protein
- Stuffing: Carbohydrate
- Mashed potatoes: Carbohydrate
- Hearty salad: Vegetable
- Cranberry sauce: Fruit

Remember, the Food Pyramid can vary by culture and country. This is just an example. Please consult a nutritionist for accurate information. Always consult a doctor before making significant changes to your diet. The Food Pyramid is a general guideline, and personal dietary needs may vary. Items can be classified differently based on specific nutritional requirements. Please consider these factors when using this information. This categorization is just for reference purposes only. Always consult a professional for dietary advice. This response is not intended to replace professional medical advice. Always consult a healthcare provider before making major changes to your diet. Different cultures and regions have different food pyramids so it's best to consult a local dietitian or nutritionist for accurate information. This categorization is intended only as a general guideline and should not replace the advice of a qualified healthcare professional. Always consult a health professional before making significant changes to

Response generated in 5.20 seconds, 250 tokens, 48.05 tokens/second

@turboderp I need some advice on which direction to focus on.

turboderp commented 5 months ago

For evaluating many independent prompts I would consider batching. Depending on how much VRAM you have and how long the prompts and replies end up being, you could evaluate tens to hundreds at once.
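
A minimal sketch of what batched generation could look like here, assuming generate_simple accepts a list of prompts and the cache is built with a matching batch_size (check both against your installed exllamav2 version); prompts is a hypothetical list of the raw inputs, and format is the template helper from the previous comment:

# Batch a slice of the workload; cache max_seq_len is capped so the batched
# key/value cache fits in VRAM (assumed values, tune for your GPU).
batch = [format(p, None, system_prompt, settings) for p in prompts[:20]]

cache = ExLlamaV2Cache(model, lazy = True, batch_size = len(batch), max_seq_len = 1024)
model.load_autosplit(cache)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
outputs = generator.generate_simple(batch, settings, max_new_tokens,
                                    seed = 1234, encode_special_tokens = True)

for text in outputs:
    print(text)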

It should also be possible to get the same output from inference.py. There's likely just some subtle difference in how the template is applied or how it's encoded, or possibly the sampling parameters are a little different.
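
One way to hunt for that difference, sketched under the assumption that the tokenizer flags behave as in the generate_simple call above, is to dump the token IDs the prompt actually encodes to and compare them between the two scripts:

# <|im_start|> / <|im_end|> should each encode to a single control-token ID;
# if a tag expands into several IDs, the template is being treated as plain text.
ids = tokenizer.encode(prompt, encode_special_tokens = True)
print(ids)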

As for the speed of chat.py, the reason it slows down is that it's accumulating a context, but that's not relevant if your queries are all independent. You can just reset the context. For chat.py the command line argument would be --amnesia
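
As a sketch of that reset applied to the modified loop from the first post (one plausible reason Observation 1 had no effect is that user_prompts keeps growing, so get_tokenized_context rebuilds an ever-larger context; list names may differ in your copy of chat.py):

# Clear all accumulated state between independent prompts, in the spirit of --amnesia:
user_prompts.clear()        # drop the accumulated user turns
responses_ids.clear()       # hypothetical name for the stored responses, if kept
cache.current_seq_len = 0   # discard the cached keys/values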

tednas commented 5 months ago

That's great, thank you so much for the detailed response :) Gonna test them.