intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Starcoder has long 2nd Avg latency in history Multi-chat #8759

Closed KiwiHana closed 1 year ago

KiwiHana commented 1 year ago

OS: Windows 11; CPU: Core i9-13900H; bigdl-llm build 0815

The 2nd+ token average latency becomes very long starting from the 2nd chat turn of a multi-chat session.
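For reference, the first-token and 2nd+ token latencies printed below are computed by the attached script roughly as follows. This is only a simplified excerpt of its timing logic, using the same variable names as the script:

# timeStart: wall clock before streaming begins
# timeFirst: elapsed time when the first streamed chunk arrives
# timeCost:  total elapsed time for the full response
ms_first_token = timeFirst * 1000
ms_after_token = (timeCost - timeFirst) / (token_count_output - 1) * 1000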

******* 1 ******* Write a Python function to generate the nth Fibonacci number
********** Switch model from  None to StarCoder-15.5b
******************* after del old model memory used: 254.1796875 MB
******* loading StarCoder-15.5b
WARNING:root:BigdlNativeForCausalLM has been deprecated, please switch to the new CausalLM API for sepcific models.
loading bigdl-llm model: loading model from 'D:\PC_LLM_UI\benchmark\checkpoint\\bigdl_llm_starcoder_q4_0.bin'
loading bigdl-llm model: n_vocab = 49152
loading bigdl-llm model: n_ctx   = 8192
loading bigdl-llm model: n_embd  = 6144
loading bigdl-llm model: n_head  = 48
loading bigdl-llm model: n_layer = 40
loading bigdl-llm model: ftype   = 2002
loading bigdl-llm model: qntvr   = 2
starcoder_model_load: ggml ctx size = 25050.47 MB
loading bigdl-llm model: memory size = 15360.00 MB, n_mem = 327680
loading bigdl-llm model: model size  =  9690.23 MB
********** model load time (s)=  5.259162187576294
******************* after load new model memory used: 9999.68359375 MB
C:\Users\BigDL-LLM\miniconda3\envs\ui_llm_v1\lib\site-packages\bigdl\llm\ggml\model\starcoder\starcoder.py:180: UserWarning: The parameter temperature is temporarily unsupported, please use the default value.
  warnings.warn(f"The parameter {unsupported_arg[index]} is temporarily "
C:\Users\BigDL-LLM\miniconda3\envs\ui_llm_v1\lib\site-packages\bigdl\llm\ggml\model\starcoder\starcoder.py:180: UserWarning: The parameter top_p is temporarily unsupported, please use the default value.
  warnings.warn(f"The parameter {unsupported_arg[index]} is temporarily "

******** max_length history 142
token count input:  12
token count output:  130
time cost(s):  24.908780574798584
First token latency(ms):  895.207405090332
After token latency(ms/token) 186.15172999773839
----------------------------------------
******* 2 ******* Does the above code have bug?
******** max_length prompt  149
******** max_length history 661
token count input:  149
token count output:  512
time cost(s):  145.67608880996704
First token latency(ms):  628.0303001403809
After token latency(ms/token) 283.85138651629484
----------------------------------------
******* 3 ******* Write another Python function class Solution to solve the following question.
                 Given a string, find the length of the longest substring without repeating characters.
                Example 1:
                Input: 'abcabcbb'
                 Output: 3
                 Explanation: The answer is
                 'abc' with the length of 3.
******** max_length prompt  742
******** max_length history 1254
token count input:  742
token count output:  512
time cost(s):  288.226683139801
First token latency(ms):  7536.905288696289
After token latency(ms/token) 549.2950642878762
----------------------------------------
******* 4 ******* Write another Python function class Solution to solve the following question.
                Given an array nums of n integers and an integer target, find three integers in nums such that the sum is closest to target. Return the sum of the three integers. You may assume that each input would have exactly one solution.
                Example:
                Given array nums = [-1, 2, 1, -4], and target = 1.
                The sum that is closest to the target is 2. (-1 + 2 + 1 = 2).
******** max_length prompt  1373
******** max_length history 1885
token count input:  1373
token count output:  512
time cost(s):  473.4350109100342
First token latency(ms):  14138.251781463623
After token latency(ms/token) 898.8194894883964

Attached code:

from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer,TextStreamer
import gradio as gr
import mdtex2html
import argparse
import time
from bigdl.llm.transformers import AutoModelForCausalLM
import torch
import sys
import gc
import os
import psutil
from bigdl.llm.ggml.model.chatglm.chatglm import ChatGLM
from bigdl.llm.transformers import BigdlNativeForCausalLM

DICT_FUNCTIONS3 = {
    "编程助手":  "{prompt}",
    "代码补全": "Completing code {prompt}\n\n"
}

## Show the memory currently used by this Python process
def show_memory_info(hint):
    pid = os.getpid()
    p = psutil.Process(pid)

    info = p.memory_full_info()
    memory = info.uss / 1024. / 1024
    print('******************* {} memory used: {} MB'.format(hint, memory))

# Load a llama2 / starcoder native (GGML) model
def load(model_path, model_family, n_threads,n_ctx):
    llm = BigdlNativeForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_path,
        model_family=model_family,
        n_threads=n_threads,
        n_ctx=n_ctx)
    return llm

#def predict(input, function, chatbot, max_length, top_p, temperature, history, past_key_values,model_select):
def predict(input, function, chatbot, max_length, top_p, temperature, model_select):
    global model_name,model_all_local_path,model,history
    input = parse_text(input)  # parse_text (text-cleanup helper) is defined elsewhere in the full script, not in this attachment
    if model_select != model_name:
        print("********** Switch model from ",model_name,"to",model_select)
        model_name = model_select      
        del model
        gc.collect()
        show_memory_info('after del old model')

        stm = time.time()
        if model_name == "llama2-13b":
            print("******* loading llama2-13b")
            model = load(model_path=model_all_local_path + "\\bigdl_llm_llama2_13b_q4_0.bin",
                    model_family='llama',
                    n_threads=20,n_ctx=4096)
        elif model_name == "StarCoder":
            print("******* loading StarCoder")
            model = load(model_path=model_all_local_path + "\\bigdl_llm_starcoder_q4_0.bin",
                    model_family='starcoder',
                    n_threads=20,n_ctx=4096)  
        print("********** model load time (s)= ", time.time() - stm)  
        show_memory_info('after load new model')  

    response = ""
    timeFirst = 0
    timeFirstRecord = False
    timeStart = time.time()
    #print("top_p, temperature, max_length = ", top_p, temperature, max_length)

    if model_name == "llama2-13b":
        template2 = DICT_FUNCTIONS2[function]  # DICT_FUNCTIONS2 (llama2 prompt templates) is defined elsewhere in the full script
        #prompt = template2.format(prompt=input)
        if len(history) == 0:
            prompt = template2.format(prompt=input)
            history = prompt
        else:
            prompt = history + template2.format(prompt=input)
            print("******** max_length prompt ",len(model.tokenize(prompt)))
        for chunk in model(prompt, temperature=temperature,top_p=top_p,stream=True,max_tokens=max_length):   
            response += chunk['choices'][0]['text']
           # print(response)
            #  chatbot[-1] = (input, parse_text(response))
            if not timeFirstRecord:
                timeFirst = time.time() - timeStart
                timeFirstRecord = True
        history = prompt + response
    elif model_name == "StarCoder":
        template3 = DICT_FUNCTIONS3[function]
        #  prompt = template3.format(prompt=input)
        if len(history) == 0:
            prompt = template3.format(prompt=input)
            history = prompt
        else:
            prompt = history + template3.format(prompt=input)
            print("******** max_length prompt ",len(model.tokenize(prompt)))

        for chunk in model(prompt, temperature=temperature,top_p=top_p,stream=True,max_tokens=max_length):   # ,max_tokens=32  ,max_tokens=500
            response += chunk['choices'][0]['text']

            if not timeFirstRecord:
                timeFirst = time.time() - timeStart
                timeFirstRecord = True
        history = prompt + response

    timeCost = time.time() - timeStart

    token_count_input = len(model.tokenize(prompt))  
    token_count_output = len(model.tokenize(response))   

    ms_first_token = timeFirst * 1000
    ms_after_token = (timeCost - timeFirst) / (token_count_output - 1) * 1000
    print("******** max_length history",len(model.tokenize(history)))
#    print("input: ", prompt)
#    print("output: ", parse_text(response))
    print("token count input: ", token_count_input)
    print("token count output: ", token_count_output)
    print("time cost(s): ", timeCost)
    print("First token latency(ms): ", ms_first_token)
    print("After token latency(ms/token)", ms_after_token)
    print("-"*40)

if __name__ == '__main__':

    model_name = "None"
    model_all_local_path = "D:\\PC_LLM_UI\\benchmark\\checkpoint\\"
    model = None
    history=""

    input = "Write a Python function to generate the nth Fibonacci number"
    print("******* 1 *******",input)
    predict(input=input, function="编程助手", chatbot="", max_length=512, top_p=0.8, temperature=0.95, model_select="StarCoder") 

    input = "Does the above code have bug?"
    print("******* 2 *******",input)
    predict(input=input, function="编程助手", chatbot="", max_length=512, top_p=0.8, temperature=0.95, model_select="StarCoder") 

    input = "Write another Python function class Solution to solve the following question.\n \
            Given a string, find the length of the longest substring without repeating characters.\n\
            Example 1:\n\
            Input: 'abcabcbb'\n \
            Output: 3\n \
            Explanation: The answer is\n \
            'abc' with the length of 3.  "
    print("******* 3 *******",input)
    predict(input=input, function="编程助手", chatbot="", max_length=512, top_p=0.8, temperature=0.95, model_select="StarCoder")      

    input = "Write another Python function class Solution to solve the following question.\n\
            Given an array nums of n integers and an integer target, find three integers in nums such that the sum is closest to target. Return the sum of the three integers. You may assume that each input would have exactly one solution.\n\
            Example:\n\
            Given array nums = [-1, 2, 1, -4], and target = 1.\n\
            The sum that is closest to the target is 2. (-1 + 2 + 1 = 2). "
    print("******* 4 *******",input)
    predict(input=input, function="编程助手", chatbot="", max_length=512, top_p=0.8, temperature=0.95, model_select="StarCoder") 
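
Note: the deprecation warning in the log above suggests replacing BigdlNativeForCausalLM with the model-specific native class. Based on the starcoder.py path shown in the warning and the ChatGLM import already used in this script, the replacement would look roughly like the sketch below; the exact import path, class name, and constructor arguments are assumptions and should be checked against the bigdl-llm version in use.

# Hypothetical sketch: import path and class name are assumed, mirroring the ChatGLM import above
from bigdl.llm.ggml.model.starcoder.starcoder import Starcoder

model = Starcoder(model_path=model_all_local_path + "\\bigdl_llm_starcoder_q4_0.bin",
                  n_threads=20, n_ctx=4096)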
shane-huang commented 1 year ago

Maybe a similar reason to #8758.

MeouSker77 commented 1 year ago

StarCoder's attention layer performs two more memory copies than LLaMA's, and the overhead of these copies grows rapidly as the context length increases.

So at a context length of 1800, StarCoder's 2nd+ token latency is about 5x its short-context value, while LLaMA's 2nd+ token latency is only about 1.5x at the same context length.
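
To make the scaling concrete, here is a small, purely illustrative micro-benchmark (not ipex-llm code; the head shape is an assumption) that times one extra copy of a per-layer attention buffer at the history lengths reported above:

import time
import torch

# Hypothetical micro-benchmark: one extra full copy of a per-layer attention buffer
# of shape (n_head, context, head_dim) per generated token. Its cost grows roughly
# linearly with the context length, which is the effect described above.
n_head, head_dim = 48, 128          # StarCoder-15.5B-like head shape (assumed)
for ctx in (142, 661, 1254, 1885):  # the history lengths from the report
    buf = torch.randn(n_head, ctx, head_dim)
    t0 = time.time()
    for _ in range(20):             # simulate 20 decoding steps
        _ = buf.clone()             # the extra memory copy per token
    ms = (time.time() - t0) / 20 * 1000
    print(f"context {ctx:5d}: one extra copy costs about {ms:.2f} ms/token (per layer)")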

MeouSker77 commented 1 year ago

I have made a small improvement to StarCoder's memory copies; with it, the 2nd+ token latency is now about 3x when the context length is 1800.

Here is a brief performance table with this improvement:

history length | 2nd+ latency (ms/token)
--- | ---
142 | 183
661 | 241
1254 | 364
1885 | 532
KiwiHana commented 1 year ago

Thanks