lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

2048 context length limit about qwen-7b-chat #2324

Open Hspix opened 1 year ago

Hspix commented 1 year ago

Bug Description

The Qwen-7B-Chat model is deployed under FastChat with the vLLM worker and accessed through the OpenAI-compatible API, integrated with LangChain. When the number of input tokens exceeds 2048, the request raises:

openai.error.APIError: Invalid response object from API: '{"object":"error","message":"This model\'s maximum context length is 2048 tokens. However, you requested 2167 tokens (1655 in the messages, 512 in the completion). Please reduce the length of the messages or completion.","code":40303}' (HTTP response code was 400)

However, this shouldn't happen when use_dynamic_ntk and use_logn_attn are set to true in the model's config.json file.
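For reference, a minimal sketch of checking those two flags on the local checkpoint with transformers (the checkpoint path below is a placeholder):

    from transformers import AutoConfig

    # Placeholder path to the local Qwen-7B-Chat checkpoint
    model_path = "/path/to/Qwen-7B-Chat"

    # trust_remote_code is needed because Qwen ships a custom config class
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

    # Both flags should read True for the NTK/logn context extension to apply
    print("use_dynamic_ntk:", getattr(config, "use_dynamic_ntk", None))
    print("use_logn_attn:", getattr(config, "use_logn_attn", None))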

Steps to Reproduce

  1. python3 -m fastchat.serve.controller
  2. python3 -m fastchat.serve.vllm_worker --model-path ** --trust-remote-code --model-names qwen-7b-chat
  3. python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
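The error can also be reproduced without LangChain by calling the OpenAI-compatible endpoint directly; a minimal sketch with openai==0.27.9, where long_text is a placeholder for a prompt of more than 2048 tokens:

    import openai

    openai.api_key = "EMPTY"
    openai.api_base = "http://localhost:8000/v1"

    long_text = "..."  # placeholder: any prompt longer than 2048 tokens

    # Fails with the same "maximum context length is 2048 tokens" error
    response = openai.ChatCompletion.create(
        model="qwen-7b-chat",
        messages=[{"role": "user", "content": long_text}],
        max_tokens=512,
    )
    print(response.choices[0].message.content)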

Packages

  1. vllm==0.1.4
  2. fschat==0.2.24
  3. langchain==0.0.274
  4. openai==0.27.9

Code piece

    from langchain.prompts import PromptTemplate
    from langchain.chat_models import ChatOpenAI
    from langchain.schema import HumanMessage

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    model_name = "qwen-7b-chat"

    # `template` and `html_text` are defined elsewhere in the application
    prompt = PromptTemplate(template=template, input_variables=["html_text"])
    model = ChatOpenAI(
        model=model_name, openai_api_key=openai_api_key, openai_api_base=openai_api_base, verbose=True,
        # use_dynamc_ntk=True, use_logn_attn=True,  # no effect
        # model_kwargs={'use_dynamc_ntk': True, 'use_logn_attn': True}  # no effect
    )
    query = prompt.format_prompt(html_text=html_text).to_string()
    output = model([HumanMessage(content=query)])  # raises the exception shown above
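As a stopgap (not a fix for the context limit itself), the completion budget can be capped so that prompt plus completion stays within 2048 tokens; a sketch, assuming ChatOpenAI in langchain 0.0.274 accepts max_tokens:

    # Stopgap only: keep prompt + completion within 2048 tokens.
    # With the 1655-token prompt from the error above, a 256-token completion fits.
    model = ChatOpenAI(
        model=model_name,
        openai_api_key=openai_api_key,
        openai_api_base=openai_api_base,
        max_tokens=256,
    )
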
123zzw commented 1 year ago

I also have the same problem. Have you solved it?

zdj-1995 commented 1 year ago

Same 2048 limit here.

Trangle commented 1 year ago

This issue has been submitted to the vllm-project/vllm repository, and NTK support is currently being debugged there. The default implementation only supports the base context length of 2048. As a workaround, you can load the model with 'model_worker' first.
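A sketch of that workaround, assuming model_worker accepts the same flags as the vLLM worker (only step 2 of the reproduction changes):

    python3 -m fastchat.serve.model_worker --model-path ** --trust-remote-code --model-names qwen-7b-chat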

Hspix commented 1 year ago

Can flash-attn be used under the model_worker?

zdj-1995 commented 1 year ago

You just need to change the default, then the issue can probably be closed. There is no bug in the code. @Trangle

RipperTs commented 1 year ago

> You just need to change the default, then the issue can probably be closed. There is no bug in the code. @Trangle

Which default are you referring to?

RipperTs commented 1 year ago

I also have the same problem. Have you solved it?

hanswang1 commented 9 months ago

When I am using the vicuna-7b-v1.5 model under FastChat and vLLM, I hit a 4096-token prompt limit. [screenshot: MicrosoftTeams-image (19)]

I think this issue is related to this question. How can this problem be solved?

hanswang1 commented 9 months ago

> You just need to change the default, then the issue can probably be closed. There is no bug in the code. @Trangle
>
> Which default are you referring to?

Same question.