lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

2048 context length limit about qwen-7b-chat #2324

Open Hspix opened 1 year ago

Hspix commented 1 year ago

Bug Description

The Qwen-7B-Chat model is deployed under FastChat with the vLLM worker and accessed through the OpenAI-compatible API, integrated with LangChain. When the number of input tokens exceeds 2048, the request raises:

openai.error.APIError: Invalid response object from API: '{"object":"error","message":"This model\'s maximum context length is 2048 tokens. However, you requested 2167 tokens (1655 in the messages, 512 in the completion). Please reduce the length of the messages or completion.","code":40303}' (HTTP response code was 400)

However, this shouldn't happen when use_dynamic_ntk and use_logn_attn are set to true in the model's config.json file.
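For reference, a minimal sketch of checking those two flags on the local checkpoint with transformers (the checkpoint path below is a placeholder):

    from transformers import AutoConfig

    # Placeholder path to the local Qwen-7B-Chat checkpoint
    model_path = "/path/to/Qwen-7B-Chat"

    # trust_remote_code is needed because Qwen ships a custom config class
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

    # Both flags should read True for the NTK/logn context extension to apply
    print("use_dynamic_ntk:", getattr(config, "use_dynamic_ntk", None))
    print("use_logn_attn:", getattr(config, "use_logn_attn", None))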

Steps to Reproduce

  1. python3 -m fastchat.serve.controller
  2. python3 -m fastchat.serve.vllm_worker --model-path ** --trust-remote-code --model-names qwen-7b-chat
  3. python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
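The error can also be reproduced without LangChain by calling the OpenAI-compatible endpoint directly; a minimal sketch with openai==0.27.9, where long_text is a placeholder for a prompt of more than 2048 tokens:

    import openai

    openai.api_key = "EMPTY"
    openai.api_base = "http://localhost:8000/v1"

    long_text = "..."  # placeholder: any prompt longer than 2048 tokens

    # Fails with the same "maximum context length is 2048 tokens" error
    response = openai.ChatCompletion.create(
        model="qwen-7b-chat",
        messages=[{"role": "user", "content": long_text}],
        max_tokens=512,
    )
    print(response.choices[0].message.content)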

Packages

  1. vllm==0.1.4
  2. fschat==0.2.24
  3. langchain==0.0.274
  4. openai==0.27.9

Code piece

    from langchain.prompts import PromptTemplate
    from langchain.chat_models import ChatOpenAI
    from langchain.schema import HumanMessage

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    model_name = "qwen-7b-chat"

    # `template` and `html_text` are defined elsewhere in the application
    prompt = PromptTemplate(template=template, input_variables=["html_text"])
    model = ChatOpenAI(
        model=model_name, openai_api_key=openai_api_key, openai_api_base=openai_api_base, verbose=True,
        # use_dynamc_ntk=True, use_logn_attn=True,  # no effect
        # model_kwargs={'use_dynamc_ntk': True, 'use_logn_attn': True}  # no effect
    )
    query = prompt.format_prompt(html_text=html_text).to_string()
    output = model([HumanMessage(content=query)])  # raises the exception shown above
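As a stopgap (not a fix for the context limit itself), the completion budget can be capped so that prompt plus completion stays within 2048 tokens; a sketch, assuming ChatOpenAI in langchain 0.0.274 accepts max_tokens:

    # Stopgap only: keep prompt + completion within 2048 tokens.
    # With the 1655-token prompt from the error above, a 256-token completion fits.
    model = ChatOpenAI(
        model=model_name,
        openai_api_key=openai_api_key,
        openai_api_base=openai_api_base,
        max_tokens=256,
    )
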
123zzw commented 1 year ago

I also have the same problem. Have you solved it?

zdj-1995 commented 1 year ago

Same 2048 limit here.

Trangle commented 1 year ago

This issue has been submitted to the vllm-project/vllm repository, and NTK support is currently being debugged there. The default implementation only supports the base context length of 2048. As a workaround, you can load the model with 'model_worker' first.
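A sketch of that workaround, assuming model_worker accepts the same flags as the vLLM worker (only step 2 of the reproduction changes):

    python3 -m fastchat.serve.model_worker --model-path ** --trust-remote-code --model-names qwen-7b-chat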

Hspix commented 1 year ago

Can flash-attn be used under the model_worker?

zdj-1995 commented 1 year ago

You just need to change the default, then the issue can probably be closed. There is no bug in the code. @Trangle

RipperTs commented 1 year ago

> You just need to change the default, then the issue can probably be closed. There is no bug in the code. @Trangle

Which default are you referring to?

RipperTs commented 1 year ago

I also have the same problem. Have you solved it?

hanswang1 commented 9 months ago

When I am using the vicuna-7b-v1.5 model under FastChat and vLLM, I hit a 4096-token prompt limit. [screenshot: MicrosoftTeams-image (19)]

I think this issue is related to this question. How can this problem be solved?

hanswang1 commented 9 months ago

> You just need to change the default, then the issue can probably be closed. There is no bug in the code. @Trangle
>
> Which default are you referring to?

Same question.