langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

"Warning: model not found. Using cl100k_base encoding." with Azure OpenAI and load_summarize_chain when trying to implement map_reduce #13224

Closed GonzaloMoreno closed 6 months ago

GonzaloMoreno commented 1 year ago

System Info

Hi: I have tried several strategies to implement map-reduce summarization using Azure OpenAI and LangChain. My model is "gpt-35-turbo-16k".

I have tried several experiments but always get the same warning:

from langchain.schema.document import Document
from langchain.chains.mapreduce import MapReduceChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.chat_models import AzureChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm_summary = AzureChatOpenAI(
    openai_api_base=azure_api_base,
    openai_api_version=azure_openai_api_version,
    deployment_name=azure_deployment_name,
    openai_api_key=azure_openai_api_key,
    openai_api_type=azure_api_type,
    model_name=azure_model_name,
    temperature=azure_model_temperature,
)

text="""The ReduceDocumentsChain handles taking the document mapping results and reducing them into a single output.\ It wraps a generic CombineDocumentsChain (like StuffDocumentsChain) but adds the ability to collapse documents before passing it to the CombineDocumentsChain if their cumulative size exceeds token_max. In this example, we can actually re-use our chain for combining our docs to also collapse our docs."""

text1=""" You can continue with your English studies and never use Inversion in sentences. That’s perfectly okay. However, if you are preparing for a Cambridge or IELTS exam or other exams or situations where you need to demonstrate an extensive use of English, you will be expected to know about Inversion.

Let’s start with why and when. After all, if you don’t know why we use Inversion, you won’t know when to use it.

WHY & WHEN do we use INVERSION? Inversion is mainly used for EMPHASIS. The expressions used (never, rarely, no sooner, only then, etc.) have much more impact when used at the beginning of a sentence than the more common pronoun subject, especially as most of them are negative. Negatives are more dramatic. Consider negative contractions: don’t, won’t, can’t, haven’t, etc. They usually have strong stress in English whilst positive contractions: I’m, he’ll, she’s, we’ve, I’d, etc. usually have weak stress. """

doc = [Document(page_content=text1)]

chain = load_summarize_chain(llm_summary, chain_type="map_reduce")

chain.run(doc)

And Strategy 2, with a text_splitter:

from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=50)
chunks = text_splitter.create_documents([text1])

chain = load_summarize_chain(llm_summary, chain_type="map_reduce", verbose=False)
summary = chain.run(chunks)
summary

I always get the same output:

[screenshot: "Warning: model not found. Using cl100k_base encoding." printed in the notebook output]

I have some questions:

1. How can I fix this warning?
2. Can I trust the output when the model is not found?

Who can help?

@agola11

Reproduction

Run these chunks of code in any notebook.

Expected behavior

I want to fix this warning by helping LangChain find the model.

dosubot[bot] commented 1 year ago

🤖

Hello @GonzaloMoreno,

Nice to meet you! I'm here to assist you with issues and questions related to the LangChain repository, providing prompt help while the human maintainers are unavailable.

Thank you for providing detailed information about your issue. I am currently reviewing your problem and will provide a comprehensive response shortly.

dosubot[bot] commented 1 year ago

🤖

Based on the context provided, the warning "model not found" typically indicates that the specified model, in this case "gpt-35-turbo-16k", is not deployed or available in your Azure OpenAI environment. To resolve this, you should ensure that the model is correctly deployed on Azure OpenAI and that the deployment name is correctly specified when creating an instance of the AzureOpenAI class.

Here is an example of how to create an instance of AzureOpenAI:

from langchain.llms import AzureOpenAI

llm = AzureOpenAI(
    deployment_name="35-turbo-dev",  # replace with your deployment name
    openai_api_version="2023-05-15",  # replace with your API version
)

If the model is not found, the output of the language model may not be reliable or may not be produced at all. Therefore, it's crucial to ensure that the model is correctly deployed and accessible.

In the context shared, the LangChain framework uses the tiktoken package to tokenize the input text. If the specified model is not found in tiktoken's registry, it falls back to the "cl100k_base" encoding. The framework will still produce an output, but token counts computed with a fallback encoding may not match the model's actual tokenizer.
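For illustration, here is a minimal sketch of the lookup that produces the warning (assuming only that the tiktoken package is installed). Azure spells the model name without a dot ("gpt-35-turbo-16k"), while tiktoken registers OpenAI's dotted names, so the lookup can miss:

import tiktoken

model = "gpt-35-turbo-16k"  # Azure-style spelling, no dot
try:
    enc = tiktoken.encoding_for_model(model)
    print(model, "->", enc.name)
except KeyError:
    # This is the miss that LangChain catches before falling back.
    print(model, "not registered; LangChain falls back to cl100k_base")

# The dotted OpenAI spelling resolves, and it maps to cl100k_base anyway:
print(tiktoken.encoding_for_model("gpt-3.5-turbo-16k").name)  # cl100k_base

Note that recent tiktoken releases also register the Azure spelling, so whether the first lookup succeeds depends on your tiktoken version.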

Additionally, the context shared indicates that if you are trying to use a chat model like "gpt-3.5-turbo", you should initialize it using from langchain.chat_models import ChatOpenAI instead of the standard initialization method. This could be another reason why the model is not found.

The new context shared also shows that the LangChain framework validates the environment variables for the OpenAI API key, base URL, proxy, and organization. Make sure these are correctly set in your environment. If the OpenAI API key is not correctly set, the framework may not be able to access the specified model, leading to the "model not found" warning.

Regarding your second question: as noted above, if the model is not found the output may not be reliable or may not be produced at all, so it's crucial to ensure the model is correctly deployed and accessible.

I hope this helps! If you have any other questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

GonzaloMoreno commented 1 year ago

@dosu-bot I have verified the deployment name and the API version, and they are correct. The problem seems to be with load_summarize_chain: it works fine when the chain type is "stuff" but fails when it is "map_reduce".

dosubot[bot] commented 1 year ago

🤖

Based on the information you've provided, it seems like the "model not found" warning is due to the LangChain framework not recognizing the "gpt-35-turbo-16k" model. This could be because the model name is not included in the model_token_mapping dictionary in the modelname_to_contextsize method.

To resolve this issue, you can add the model "gpt-35-turbo-16k" to the model_token_mapping dictionary with its corresponding maximum context size. Here's how you can modify the modelname_to_contextsize method:

@staticmethod
def modelname_to_contextsize(modelname: str) -> int:
    model_token_mapping = {
        # ... existing model mappings ...
        "gpt-35-turbo-16k": <max_context_size_for_this_model>,  # Add your model here
    }

    # rest of the method...

Replace <max_context_size_for_this_model> with the maximum context size for the "gpt-35-turbo-16k" model. If you're unsure about this value, you might need to refer to the documentation provided by the model's provider.

You can find the modelname_to_contextsize method in the libs/langchain/langchain/llms/openai.py file.
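A lighter alternative than patching the library — a sketch, assuming the token counter reads the model_name field, as the ChatOpenAI wrapper does — is to keep your Azure deployment name for routing but pass the dotted OpenAI spelling as model_name so the tiktoken lookup succeeds. This would also fit your observation that "stuff" works while "map_reduce" warns: the map-reduce chain counts tokens to decide whether intermediate results must be collapsed, and that count is what triggers the lookup.

from langchain.chat_models import AzureChatOpenAI

# Sketch: deployment_name routes the request to the Azure deployment,
# while model_name (the dotted OpenAI spelling) is what the tiktoken
# lookup sees when the chain counts tokens.
llm_summary = AzureChatOpenAI(
    deployment_name=azure_deployment_name,   # your Azure deployment
    model_name="gpt-3.5-turbo-16k",          # dotted spelling for tiktoken
    openai_api_base=azure_api_base,
    openai_api_version=azure_openai_api_version,
    openai_api_key=azure_openai_api_key,
)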

As for the load_summarize_chain function, it is used to load a specific type of document combining chain for summarizing text. The function takes in a language model (llm), the type of chain to use (chain_type), and whether the chains should be run in verbose mode or not (verbose). The llm parameter that is passed to the load_summarize_chain function is of type BaseLanguageModel, which can be implemented by any language model, including the Azure OpenAI model. The llm is then passed to the loading functions and used to create the chains.

You can find more details about the load_summarize_chain function in the libs/langchain/langchain/chains/summarize/__init__.py file.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.


glejdis commented 9 months ago

@GonzaloMoreno did you find a solution?

dosubot[bot] commented 9 months ago

🤖

Based on the information you've provided, it seems like the warning "model not found" is being raised because the specified model "gpt-35-turbo-16k" is not found in the tiktoken package's list of supported models. This happens in the get_token_ids method of the BaseOpenAI class in the LangChain codebase.

Here is the relevant code snippet:

try:
    enc = tiktoken.encoding_for_model(model_name)
except KeyError:
    logger.warning("Warning: model not found. Using cl100k_base encoding.")
    model = "cl100k_base"
    enc = tiktoken.get_encoding(model)

You can find this code in the libs/community/langchain_community/llms/openai.py file in the LangChain repository.

To resolve this warning, you can either use a model that is supported by the tiktoken package, or you can ignore the warning if you are okay with the fallback model "cl100k_base" being used.
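If you choose to ignore it, you can also silence the logger that emits the warning — a sketch, assuming the warning comes from the module path quoted above (the exact logger name depends on your installed LangChain version):

import logging

# Suppress warnings from the module that emits the fallback message;
# adjust the name for your version (e.g. "langchain.llms.openai").
logging.getLogger("langchain_community.llms.openai").setLevel(logging.ERROR)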

As for the trustworthiness of the output when the model is not found: the encoding is only used to count tokens (for example, when map_reduce decides whether documents need collapsing); the summary text itself still comes from your Azure deployment. The fallback "cl100k_base" happens to be the encoding used by the gpt-3.5-turbo and gpt-4 families, so for "gpt-35-turbo-16k" the token counts, and hence the output, can still be trusted. For a model that uses a different encoding, the counts could be off, and a tiktoken-supported model name would be safer.

Regarding the load_summarize_chain function, it is used to load a specific type of summarizing chain based on the chain_type argument provided. The summarizing chain is used for processing and summarizing documents. The function interacts with the specified model (llm) by passing it to the corresponding chain loading function (_load_stuff_chain, _load_map_reduce_chain, or _load_refine_chain). These functions use the model to create instances of LLMChain, which are then used to create instances of the specific chain types (StuffDocumentsChain, MapReduceDocumentsChain, or RefineDocumentsChain).
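As a rough sketch of that dispatch (simplified; the real loader also wires prompts and the document-combining chains around the LLMChain instances):

def load_summarize_chain(llm, chain_type="stuff", verbose=False, **kwargs):
    # Each loader builds an LLMChain around `llm` and wraps it in the
    # matching documents chain.
    loader_mapping = {
        "stuff": _load_stuff_chain,
        "map_reduce": _load_map_reduce_chain,
        "refine": _load_refine_chain,
    }
    if chain_type not in loader_mapping:
        raise ValueError(f"Got unsupported chain type: {chain_type}")
    return loader_mapping[chain_type](llm, verbose=verbose, **kwargs)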

You can find more details about this function in the libs/langchain/langchain/chains/summarize/__init__.py file in the LangChain repository.

I hope this helps! If you have any more questions, feel free to ask.


nyck33 commented 9 months ago

+1 I get similar errors https://github.com/langchain-ai/langchain/issues/17409#issue-2130090661

dosubot[bot] commented 6 months ago

Hi, @GonzaloMoreno,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From the discussion, it seems that Dosubot provided detailed guidance on resolving the "model not found" warning by checking the model deployment and API version, modifying the modelname_to_contextsize method, adding the model to the model_token_mapping dictionary, and providing insights into the load_summarize_chain function. You verified the deployment name and API version, and the issue seems to occur specifically with the "map_reduce" method.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation.

auserwn commented 5 months ago

Warning: model not found. Using cl100k_base encoding.
(printed seven times)

How do I resolve this problem?

amzfc commented 6 days ago

Debugging process

openai.py line 697: self.embed_documents([text])[0]
openai.py line 668: self._get_len_safe_embeddings(texts, engine=engine)
openai.py line 459: the encoding is selected according to the model name (different models use different encodings)

Conclusions

1. When OpenAIEmbeddings is initialized without the model parameter, it defaults to the cl100k_base encoding.
2. Also note that OpenAI models use three encodings: gpt2, p50k_base, and cl100k_base. If you are using a non-OpenAI model, for example one downloaded from Hugging Face, it has its own tokenizer and vocabulary (tokenizer.json, tokenizer_config.json, vocab.txt); encoding it with cl100k_base will produce incorrect vectorization results.

Test

def tiktoken_token(text):
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    enc_output = enc.encode(text)
    return enc_output

def huggface_download_token(text):
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path="your download file path"
    )
    tokenized = tokenizer.encode(text, add_special_tokens=False)
    return tokenized

The two functions produce different results.