langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

ConversationalRetrievalChain using local LLM models and tokenizers #12982

Closed · gancancode closed this issue 5 months ago

gancancode commented 8 months ago

System Info

I'm using a locally hosted LLM and want to apply LangChain's ConversationalRetrievalChain or RetrievalQA in an offline setting for chatbot development; however, this raises an error because the current configuration does not support a locally hosted tokenizer.

I would appreciate advice on the code modifications required to use local tokenizers in general (not just the gpt2 tokenizer, but any tokenizer) in an offline setting.

> The error message is as follows:
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> C:\Users\MAS_RA~1\AppData\Local\Temp/ipykernel_3976/1814811930.py in
>      18     if query == '':
>      19         continue
> ---> 20     result = llama2_7B_qa(
>      21         {"question": query, "chat_history": chat_history})
>      22     print(f"{blue}Answer: " + result["answer"])
> ~\Documents\Wheels\langchain\chains\base.py in __call__(self, inputs, return_only_outputs, callbacks, tags, metadata, run_name, include_run_info)
>     290         except BaseException as e:
>     291             run_manager.on_chain_error(e)
> --> 292             raise e
>     293         run_manager.on_chain_end(outputs)
>     294         final_outputs: Dict[str, Any] = self.prep_outputs(
> ~\Documents\Wheels\langchain\chains\base.py in __call__(self, inputs, return_only_outputs, callbacks, tags, metadata, run_name, include_run_info)
>     284         try:
>     285             outputs = (
> --> 286                 self._call(inputs, run_manager=run_manager)
>     287                 if new_arg_supported
>     288                 else self._call(inputs)
> ~\Documents\Wheels\langchain\chains\conversational_retrieval\base.py in _call(self, inputs, run_manager)
>     132         )
>     133         if accepts_run_manager:
> --> 134             docs = self._get_docs(new_question, inputs, run_manager=_run_manager)
>     135         else:
>     136             docs = self._get_docs(new_question, inputs)  # type: ignore[call-arg]
> ~\Documents\Wheels\langchain\chains\conversational_retrieval\base.py in _get_docs(self, question, inputs, run_manager)
>     287             question, callbacks=run_manager.get_child()
>     288         )
> --> 289         return self._reduce_tokens_below_limit(docs)
>     290
>     291     async def _aget_docs(
> ~\Documents\Wheels\langchain\chains\conversational_retrieval\base.py in _reduce_tokens_below_limit(self, docs)
>     265             self.combine_docs_chain, StuffDocumentsChain
>     266         ):
> --> 267             tokens = [
>     268                 self.combine_docs_chain.llm_chain.llm.get_num_tokens(doc.page_content)
>     269                 for doc in docs
> ~\Documents\Wheels\langchain\chains\conversational_retrieval\base.py in (.0)
>     266         ):
>     267             tokens = [
> --> 268                 self.combine_docs_chain.llm_chain.llm.get_num_tokens(doc.page_content)
>     269                 for doc in docs
>     270             ] 
> ~\Documents\Wheels\langchain\schema\language_model.py in get_num_tokens(self, text)
>     252             The integer number of tokens in the text.
>     253         """
> --> 254         return len(self.get_token_ids(text))
>     255
>     256     def get_num_tokens_from_messages(self, messages: List[BaseMessage]) -> int:
> ~\Documents\Wheels\langchain\schema\language_model.py in get_token_ids(self, text)
>     239                 in the text.
>     240         """
> --> 241         return _get_token_ids_default_method(text)
>     242
>     243     def get_num_tokens(self, text: str) -> int:
> ~\Documents\Wheels\langchain\schema\language_model.py in _get_token_ids_default_method(text)
>      42     """Encode the text into token IDs."""
>      43     # get the cached tokenizer
> ---> 44     tokenizer = get_tokenizer()
>      45
>      46     # tokenize the text using the GPT-2 tokenizer
> ~\Documents\Wheels\langchain\schema\language_model.py in get_tokenizer()
>      36         )
>      37     # create a GPT-2 tokenizer instance
> ---> 38     return GPT2TokenizerFast.from_pretrained("gpt2")
>      39
>      40
> ~\Documents\Wheels\transformers\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
>    1836
>    1837         if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
> -> 1838             raise EnvironmentError(
>    1839                 f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
>    1840                 "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
> OSError: Can't load tokenizer for 'gpt2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'gpt2' is the correct path to a directory containing all relevant files for a GPT2TokenizerFast tokenizer.

Who can help?

@hwchase17 @agola11

Reproduction

# ConversationalRetrievalChain

import sys

from langchain.chains import ConversationalRetrievalChain
from langchain.llms import CTransformers

# vectordb is a pre-built vector store (constructed separately) exposing .as_retriever()
LocalHostedInteractiveBot = ConversationalRetrievalChain.from_llm(
    CTransformers(model="./models/llama-2-7b-chat.Q5_K_M.gguf", model_type="llama"),
    vectordb.as_retriever(search_kwargs={'k': 6}),
    return_source_documents=True,
    verbose=False,
    max_tokens_limit=1000  # triggers token counting on the retrieved documents
)

# Terminal interaction with the locally hosted LLM
chat_history = []

while True:
    query = input("Prompt: ")
    if query == "exit":
        print("Bye bye")
        sys.exit()
    if query == "":
        continue
    result = LocalHostedInteractiveBot(
        {"question": query, "chat_history": chat_history})
    print("Question: " + query)
    print("Answer: " + result["answer"])
    chat_history.append((query, result["answer"]))
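
Note that vectordb is not defined in the snippet above; it is assumed to be a pre-built vector store. For reference, a minimal hypothetical sketch of how it might be built fully offline, assuming a locally downloaded sentence-transformers embedding model and FAISS (all paths below are placeholders):

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Load and split the source documents (path is hypothetical)
docs = TextLoader("./data/knowledge_base.txt", encoding="utf-8").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# Embeddings from a locally saved sentence-transformers model (path is hypothetical)
embeddings = HuggingFaceEmbeddings(model_name="./models/all-MiniLM-L6-v2")

# Build the vector store used as the retriever above
vectordb = FAISS.from_documents(chunks, embeddings)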

Expected behavior

The prompt should return a chatbot response from the LLM via LangChain's retrieval-augmented generation chains (ConversationalRetrievalChain or RetrievalQA), but it fails because the current configuration cannot use a locally hosted tokenizer.

dosubot[bot] commented 8 months ago

🤖

Hello @gancancode,

Nice to meet you! I'm Dosu, a bot designed to assist with the LangChain repository. I'm here to help you debug issues, answer any questions you might have, and guide you if you're interested in contributing. While you're waiting for a human maintainer, I'm at your service!

Don't worry, I'm currently looking into your issue and will provide a more comprehensive response shortly.

dosubot[bot] commented 8 months ago

🤖

Based on the error message you provided, it seems that the issue arises from the get_tokenizer() function in the language_model.py file. This function is currently set to load the GPT-2 tokenizer from the Hugging Face model hub, which requires an internet connection. To use a locally hosted tokenizer, you would need to modify this function to load your local tokenizer instead.

Here's an example of how you might modify the get_tokenizer() function to load a local tokenizer:

@lru_cache(maxsize=None)  # Cache the tokenizer
def get_tokenizer() -> Any:
    # create a local tokenizer instance
    return YourLocalTokenizer.from_pretrained("/path/to/your/local/tokenizer")

In this example, YourLocalTokenizer should be replaced with the class of your local tokenizer, and "/path/to/your/local/tokenizer" should be replaced with the path to your local tokenizer.

Additionally, you might need to modify the TokenTextSplitter and SentenceTransformersTokenTextSplitter classes in the text_splitter.py file to use your local tokenizer, as I mentioned in the context above.

Please note that these modifications would need to be done in a local copy of the LangChain codebase, and you would need to ensure that your local tokenizer is compatible with the LangChain framework.
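
As a lighter-weight alternative to editing the splitter classes themselves, LangChain's text splitters can be constructed from a Hugging Face tokenizer object via from_huggingface_tokenizer. A minimal sketch, assuming the gpt2 tokenizer files are already available at a local path (the path below is hypothetical):

from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

# Load the tokenizer from a local directory instead of the Hugging Face Hub
local_tokenizer = GPT2TokenizerFast.from_pretrained("/path/to/local/gpt2")

# Build a splitter whose length function counts tokens with the local tokenizer
splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    local_tokenizer, chunk_size=500, chunk_overlap=50
)
chunks = splitter.split_text("some long document text ...")

Note that this only covers token-aware text splitting; the get_tokenizer() patch described above is still needed for the chain's max_tokens_limit accounting, which goes through llm.get_num_tokens().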

I hope this helps! If you have any further questions or run into any issues, please let me know.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

gancancode commented 8 months ago

@dosu-bot, please provide sample code for YourLocalTokenizer(), assuming the gpt2 tokenizer is already stored at a local path.

dosubot[bot] commented 8 months ago

🤖

Based on the information you've provided, it seems like you're trying to load a locally hosted GPT-2 tokenizer. The error message you're receiving indicates that the tokenizer is trying to be loaded from an online source, which is not possible in your offline setting.

The current implementation of the get_tokenizer() function in the language_model.py file is designed to load the tokenizer from an online source. To modify this function to load a local tokenizer, you would need to change the from_pretrained("gpt2") method to point to your local directory where the 'gpt2' tokenizer files are stored.

Here's an example of how you can modify the get_tokenizer() function to load a local tokenizer:

@lru_cache(maxsize=None)  # Cache the tokenizer
def get_tokenizer() -> Any:
    try:
        from transformers import GPT2TokenizerFast
    except ImportError:
        raise ImportError(
            "Could not import transformers python package. "
            "This is needed in order to calculate get_token_ids. "
            "Please install it with `pip install transformers`."
        )
    # create a GPT-2 tokenizer instance from a local directory
    return GPT2TokenizerFast.from_pretrained("/path/to/your/local/directory")

In this example, replace "/path/to/your/local/directory" with the exact local path where your 'gpt2' tokenizer files are stored.

Please note that the local directory should contain the necessary files for a GPT2TokenizerFast tokenizer. Typically, these files are vocab.json, merges.txt, tokenizer_config.json, and special_tokens_map.json.

If you're unsure about the exact local path where your 'gpt2' tokenizer files are stored, you might need to check your local file system or the documentation/source from where you obtained the local 'gpt2' tokenizer.
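
If a complete local copy of the tokenizer does not exist yet, one way to produce such a directory is to download and save the tokenizer once on a machine that has internet access, then copy the folder to the offline environment. A minimal sketch, assuming transformers is installed (the target path is hypothetical):

from transformers import GPT2TokenizerFast

# On a machine with internet access: download the tokenizer once ...
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# ... and write its files (typically vocab.json, merges.txt, tokenizer_config.json,
# special_tokens_map.json, tokenizer.json) into a directory that can be copied
# to the offline machine
tokenizer.save_pretrained("/path/to/your/local/directory")

GPT2TokenizerFast.from_pretrained("/path/to/your/local/directory") should then load without any network access.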

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

gancancode commented 8 months ago

@dosu-bot, after replacing the path with the local path where the 'gpt2' tokenizer files are stored, the errors below appear. Please advise on how to resolve them.


~\Documents\Wheels\transformers\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1794                 if resolved_config_file is not None:
   1795                     with open(resolved_config_file, encoding="utf-8") as reader:
-> 1796                         tokenizer_config = json.load(reader)
   1797                         if "fast_tokenizer_files" in tokenizer_config:
   1798                             fast_tokenizer_file = get_fast_tokenizer_file(tokenizer_config["fast_tokenizer_files"])

c:\Program Files\Anaconda3\lib\json\__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    291     kwarg; otherwise ``JSONDecoder`` is used.
    292     """
--> 293     return loads(fp.read(),
    294         cls=cls, object_hook=object_hook,
    295         parse_float=parse_float, parse_int=parse_int,

c:\Program Files\Anaconda3\lib\json\__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    344             parse_int is None and parse_float is None and
    345             parse_constant is None and object_pairs_hook is None and not kw):
--> 346         return _default_decoder.decode(s)
    347     if cls is None:
    348         cls = JSONDecoder

c:\Program Files\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
    335
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

c:\Program Files\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

dosubot[bot] commented 8 months ago

🤖

The JSONDecodeError: Expecting value: line 1 column 1 (char 0) error typically indicates that the JSON file you're trying to parse is empty or not correctly formatted. Given that the error is originating from the from_pretrained() method when it's trying to load the tokenizer from the local directory, it's likely that one of the tokenizer files in your local directory is either empty or incorrectly formatted.

Here are a few steps you can take to troubleshoot this issue:

  1. Check the contents of the tokenizer files: Open the vocab.json, merges.txt, tokenizer_config.json, and special_tokens_map.json files in your local directory and ensure that they are not empty and are correctly formatted. The vocab.json file should be a JSON object where each key is a token and the corresponding value is the token's ID. The merges.txt file should be a text file where each line is a token merge rule. The tokenizer_config.json and special_tokens_map.json files should be JSON objects with the tokenizer's configuration and special tokens map, respectively.

  2. Check the path to the tokenizer files: Ensure that the path you're passing to the from_pretrained() method is correct. The path should be a string that specifies the directory where the tokenizer files are located. If the path is incorrect, the from_pretrained() method might be trying to load a non-existent file, which would result in a JSONDecodeError.

  3. Check the file permissions: Ensure that you have the necessary permissions to read the tokenizer files from the local directory. If you don't have the necessary permissions, you might not be able to open the tokenizer files, which could result in a JSONDecodeError.

If you've checked all of these things and you're still encountering the JSONDecodeError, it would be helpful if you could provide more information about the contents of your tokenizer files and the exact path you're passing to the from_pretrained() method. This will help us better understand the issue and provide a more accurate solution.
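
To quickly run the check from step 1 above, a short Python snippet can confirm that each JSON file in the local tokenizer directory exists, is non-empty, and parses correctly; a file that was saved as an HTML page rather than raw JSON, for example, would fail here just as it does inside from_pretrained(). The directory path below is hypothetical:

import json
from pathlib import Path

# Hypothetical path to the local tokenizer directory
tokenizer_dir = Path("/path/to/your/local/directory")

for name in ["vocab.json", "tokenizer_config.json", "special_tokens_map.json"]:
    path = tokenizer_dir / name
    if not path.exists():
        print(f"MISSING: {name}")
        continue
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
        print(f"OK: {name}")
    except json.JSONDecodeError as e:
        print(f"INVALID JSON in {name}: {e}")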

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] commented 5 months ago

Hi, @gancancode,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you were seeking advice on modifying the code to use local general tokenizers, not just the GPT-2 tokenizer, in an offline setting for chatbot development with Langchain's ConversationalRetrievalChain or RetrievalQA. Dosu-bot provided guidance on modifying the get_tokenizer() function in the language_model.py file to load a local tokenizer instead of the GPT-2 tokenizer. After attempting the suggested modifications, you encountered errors and requested further advice on resolving the issues. Dosu-bot responded with troubleshooting steps to address the JSONDecodeError and requested more information about the contents of the tokenizer files and the exact path being passed to the from_pretrained() method.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days. Thank you!

AnuraagKhare commented 4 months ago

@gancancode , were you able to use local LLM ?