Closed gancancode closed 5 months ago
🤖
Hello @gancancode,
Nice to meet you! I'm Dosu, a bot designed to assist with the LangChain repository. I'm here to help you debug issues, answer any questions you might have, and guide you if you're interested in contributing. While you're waiting for a human maintainer, I'm at your service!
Don't worry, I'm currently looking into your issue and will provide a more comprehensive response shortly.
🤖
Based on the error message you provided, it seems that the issue arises from the `get_tokenizer()` function in the `language_model.py` file. This function is currently set to load the GPT-2 tokenizer from the Hugging Face model hub, which requires an internet connection. To use a locally hosted tokenizer, you would need to modify this function to load your local tokenizer instead.
Here's an example of how you might modify the `get_tokenizer()` function to load a local tokenizer:
```python
@lru_cache(maxsize=None)  # Cache the tokenizer
def get_tokenizer() -> Any:
    # Create a local tokenizer instance
    return YourLocalTokenizer.from_pretrained("/path/to/your/local/tokenizer")
```
In this example, `YourLocalTokenizer` should be replaced with the class of your local tokenizer, and `"/path/to/your/local/tokenizer"` should be replaced with the path to your local tokenizer files.
Additionally, you might need to modify the `TokenTextSplitter` and `SentenceTransformersTokenTextSplitter` classes in the `text_splitter.py` file to use your local tokenizer, as I mentioned in the context above.
Please note that these modifications would need to be done in a local copy of the LangChain codebase, and you would need to ensure that your local tokenizer is compatible with the LangChain framework.
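The `@lru_cache` decorator in the snippet above is what makes the tokenizer load only once per process. Its effect can be sketched with a hypothetical stand-in tokenizer class (no `transformers` dependency; `LocalTokenizerStub` and the path are placeholders, not real LangChain or Hugging Face names):

```python
from functools import lru_cache

class LocalTokenizerStub:
    """Hypothetical stand-in for a real tokenizer class such as GPT2TokenizerFast."""

    load_count = 0  # tracks how many times from_pretrained() actually runs

    @classmethod
    def from_pretrained(cls, path):
        cls.load_count += 1  # simulate an expensive load from disk
        instance = cls()
        instance.path = path
        return instance

@lru_cache(maxsize=None)  # cache the result so the tokenizer is built only once
def get_tokenizer():
    return LocalTokenizerStub.from_pretrained("/path/to/your/local/tokenizer")

first = get_tokenizer()
second = get_tokenizer()
print(first is second, LocalTokenizerStub.load_count)  # → True 1
```

Because the cached function takes no arguments, every caller in the process shares the same tokenizer instance, and the (potentially slow) load from disk happens exactly once.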
I hope this helps! If you have any further questions or run into any issues, please let me know.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@dosu-bot, please provide sample code for `YourLocalTokenizer()`, assuming the gpt2 tokenizer is already in a local path.
🤖
Based on the information you've provided, it seems like you're trying to load a locally hosted GPT-2 tokenizer. The error message you're receiving indicates that the tokenizer is trying to be loaded from an online source, which is not possible in your offline setting.
The current implementation of the `get_tokenizer()` function in the `language_model.py` file is designed to load the tokenizer from an online source. To modify this function to load a local tokenizer, you would need to change the `from_pretrained("gpt2")` call to point to the local directory where the 'gpt2' tokenizer files are stored.
Here's an example of how you can modify the `get_tokenizer()` function to load a local tokenizer:
```python
@lru_cache(maxsize=None)  # Cache the tokenizer
def get_tokenizer() -> Any:
    try:
        from transformers import GPT2TokenizerFast
    except ImportError:
        raise ImportError(
            "Could not import transformers python package. "
            "This is needed in order to calculate get_token_ids. "
            "Please install it with `pip install transformers`."
        )
    # Create a GPT-2 tokenizer instance from a local directory
    return GPT2TokenizerFast.from_pretrained("/path/to/your/local/directory")
```
In this example, replace `"/path/to/your/local/directory"` with the exact local path where your 'gpt2' tokenizer files are stored.
Please note that the local directory should contain the necessary files for a `GPT2TokenizerFast` tokenizer. Typically, these files are `vocab.json`, `merges.txt`, `tokenizer_config.json`, and `special_tokens_map.json`.
If you're unsure about the exact local path where your 'gpt2' tokenizer files are stored, you might need to check your local file system or the documentation/source from where you obtained the local 'gpt2' tokenizer.
I hope this helps! If you have any other questions or need further clarification, feel free to ask.
@dosu-bot, after replacing the path with the local path where the 'gpt2' tokenizer files are stored, the errors below are seen. Please advise on the resolution.
```
~\Documents\Wheels\transformers\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1794     if resolved_config_file is not None:
   1795         with open(resolved_config_file, encoding="utf-8") as reader:
-> 1796             tokenizer_config = json.load(reader)
   1797         if "fast_tokenizer_files" in tokenizer_config:
   1798             fast_tokenizer_file = get_fast_tokenizer_file(tokenizer_config["fast_tokenizer_files"])

c:\Program Files\Anaconda3\lib\json\__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    291     kwarg; otherwise ``JSONDecoder`` is used.
    292     """
--> 293     return loads(fp.read(),
    294         cls=cls, object_hook=object_hook,
    295         parse_float=parse_float, parse_int=parse_int,

c:\Program Files\Anaconda3\lib\json\__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    344     parse_int is None and parse_float is None and
    345     parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347     if cls is None:
    348         cls = JSONDecoder

c:\Program Files\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
    335
    336     """
--> 337     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338     end = _w(s, end).end()
    339     if end != len(s):

c:\Program Files\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
    353     obj, end = self.scan_once(s, idx)
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356     return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
🤖
The `JSONDecodeError: Expecting value: line 1 column 1 (char 0)` error typically indicates that the JSON file you're trying to parse is empty or not correctly formatted. Given that the error originates from the `from_pretrained()` method when it's trying to load the tokenizer from the local directory, it's likely that one of the tokenizer files in your local directory is either empty or incorrectly formatted.
Here are a few steps you can take to troubleshoot this issue:
1. **Check the contents of the tokenizer files**: Open the `vocab.json`, `merges.txt`, `tokenizer_config.json`, and `special_tokens_map.json` files in your local directory and ensure that they are not empty and are correctly formatted. The `vocab.json` file should be a JSON object where each key is a token and the corresponding value is the token's ID. The `merges.txt` file should be a text file where each line is a token merge rule. The `tokenizer_config.json` and `special_tokens_map.json` files should be JSON objects with the tokenizer's configuration and special tokens map, respectively.
2. **Check the path to the tokenizer files**: Ensure that the path you're passing to the `from_pretrained()` method is correct. The path should be a string that specifies the directory where the tokenizer files are located. If the path is incorrect, the `from_pretrained()` method might be trying to load the wrong file, which could result in a `JSONDecodeError`.
3. **Check the file permissions**: Ensure that you have the necessary permissions to read the tokenizer files from the local directory. If you don't have the necessary permissions, you might not be able to open the tokenizer files, which could also result in a `JSONDecodeError`.
If you've checked all of these things and you're still encountering the `JSONDecodeError`, it would be helpful if you could provide more information about the contents of your tokenizer files and the exact path you're passing to the `from_pretrained()` method. This will help us better understand the issue and provide a more accurate solution.
Hi, @gancancode,
I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you were seeking advice on modifying the code to use local general tokenizers, not just the GPT-2 tokenizer, in an offline setting for chatbot development with LangChain's ConversationalRetrievalChain or RetrievalQA. Dosu-bot provided guidance on modifying the `get_tokenizer()` function in the `language_model.py` file to load a local tokenizer instead of the GPT-2 tokenizer. After attempting the suggested modifications, you encountered errors and requested further advice on resolving the issues. Dosu-bot responded with troubleshooting steps to address the `JSONDecodeError` and requested more information about the contents of the tokenizer files and the exact path being passed to the `from_pretrained()` method.
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days. Thank you!
@gancancode, were you able to use a local LLM?
System Info
I'm using a locally hosted LLM and want to apply LangChain's ConversationalRetrievalChain or RetrievalQA in an offline setting for chatbot development; however, there is an error because the current configuration does not support a locally hosted tokenizer.
I would appreciate advice on the modifications required to use local tokenizers in general (not just the gpt2 tokenizer) in an offline setting.
Who can help?
@hwchase17 @agola11
Expected behavior
The prompt should obtain a chatbot response from the LLM via the retrieval-augmented generation methods (ConversationalRetrievalChain or RetrievalQA) in LangChain, but it fails to do so because the current configuration cannot support a local tokenizer.