langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

DOC: Summarization 'map_reduce' - Can't load tokenizer for 'gpt2' #15347

Open analyticsinsights opened 8 months ago

analyticsinsights commented 8 months ago

Issue with current documentation:

The documentation describes the different options for summarizing a text; for longer texts, the 'map_reduce' option is suggested. Further down, under 'Go deeper', it is mentioned that different LLMs can be used via the llm parameter. This works well with the code below when chain_type='stuff' is used, in particular with a local model (the example below uses one).

from langchain.document_loaders import PyPDFLoader
from langchain.llms import CTransformers
from langchain.chains.summarize import load_summarize_chain

# load a PDF-file
loader = PyPDFLoader("C:/xyz.pdf")
docs = loader.load()

# use a local LLAMA2 model
llm = CTransformers(
    model='./models/llama-2-7b-chat.Q5_K_M.gguf',
    model_type='llama',
    config={'context_length': 4096, 'max_new_tokens': 256, 'temperature': 0},
    local_files_only=True,
)

# summarise the text (this works only if it fits into the context length of ~4000 tokens)
chain = load_summarize_chain(llm, chain_type="stuff")
chain.run(docs)

However, surprisingly, it returns the following error when using the chain_type='map_reduce' parameter: 'OSError: Can't load tokenizer for 'gpt2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'gpt2' is the correct path to a directory containing all relevant files for a GPT2TokenizerFast tokenizer.'

The suggestion mentioned in this issue on Github doesn't work for the local model used above. It would be great to have more specific information in the LangChain documentation on (1) how to perform text summarization with LangChain using different LLMs, and (2) how to do so specifically with local models, without requiring an internet connection or the gpt2 model. Since the code above works with chain_type='stuff' but not with chain_type='map_reduce', it would be important to explain what happens under the hood so users can make this work for local models.
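
One detail that may explain this: with chain_type='map_reduce', the chain has to count tokens to decide how many mapped summaries fit into each combine step, and LangChain's base LLM class implements get_num_tokens by downloading a GPT-2 tokenizer from Hugging Face by default. Below is a minimal sketch of a possible offline workaround, assuming a tokenizer for the local model has been downloaded beforehand; the subclass name and tokenizer path are placeholders, not from this issue.

from langchain.llms import CTransformers
from transformers import AutoTokenizer

class OfflineCTransformers(CTransformers):
    # Hypothetical subclass (not from this issue): count tokens with a locally
    # stored tokenizer instead of LangChain's default GPT-2 fallback, which
    # otherwise triggers a download from Hugging Face.
    def get_num_tokens(self, text: str) -> int:
        tokenizer = AutoTokenizer.from_pretrained(
            "./models/llama-2-tokenizer",  # placeholder: any tokenizer downloaded beforehand
            local_files_only=True,
        )
        return len(tokenizer.encode(text))

An instance of this subclass could then replace the plain CTransformers object above and be passed to load_summarize_chain(llm, chain_type="map_reduce").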

Idea or request for content:

No response

dosubot[bot] commented 8 months ago

🤖

It's great to see your interest in contributing to LangChain! Your detailed description of the issue and the specific use case is very helpful. It would be fantastic if you could consider creating a pull request to improve the documentation based on your experience. Your insights and suggested improvements would be valuable for other users facing similar challenges. Thank you for considering this contribution!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

analyticsinsights commented 8 months ago

Not sure how the bot's comment helps. Could someone please take a look?

The code below is another example that fully reproduces the error. The pipeline points to a specific tokenizer and model (t5-small is used here as an example), but when summarization_chain.run(docs[20:23]) is executed, LangChain still tries to download the gpt2 model (as mentioned in my previous post, this appears to happen only with chain_type='map_reduce', not with chain_type='stuff').

The error message 'OSError: Can't load tokenizer for 'gpt2'...' is raised when the internet connection is switched off before executing the last line of code. With an internet connection, LangChain downloads the gpt2 model and uses it to perform the text summarization task.

The idea behind the code below is to perform a summarization of long texts using a specific model/tokenizer combination downloaded beforehand, so that the code can be executed offline.

How can this be achieved?

import requests
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

url = "https://www.gutenberg.org/cache/epub/71224/pg71224.txt"
response = requests.get(url)
response.raise_for_status()  # fail loudly instead of silently leaving `data` undefined
data = response.text

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"],
    chunk_size=1000,
    chunk_overlap=200,
)
docs = text_splitter.create_documents([data])

# download (or load from the local cache) the t5-small tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# wrap the transformers summarization pipeline so LangChain can use it as an LLM
pipe_summary = pipeline("summarization", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe_summary)

summarization_chain = load_summarize_chain(llm=llm, chain_type='map_reduce')

# if the internet connection is switched off here, the following line raises an
# error because the attempt to download the gpt2 model from Hugging Face fails
summarization_chain.run(docs[20:23])
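
One possible offline workaround (a sketch, not from the original comment): since map_reduce only needs the gpt2 tokenizer for token counting, it can be cached once while online, after which transformers resolves it from the local cache.

# Step 1 (run once while online): cache the tokenizer that LangChain's
# default token counting falls back to; it lands under ~/.cache/huggingface.
from transformers import GPT2TokenizerFast
GPT2TokenizerFast.from_pretrained("gpt2")

# Step 2 (offline): with the cache populated, the map_reduce call should now
# resolve 'gpt2' locally; exporting TRANSFORMERS_OFFLINE=1 before starting
# Python additionally stops transformers from attempting any network access.
summarization_chain.run(docs[20:23])
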
dosubot[bot] commented 8 months ago

It seems like this issue requires a deeper look, @baskaryan could you please lend your expertise?

analyticsinsights commented 8 months ago

> It seems like this issue requires a deeper look, @baskaryan could you please lend your expertise?

Any reactions?

AhmadHakami commented 7 months ago

any updates?

jsxyhelu commented 6 months ago

any updates? why?

YoloZyk commented 6 months ago

any solutions? help!

CVer2022 commented 6 months ago

This problem was encountered a long time ago, but it is still unresolved and relevant.

SubrataSarkar32 commented 6 months ago

Replace llm = HuggingFacePipeline(pipeline=pipe_summary) with:

llm = HuggingFacePipeline(model_id="google-t5/t5-small", pipeline=pipe_summary, model_kwargs={"pretrained_model_name_or_path": "google-t5/t5-small"})

Or, if you are using locally downloaded model files:

llm = HuggingFacePipeline(model_id="google-t5/t5-small", pipeline=pipe_summary, model_kwargs={"pretrained_model_name_or_path": "/your/local/downloaded/path/models--google-t5--t5-small", "local_files_only": True})

This is happening because the file https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/huggingface_pipeline.py sets DEFAULT_MODEL_ID = "gpt2", and that default is not updated when a ready-made pipeline is passed in.
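
A related alternative (a sketch, assuming the HuggingFacePipeline.from_model_id classmethod available in langchain_community, which builds the transformers pipeline itself and records the model_id, so the gpt2 default should not apply):

from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="google-t5/t5-small",
    task="summarization",
    # assumption: the model files are already in the local Hugging Face cache
    model_kwargs={"local_files_only": True},
)

Whether this also fixes the offline token counting may depend on the LangChain version, so treat it as a starting point rather than a confirmed fix.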

joostjansenn commented 5 months ago

> Replace llm = HuggingFacePipeline(pipeline=pipe_summary) with llm = HuggingFacePipeline(model_id="google-t5/t5-small", pipeline=pipe_summary, model_kwargs={"pretrained_model_name_or_path": "google-t5/t5-small"})
>
> Or, if you are using locally downloaded model files: llm = HuggingFacePipeline(model_id="google-t5/t5-small", pipeline=pipe_summary, model_kwargs={"pretrained_model_name_or_path": "/your/local/downloaded/path/models--google-t5--t5-small", "local_files_only": True})
>
> This is happening because the file https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/huggingface_pipeline.py sets DEFAULT_MODEL_ID = "gpt2", and that default is not updated when a ready-made pipeline is passed in.

Can others confirm that this is the right solution?

Civel-1 commented 3 months ago

Still encountering this problem when working with a custom LLM class that wraps an OpenAI-compatible vLLM server. Any leads?
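
For custom LLM classes like this, one avenue (a sketch, not a confirmed fix from this thread) is to supply the token counter explicitly: recent versions of langchain-core expose a custom_get_token_ids field on the base language model, which get_num_tokens uses in place of the default GPT-2 fallback. The class name and tokenizer path below are hypothetical:

from transformers import AutoTokenizer

# assumption: a tokenizer matching the served model is stored locally
tokenizer = AutoTokenizer.from_pretrained("/path/to/local/tokenizer", local_files_only=True)

llm = MyVLLMCompatibleLLM(  # hypothetical custom LLM class wrapping the vLLM server
    custom_get_token_ids=lambda text: tokenizer.encode(text),
)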