Closed ZinanYang1995 closed 9 months ago
🤖
Hello again, @ZinanYang1995! It's great to see you diving deeper into the world of Pinecone and LangChain. I'm here to assist you, as always. Let's take a look at your code.
Based on your code and the requirements you've mentioned, you can modify your code to check if a PDF has already been stored as embeddings in Pinecone. Here's a general approach:
Generate a unique identifier for each PDF. This could be a hash of the file's contents, or a UUID if the same PDF can be uploaded multiple times and should be treated as a different document each time.
Before adding the PDF's embeddings to Pinecone, check if an embedding with the same identifier already exists in Pinecone.
If the embedding already exists, use the existing embedding. If not, generate a new embedding and add it to Pinecone.
Here's a rough example of how you might implement this:
import hashlib

from langchain.embeddings.openai import OpenAIEmbeddings

# Generate a unique identifier for the PDF from its raw bytes
pdf_content = pdf.read()
pdf_hash = hashlib.sha256(pdf_content).hexdigest()

# Check if the PDF's embedding already exists in Pinecone; fetch() returns
# an empty 'vectors' mapping when the ID is not present in the index
existing = index.fetch(ids=[pdf_hash])
if existing['vectors']:
    # The embedding already exists, so reuse it
    embedding = existing['vectors'][pdf_hash]['values']
else:
    # The embedding doesn't exist, so generate it and add it to Pinecone.
    # Embeddings are computed from text, not raw bytes, so pdf_text stands
    # for text extracted from the PDF (e.g. with PyPDF2).
    embedding = OpenAIEmbeddings(model="text-embedding-ada-002").embed_query(pdf_text)
    index.upsert(vectors=[(pdf_hash, embedding)])
Please note that this is a simplified example and may not work as-is. You'll need to adapt it to fit your specific use case and handle potential errors.
Also, please note that this approach assumes that Pinecone allows you to check whether a vector with a specific ID exists and to retrieve it by that ID. The client's index.fetch(ids=[...]) call can do both, returning an empty result for unknown IDs, but you should still consult Pinecone's documentation or API reference to confirm the exact behavior of the client version you're running.
As for handling PDF files in LangChain, it uses various loaders to load and process PDF documents. The specific way embeddings are handled depends on the loader and the implementation of its methods. For example, you're using the UnstructuredPDFLoader in your code to load the PDF file and return Document objects, which can then be used for further processing.
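For reference, here is a minimal sketch of that loader's usage; the path 'example.pdf' is a placeholder, since UnstructuredPDFLoader reads from a file path rather than an in-memory upload:

from langchain.document_loaders import UnstructuredPDFLoader

# Load the PDF into a list of LangChain Document objects
loader = UnstructuredPDFLoader("example.pdf")
documents = loader.load()

# Each Document carries the extracted text plus metadata
print(len(documents), documents[0].page_content[:100])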
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@dosu-bot
Thank you for your response. I want to know how I can achieve the same thing using the UUID method to check whether a PDF's embeddings already exist in Pinecone. Could you give me examples and show me what I need to change in my code?
Here is my code so far:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine") # Adjust the dimension as per your embeddings
index = pinecone.Index(index_name)
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
# st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
🤖
To implement the functionality you described, you can generate a unique identifier (UUID) for each PDF and use it as a key to store and retrieve the embeddings from Pinecone. Here's how you can modify your code:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
Here's how your modified code would look:
## Imports
# ...
## User Interface
# Side Bar
# ...
# Main Page
def main():
# ...
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
# ...
try:
# Set up Pinecone
# ...
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
# ...
if __name__ == '__main__':
main()
This way, you avoid generating embeddings for the same PDF multiple times, which can save a lot of computational resources. Please note that this approach assumes that the content of a PDF does not change if its name remains the same. If a PDF's content can change while its name remains the same, you might want to consider using a hash of the PDF's content instead of its name to generate the UUID.
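For example, a minimal sketch of that content-based variant, assuming the uploaded file object exposes its bytes via read() (as Streamlit's uploader does):

import hashlib
import uuid

# Hash the PDF's bytes, then derive a deterministic UUID from the digest;
# the resulting ID changes whenever the content changes
pdf_bytes = pdf.read()
pdf.seek(0)  # rewind so the file can still be read afterwards, e.g. by PdfReader
content_digest = hashlib.sha256(pdf_bytes).hexdigest()
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, content_digest)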
I hope this helps! Let me know if you have any other questions.
@dosu-bot
I updated the code with the changes above but got an error message I cannot solve. The error message is this:
UnboundLocalError: cannot access local variable 'docsearch' where it is not associated with a value
Traceback:
File "C:\Users\zy73\AppData\Roaming\Python\Python311\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "C:\Users\zy73\OneDrive\Desktop\AI Research\langchain\pdf.py", line 175, in <module>
Here is my code:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
🤖
From the error message you provided, it seems like the variable docsearch is not defined in the scope where it's being used; this is what causes the UnboundLocalError. To resolve it, you need to ensure that docsearch is assigned a value before it's used in your code.
The docsearch variable is not a part of the LangChain framework based on the repository information provided, so it's likely a variable specific to your application.
Regarding your question about checking if a PDF embedding already exists in Pinecone, you can use the CacheBackedEmbeddings class in LangChain. Its embed_documents method first checks the cache for the embeddings; if they are not found, it uses the underlying embedder to embed the documents and stores the results in the cache.
Here is an example of how to use it:
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
store = LocalFileStore('./my_cache')
underlying_embedder = OpenAIEmbeddings()
embedder = CacheBackedEmbeddings.from_bytes_store(
underlying_embedder, store, namespace=underlying_embedder.model
)
# Embedding is computed and cached
embeddings = embedder.embed_documents(["hello", "goodbye"])
# Embeddings are retrieved from the cache, no computation is done
embeddings = embedder.embed_documents(["hello", "goodbye"])
In this example, the embed_documents method is called twice. The first time, the embeddings are computed and cached. The second time, the embeddings are retrieved from the cache, and no computation is done. If the embeddings for the specified documents were not in the cache, they would be computed and stored in the cache.
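To connect this back to your app, here is a minimal sketch, assuming the chunks and index_name variables from your code, that hands the cached embedder to the Pinecone vector store so that re-indexing the same PDF reuses cached chunk embeddings instead of re-calling the OpenAI API:

from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.vectorstores import Pinecone

# Cache embeddings on local disk, namespaced by the embedding model
store = LocalFileStore('./my_cache')
underlying_embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embedder, store, namespace=underlying_embedder.model
)

# Chunk texts that were embedded before are served from the cache here
docsearch = Pinecone.from_documents(chunks, embedder, index_name=index_name)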
For the UUID method, you can use Python's built-in uuid module. Here is an example of how to generate a UUID:
import uuid
# Generate a random UUID
my_uuid = uuid.uuid4()
print(my_uuid)
You can use this UUID as a unique identifier for your PDF files. When you want to check if a PDF has already been stored, you can look up this UUID in your Pinecone vector store. Keep in mind that uuid4() is random, so it cannot be re-derived from the file later; to locate a previously stored PDF you either need to persist the PDF-to-UUID mapping yourself, or use a deterministic scheme such as uuid5 so the same file always maps to the same ID.
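For illustration, a small sketch contrasting the two; the file name 'report.pdf' is just a placeholder:

import uuid

# uuid4 is random: two calls for the same PDF give two different IDs
print(uuid.uuid4())
print(uuid.uuid4())

# uuid5 is deterministic: the same name always yields the same UUID,
# which makes it usable as a lookup key for a given PDF
print(uuid.uuid5(uuid.NAMESPACE_DNS, "report.pdf"))
print(uuid.uuid5(uuid.NAMESPACE_DNS, "report.pdf"))  # identical to the line above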
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
@dosu-bot
I have the UUID method built into my code. But I got the following error message:

File "C:\Users\zy73\OneDrive\Desktop\AI Research\langchain\pdf.py", line 133, in main
    if docsearch is not None and submit_button and prompt:
       ^^^^^^^^^
UnboundLocalError: cannot access local variable 'docsearch' where it is not associated with a value.
Please help me debug this issue and show me what should be fixed in my code:
Here is my entire code so far:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
🤖
The error you're encountering, "UnboundLocalError: cannot access local variable 'docsearch' where it is not associated with a value", is typically raised when a local variable is referenced before it has been assigned a value. In your code, the variable 'docsearch' is defined within the try block. If an exception occurs before 'docsearch' is defined, it will not be assigned a value, and you will encounter this error when you try to access it later.
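As a minimal illustration of this failure mode, here's a toy example unrelated to LangChain:

def demo():
    try:
        risky = 1 / 0          # raises before 'result' is ever assigned
        result = "computed"
    except ZeroDivisionError as e:
        print(f"An error occurred: {e}")
    print(result)              # UnboundLocalError: 'result' has no value

demo()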
To resolve this issue, you should initialize 'docsearch' before the try block to ensure it has a value in the event of an exception. Here's how you can modify your code:
docsearch = None # Initialize docsearch
try:
# Your existing code...
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
Now, 'docsearch' is guaranteed to have a value (even if it's None) when you try to access it later in your code.
Regarding the line if docsearch is not None and submit_button and prompt:, make sure that 'submit_button' and 'prompt' are likewise defined before this line is executed.
For more information about the UnboundLocalError, you can refer to the Python documentation. If you have any more questions or need further clarification, feel free to ask.
@dosu-bot
Now I am using a UUID in my code to determine whether embeddings have already been created for a certain PDF. How can I use this UUID parameter in my LLM calls so that the user can reference the entire PDF by its UUID if needed? Give me examples of where I can change my code.
Here is my code so far:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
docsearch = None
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
🤖
The error "UnboundLocalError: cannot access local variable 'docsearch' where it is not associated with a value" is likely due to the fact that the 'docsearch' variable is not defined in the scope where you are trying to use it.
In your code, 'docsearch' is defined within an if-else condition:
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
If the condition str(pdf_uuid) in index.list_items() is met, only the embeddings are loaded and 'docsearch' is never assigned, so you will encounter the UnboundLocalError when you try to use 'docsearch' later in your code. The same thing happens if any call inside the try block raises an exception before the assignment is reached.
To fix this, you should make sure 'docsearch' is defined on every path through the if-else condition. If the document is already in the index, you can load it using the from_existing_index method:
if str(pdf_uuid) in index.list_items():
    # Load the embeddings from Pinecone and search against the existing index
    embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
    docsearch = Pinecone.from_existing_index(index_name, embeddings)
else:
    # Generate the embeddings and store them in Pinecone
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
    index.upsert(items={str(pdf_uuid): embeddings})
    docsearch = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
This way, 'docsearch' will be defined whether the index already exists or not, and you should no longer encounter the UnboundLocalError.
Please note that the Pinecone.from_existing_index method is available in LangChain's Pinecone vector store; check the API reference of the version you have installed for the exact signature. Also be aware that index.list_items() is not a method of the Pinecone client's Index object, so that call will raise an exception inside your try block, and that exception is exactly what leaves 'docsearch' unassigned. A safer existence check is index.fetch(ids=[...]), which simply returns an empty result for unknown IDs.
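Putting the pieces together, here is a minimal sketch of the whole block, assuming the fetch-based existence check just described and per-chunk IDs derived from the PDF's UUID (the f"{pdf_uuid}-{i}" ID scheme is one possible convention, not something LangChain requires):

docsearch = None  # defined even if the try block fails
try:
    pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
    index = pinecone.Index(index_name)

    # An empty 'vectors' mapping means this PDF was never indexed
    existing = index.fetch(ids=[f"{pdf_uuid}-0"])
    if existing['vectors']:
        # Reuse the vectors already stored in the index
        docsearch = Pinecone.from_existing_index(index_name, embeddings)
    else:
        # Embed the chunks and store them under deterministic per-chunk IDs
        chunk_ids = [f"{pdf_uuid}-{i}" for i in range(len(chunks))]
        docsearch = Pinecone.from_documents(
            chunks, embeddings, index_name=index_name, ids=chunk_ids
        )
except Exception as e:
    print(f"An error occurred: {e}")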
For more information, you can refer to the LangChain Pinecone integration notebook and the LangChain Pinecone vector store source code.
@dosu-bot
The docsearch unbound-variable issue has been fixed. I am asking for further recommendations on using UUIDs. The whole reason I want to use UUIDs is to let the user point to a PDF document as a whole, rather than to individual embedding chunks, when passing context into the LLM. Give me some recommendations on how I can achieve this based on my current code:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
docsearch = None
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
Issue you'd like to raise.
Hi there,
I am learning how to use Pinecone properly with LangChain and OpenAI embeddings. I built an application that lets users upload PDFs and ask questions about them. The application uses Pinecone as the vector database for storing embeddings. However, I want to change my code so that whenever a user uploads a PDF, the application checks whether that PDF has already been stored as embeddings in Pinecone: if yes, it reuses the old embeddings; if no, it uploads new ones.
Here is my code:
# Imports
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

# Streamlit - user interface
from streamlit_extras.add_vertical_space import add_vertical_space

# Langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

# Pinecone
from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid

os.environ['OPENAI_API_KEY'] = apikey

# User Interface
# Side Bar
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    ## About
    ''')

# Main Page
def main():
    st.header("Zi's PDF Helper: Chat with PDF")

if __name__ == '__main__': main()
Give me some recommendations on what I should do or change.
Suggestion:
No response