Closed ZinanYang1995 closed 9 months ago
🤖
Hello again, @ZinanYang1995! It's great to see you diving deeper into the world of Pinecone and LangChain. I'm here to assist you, as always. Let's take a look at your code.
Based on your code and the requirements you've mentioned, you can modify your code to check if a PDF has already been stored as embeddings in Pinecone. Here's a general approach:
Generate a unique identifier for each PDF. This could be a hash of the file's contents, or a UUID if the same PDF can be uploaded multiple times and should be treated as a different document each time.
Before adding the PDF's embeddings to Pinecone, check if an embedding with the same identifier already exists in Pinecone.
If the embedding already exists, use the existing embedding. If not, generate a new embedding and add it to Pinecone.
Here's a rough example of how you might implement this:
import hashlib

from langchain.embeddings.openai import OpenAIEmbeddings

# Generate a unique identifier for the PDF from its raw bytes
pdf_content = pdf.read()
pdf_hash = hashlib.sha256(pdf_content).hexdigest()

# Check if the PDF's embedding already exists in Pinecone; fetch() returns
# an empty 'vectors' mapping when the ID is not present in the index
existing = index.fetch(ids=[pdf_hash])
if existing['vectors']:
    # The embedding already exists, so reuse it
    embedding = existing['vectors'][pdf_hash]['values']
else:
    # The embedding doesn't exist, so generate it and add it to Pinecone.
    # Embeddings are computed from text, not raw bytes, so pdf_text stands
    # for text extracted from the PDF (e.g. with PyPDF2).
    embedding = OpenAIEmbeddings(model="text-embedding-ada-002").embed_query(pdf_text)
    index.upsert(vectors=[(pdf_hash, embedding)])
Please note that this is a simplified example and may not work as-is. You'll need to adapt it to fit your specific use case and handle potential errors.
Also, please note that this approach assumes that Pinecone allows you to check whether a vector with a specific ID exists and to retrieve it by that ID. The client's index.fetch(ids=[...]) call can do both, returning an empty result for unknown IDs, but you should still consult Pinecone's documentation or API reference to confirm the exact behavior of the client version you're running.
As for handling PDF files in LangChain, it uses various loaders to load and process PDF documents. The specific way embeddings are handled depends on the loader and the implementation of its methods. For example, you're using the UnstructuredPDFLoader in your code to load the PDF file and return Document objects, which can then be used for further processing.
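For reference, here is a minimal sketch of that loader's usage; the path 'example.pdf' is a placeholder, since UnstructuredPDFLoader reads from a file path rather than an in-memory upload:

from langchain.document_loaders import UnstructuredPDFLoader

# Load the PDF into a list of LangChain Document objects
loader = UnstructuredPDFLoader("example.pdf")
documents = loader.load()

# Each Document carries the extracted text plus metadata
print(len(documents), documents[0].page_content[:100])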
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@dosu-bot
Thank you for your response. I want to know how I can achieve the same thing using the UUID method to check whether a PDF's embeddings already exist in Pinecone. Could you give me examples and show me what I need to change in my code?
Here is my code so far:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine") # Adjust the dimension as per your embeddings
index = pinecone.Index(index_name)
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
# st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
🤖
To implement the functionality you described, you can generate a unique identifier (UUID) for each PDF and use it as a key to store and retrieve the embeddings from Pinecone. Here's how you can modify your code:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
Here's how your modified code would look:
## Imports
# ...
## User Interface
# Side Bar
# ...
# Main Page
def main():
# ...
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
# ...
try:
# Set up Pinecone
# ...
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
# ...
if __name__ == '__main__':
main()
This way, you avoid generating embeddings for the same PDF multiple times, which can save a lot of computational resources. Please note that this approach assumes that the content of a PDF does not change if its name remains the same. If a PDF's content can change while its name remains the same, you might want to consider using a hash of the PDF's content instead of its name to generate the UUID.
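For example, a minimal sketch of that content-based variant, assuming the uploaded file object exposes its bytes via read() (as Streamlit's uploader does):

import hashlib
import uuid

# Hash the PDF's bytes, then derive a deterministic UUID from the digest;
# the resulting ID changes whenever the content changes
pdf_bytes = pdf.read()
pdf.seek(0)  # rewind so the file can still be read afterwards, e.g. by PdfReader
content_digest = hashlib.sha256(pdf_bytes).hexdigest()
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, content_digest)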
I hope this helps! Let me know if you have any other questions.
@dosu-bot
I updated the code with the changes above but got an error message I cannot solve. The error message is this:
UnboundLocalError: cannot access local variable 'docsearch' where it is not associated with a value
Traceback:
File "C:\Users\zy73\AppData\Roaming\Python\Python311\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "C:\Users\zy73\OneDrive\Desktop\AI Research\langchain\pdf.py", line 175, in <module>
Here is my code:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
🤖
From the error message you provided, it seems like the variable docsearch is not defined in the scope where it's being used; this is what causes the UnboundLocalError. To resolve it, you need to ensure that docsearch is assigned a value before it's used in your code.
The docsearch variable is not a part of the LangChain framework based on the repository information provided, so it's likely a variable specific to your application.
Regarding your question about checking if a PDF embedding already exists in Pinecone, you can use the CacheBackedEmbeddings class in LangChain. Its embed_documents method first checks the cache for the embeddings; if they are not found, it uses the underlying embedder to embed the documents and stores the results in the cache.
Here is an example of how to use it:
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
store = LocalFileStore('./my_cache')
underlying_embedder = OpenAIEmbeddings()
embedder = CacheBackedEmbeddings.from_bytes_store(
underlying_embedder, store, namespace=underlying_embedder.model
)
# Embedding is computed and cached
embeddings = embedder.embed_documents(["hello", "goodbye"])
# Embeddings are retrieved from the cache, no computation is done
embeddings = embedder.embed_documents(["hello", "goodbye"])
In this example, the embed_documents method is called twice. The first time, the embeddings are computed and cached. The second time, the embeddings are retrieved from the cache, and no computation is done. If the embeddings for the specified documents were not in the cache, they would be computed and stored in the cache.
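To connect this back to your app, here is a minimal sketch, assuming the chunks and index_name variables from your code, that hands the cached embedder to the Pinecone vector store so that re-indexing the same PDF reuses cached chunk embeddings instead of re-calling the OpenAI API:

from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.vectorstores import Pinecone

# Cache embeddings on local disk, namespaced by the embedding model
store = LocalFileStore('./my_cache')
underlying_embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embedder, store, namespace=underlying_embedder.model
)

# Chunk texts that were embedded before are served from the cache here
docsearch = Pinecone.from_documents(chunks, embedder, index_name=index_name)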
For the UUID method, you can use Python's built-in uuid module. Here is an example of how to generate a UUID:
import uuid
# Generate a random UUID
my_uuid = uuid.uuid4()
print(my_uuid)
You can use this UUID as a unique identifier for your PDF files. When you want to check if a PDF has already been stored, you can look up this UUID in your Pinecone vector store. Keep in mind that uuid4() is random, so it cannot be re-derived from the file later; to locate a previously stored PDF you either need to persist the PDF-to-UUID mapping yourself, or use a deterministic scheme such as uuid5 so the same file always maps to the same ID.
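For illustration, a small sketch contrasting the two; the file name 'report.pdf' is just a placeholder:

import uuid

# uuid4 is random: two calls for the same PDF give two different IDs
print(uuid.uuid4())
print(uuid.uuid4())

# uuid5 is deterministic: the same name always yields the same UUID,
# which makes it usable as a lookup key for a given PDF
print(uuid.uuid5(uuid.NAMESPACE_DNS, "report.pdf"))
print(uuid.uuid5(uuid.NAMESPACE_DNS, "report.pdf"))  # identical to the line above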
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
@dosu-bot
I have the UUID method built into my code. But I got the following error message:

File "C:\Users\zy73\OneDrive\Desktop\AI Research\langchain\pdf.py", line 133, in main
    if docsearch is not None and submit_button and prompt:
       ^^^^^^^^^
UnboundLocalError: cannot access local variable 'docsearch' where it is not associated with a value.
Please help me debug this issue and show me what should be fixed in my code:
Here is my entire code so far:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
🤖
The error you're encountering, "UnboundLocalError: cannot access local variable 'docsearch' where it is not associated with a value", is typically raised when a local variable is referenced before it has been assigned a value. In your code, the variable 'docsearch' is defined within the try block. If an exception occurs before 'docsearch' is defined, it will not be assigned a value, and you will encounter this error when you try to access it later.
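As a minimal illustration of this failure mode, here's a toy example unrelated to LangChain:

def demo():
    try:
        risky = 1 / 0          # raises before 'result' is ever assigned
        result = "computed"
    except ZeroDivisionError as e:
        print(f"An error occurred: {e}")
    print(result)              # UnboundLocalError: 'result' has no value

demo()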
To resolve this issue, you should initialize 'docsearch' before the try block to ensure it has a value in the event of an exception. Here's how you can modify your code:
docsearch = None # Initialize docsearch
try:
# Your existing code...
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
Now, 'docsearch' is guaranteed to have a value (even if it's None) when you try to access it later in your code.
Regarding the line if docsearch is not None and submit_button and prompt:, make sure that 'submit_button' and 'prompt' are likewise defined before this line is executed.
For more information about the UnboundLocalError, you can refer to the Python documentation. If you have any more questions or need further clarification, feel free to ask.
@dosu-bot
Now I am using a UUID in my code to determine whether embeddings have already been created for a certain PDF. How can I use this UUID parameter in my LLM calls so that the user can reference the entire PDF by its UUID if needed? Give me examples of where I can change my code.
Here is my code so far:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
docsearch = None
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
🤖
The error "UnboundLocalError: cannot access local variable 'docsearch' where it is not associated with a value" is likely due to the fact that the 'docsearch' variable is not defined in the scope where you are trying to use it.
In your code, 'docsearch' is defined within an if-else condition:
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
If the condition str(pdf_uuid) in index.list_items() is met, only the embeddings are loaded and 'docsearch' is never assigned, so you will encounter the UnboundLocalError when you try to use 'docsearch' later in your code. The same thing happens if any call inside the try block raises an exception before the assignment is reached.
To fix this, you should make sure 'docsearch' is defined on every path through the if-else condition. If the document is already in the index, you can load it using the from_existing_index method:
if str(pdf_uuid) in index.list_items():
    # Load the embeddings from Pinecone and search against the existing index
    embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
    docsearch = Pinecone.from_existing_index(index_name, embeddings)
else:
    # Generate the embeddings and store them in Pinecone
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
    index.upsert(items={str(pdf_uuid): embeddings})
    docsearch = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
This way, 'docsearch' will be defined whether the index already exists or not, and you should no longer encounter the UnboundLocalError.
Please note that the Pinecone.from_existing_index method is available in LangChain's Pinecone vector store; check the API reference of the version you have installed for the exact signature. Also be aware that index.list_items() is not a method of the Pinecone client's Index object, so that call will raise an exception inside your try block, and that exception is exactly what leaves 'docsearch' unassigned. A safer existence check is index.fetch(ids=[...]), which simply returns an empty result for unknown IDs.
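Putting the pieces together, here is a minimal sketch of the whole block, assuming the fetch-based existence check just described and per-chunk IDs derived from the PDF's UUID (the f"{pdf_uuid}-{i}" ID scheme is one possible convention, not something LangChain requires):

docsearch = None  # defined even if the try block fails
try:
    pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
    index = pinecone.Index(index_name)

    # An empty 'vectors' mapping means this PDF was never indexed
    existing = index.fetch(ids=[f"{pdf_uuid}-0"])
    if existing['vectors']:
        # Reuse the vectors already stored in the index
        docsearch = Pinecone.from_existing_index(index_name, embeddings)
    else:
        # Embed the chunks and store them under deterministic per-chunk IDs
        chunk_ids = [f"{pdf_uuid}-{i}" for i in range(len(chunks))]
        docsearch = Pinecone.from_documents(
            chunks, embeddings, index_name=index_name, ids=chunk_ids
        )
except Exception as e:
    print(f"An error occurred: {e}")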
For more information, you can refer to the LangChain Pinecone integration notebook and the LangChain Pinecone vector store source code.
@dosu-bot
The docsearch unbound-variable issue has been fixed. I am asking for further recommendations on using UUIDs. The whole reason I want to use UUIDs is to let the user point to a PDF document as a whole, rather than to individual embedding chunks, when passing context into the LLM. Give me some recommendations on how I can achieve this based on my current code:
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

from streamlit_extras.add_vertical_space import add_vertical_space

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid
os.environ['OPENAI_API_KEY'] = apikey
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by Zi')
def main():
    st.header("Zi's PDF Helper: Chat with PDF")
# upload a PDF file
pdf = st.file_uploader("Please upload your PDF here", type='pdf')
# st.write(pdf)
# read PDF
if pdf is not None:
# Generate a UUID for the PDF
pdf_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, pdf.name)
pdf_reader = PdfReader(pdf)
# data = pdf_reader.load()
# split document into chunks
# also can use text split: good for PDFs that do not contain charts and visuals
sections = []
for page in pdf_reader.pages:
# Split the page text by paragraphs (assuming two newlines indicate a new paragraph)
page_sections = page.extract_text().split('\n\n')
sections.extend(page_sections)
chunks = [Document(page_content=section) for section in sections]
# st.write(chunks)
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size = 500,
# chunk_overlap = 20
# )
# chunks = text_splitter.split_documents(data)
## embeddings
# Set up embeddings
embeddings = OpenAIEmbeddings( model="text-embedding-ada-002")
docsearch = None
try:
# Set up Pinecone
pinecone.init(api_key=pinecone_api_key, environment='gcp-starter')
index_name = 'langchainresearch'
if index_name not in pinecone.list_indexes():
pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)
# Check if the embeddings already exist in Pinecone
if str(pdf_uuid) in index.list_items():
# Load the embeddings from Pinecone
embeddings = index.fetch(ids=[str(pdf_uuid)])[str(pdf_uuid)]
else:
# Generate the embeddings and store them in Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002").embed_documents(chunks)
index.upsert(items={str(pdf_uuid): embeddings})
docsearch = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
except Exception as e:
print(f"An error occurred: {e}")
# Create or Load Chat History
if pdf:
# generate chat history
chat_history_file = f"{pdf.name}_chat_history.pkl"
# load history if exist
if os.path.exists(chat_history_file):
with open(chat_history_file, "rb") as f:
chat_history = pickle.load(f)
else:
chat_history = []
# Initialize chat_history in session_state if not present
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
# Check if 'prompt' is in session state
if 'last_input' not in st.session_state:
st.session_state.last_input = ''
# User Input
current_prompt = st.session_state.get('user_input', '')
prompt_placeholder = st.empty()
prompt = prompt_placeholder.text_area("Ask questions about your PDF:", value=current_prompt, placeholder="Send a message", key="user_input")
submit_button = st.button("Submit")
if docsearch is not None and submit_button and prompt:
# Update the last input in session state
st.session_state.last_input = prompt
docs = docsearch.similarity_search(query=prompt, k=3)
#llm = OpenAI(temperature=0.9, model_name='gpt-3.5-turbo')
chat = ChatOpenAI(model='gpt-4', temperature=0.7, max_tokens=3000)
message = [
SystemMessage(content="You are a helpful assistant"),
HumanMessage(content=prompt)
]
chain = load_qa_chain(llm=chat, chain_type="stuff")
with get_openai_callback() as cb:
response = chain.run(input_documents=docs, question=message)
print(cb)
# st.write(response)
st.write(docs)
# Process the response using AIMessage schema
# ai_message = AIMessage(content="AI message content")
# ai_message.content = response.generations[0].message.content
# Add to chat history
st.session_state.chat_history.append((prompt, response))
# Save chat history
with open(chat_history_file, "wb") as f:
pickle.dump(st.session_state.chat_history, f)
# Clear the input after processing
prompt_placeholder.text_area("Ask questions about your PDF:", value='', placeholder="Send a message", key="pdf_prompt")
# Display the entire chat
chat_content = ""
for user_msg, bot_resp in st.session_state.chat_history:
chat_content += f"<div style='background-color: #222222; color: white; padding: 10px;'>**You:** {user_msg}</div>"
chat_content += f"<div style='background-color: #333333; color: white; padding: 10px;'>**Zi GPT:** {bot_resp}</div>"
st.markdown(chat_content, unsafe_allow_html=True)
if __name__ == '__main__': main()
Issue you'd like to raise.
Hi there,
I am learning how to use Pinecone properly with LangChain and OpenAI embeddings. I built an application that lets users upload PDFs and ask questions about them. The application uses Pinecone as the vector database for storing embeddings. However, I want to change my code so that whenever a user uploads a PDF, the application checks whether that PDF has already been stored as embeddings in Pinecone: if yes, it reuses the old embeddings; if no, it uploads new ones.
Here is my code:
# Imports
import streamlit as st
import os
from apikey import apikey
import pickle
from PyPDF2 import PdfReader

# Streamlit - user interface
from streamlit_extras.add_vertical_space import add_vertical_space

# Langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.schema import Document
from langchain.document_loaders import UnstructuredPDFLoader

# Pinecone
from langchain.vectorstores import Pinecone
import pinecone
from apikey import pinecone_api_key
import uuid

os.environ['OPENAI_API_KEY'] = apikey

# User Interface
# Side Bar
with st.sidebar:
    st.title('💬 Zi-GPT Version 2.0')
    st.markdown('''
    ## About
    ''')

# Main Page
def main():
    st.header("Zi's PDF Helper: Chat with PDF")

if __name__ == '__main__': main()
Give me some recommendations on what I should do or change.
Suggestion:
No response