radicalxdev / kai-ai-backend

This is the Kai Teaching Assistant AI repo.
MIT License

ERROR - Encountered error in executing tool: Error in executor: list index out of range #35

Open DanielDaCosta opened 3 weeks ago

DanielDaCosta commented 3 weeks ago

Hi @mikhailocampo

I'm trying to debug and reproduce the Kai application locally, but I've run into an issue with the Quizzify feature. Although the app runs successfully in Docker, I receive the following 500 error when attempting to use Quizzify:

Inputs
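
For reference, this is roughly the request I'm sending (the payload shape is reconstructed from the tool metadata in the logs below; the exact schema, port, and field values are my assumptions, not confirmed against the API):

    # Hypothetical reproduction of the failing request. Field names are inferred
    # from the quizzify metadata.json logged below; values are illustrative.
    import requests

    payload = {
        "tool_id": 0,
        "inputs": [
            {"name": "topic", "value": "Deep Learning"},
            {"name": "num_questions", "value": 5},
            {"name": "files", "value": [
                {"url": "https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf"},
            ]},
        ],
    }
    response = requests.post("http://localhost:8000/submit-tool", json=payload)
    print(response.status_code)  # 500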

Output Logs:

2024-06-12 18:16:53,650 - api.tool_utilities - DEBUG - Loading tool metadata for tool_id: 0
2024-06-12 18:16:53,650 - api.tool_utilities - DEBUG - Checking metadata file at: /app/features/quizzify/metadata.json
2024-06-12 18:16:53,652 - api.tool_utilities - DEBUG - Loaded metadata: {'inputs': [{'label': 'Topic', 'name': 'topic', 'type': 'text'}, {'label': 'Number of Questions', 'name': 'num_questions', 'type': 'number'}, {'label': 'Upload PDF files', 'name': 'files', 'type': 'file'}]}
2024-06-12 18:16:54,839 - services.logger - DEBUG - Files: [ToolFile(filePath=None, url='https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf', filename=None)]
2024-06-12 18:16:55,490 - features.quizzify.tools - INFO - Completed pipeline compilation
2024-06-12 18:16:55,490 - features.quizzify.tools - INFO - Executing pipeline
2024-06-12 18:16:55,490 - features.quizzify.tools - INFO - Start of Pipeline received: 1 documents of type <class 'services.tool_registry.ToolFile'>
2024-06-12 18:16:55,491 - features.quizzify.tools - INFO - Loading 1 files
2024-06-12 18:16:55,491 - features.quizzify.tools - INFO - Loader type used: <class 'features.quizzify.tools.URLLoader'>
2024-06-12 18:16:55,492 - features.quizzify.tools - DEBUG - Loader is a: <class 'features.quizzify.tools.URLLoader'>
2024-06-12 18:16:55,893 - features.quizzify.tools - INFO - Successfully loaded file from https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
2024-06-12 18:16:55,896 - features.quizzify.tools - DEBUG - pdf
2024-06-12 18:16:56,075 - features.quizzify.tools - INFO - Loaded 9 documents
2024-06-12 18:16:56,075 - features.quizzify.tools - INFO - Splitting 9 documents
2024-06-12 18:16:56,075 - features.quizzify.tools - INFO - Splitter type used: <class 'langchain_text_splitters.character.RecursiveCharacterTextSplitter'>
2024-06-12 18:16:56,076 - features.quizzify.tools - INFO - Split 9 documents into 43 chunks
2024-06-12 18:16:56,076 - features.quizzify.tools - INFO - Creating vectorstore from 43 documents
2024-06-12 18:16:57,350 - services.logger - ERROR - Error in executor: list index out of range
2024-06-12 18:16:57,350 - api.tool_utilities - ERROR - Encountered error in executing tool: Error in executor: list index out of range
2024-06-12 18:16:57,350 - api.router - ERROR - HTTPException: 500: Error in executor: list index out of range

INFO:     192.168.65.1:30964 - "POST /submit-tool HTTP/1.1" 500 Internal Server Error

Analysis

From my debugging, I discovered that the error occurs during the call db = pipeline(files) in the file /app/features/quizzify/core.py.
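
Since the executor swallows the traceback and only logs the exception message, one way to narrow this down is to temporarily surface the full stack around the failing call (a minimal sketch; pipeline and files as in core.py):

    # Temporary instrumentation: print the full traceback so the exact frame
    # raising the IndexError is visible, instead of just the message.
    import traceback

    try:
        db = pipeline(files)
    except Exception:
        traceback.print_exc()
        raise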

Solution

I'm still trying to pinpoint the exact cause of the error. Let me know if you have any ideas.

DanielDaCosta commented 3 weeks ago

Here is also the screenshot from the API call:

[Screenshot of the API call, 2024-06-12 at 11:30 AM]

TekuriSaiAkhil commented 3 weeks ago

I was able to replicate the above issue and located the cause of the error in /app/features/quizzify/core.py:

    self.vectorstore = self.vectorstore_class.from_documents(documents, self.embedding_model)

Found a discussion of the same issue at https://github.com/chroma-core/chroma/issues/405, but couldn't find a solid workaround.

mikhailocampo commented 3 weeks ago

Will keep looking into this! The issue @TekuriSaiAkhil referenced suggests the collection is actually empty on creation. So possibly the PDF file consumed for loading into the Chroma vector store yielded no documents, and the attempt to store empty values resulted in the list index out of range?
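
If that hypothesis holds, a cheap guard would be to drop empty pages before indexing (an untested sketch against the create_vectorstore step):

    # Hypothetical guard: filter out documents whose extracted text is empty
    # or whitespace-only before handing them to the vector store.
    non_empty = [d for d in documents if d.page_content and d.page_content.strip()]
    if not non_empty:
        raise ValueError("No text could be extracted from the uploaded PDF(s)")
    self.vectorstore = self.vectorstore_class.from_documents(non_empty, self.embedding_model)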

DanielDaCosta commented 2 weeks ago

I verified the documents variable before passing it to self.vectorstore_class.from_documents(documents, self.embedding_model). After printing its contents, I confirmed the list is not empty, so the issue doesn't seem to be caused by an empty documents list.

Here is the code with the print statement and the documents variable:

    def create_vectorstore(self, documents: List[Document]):
        if self.verbose:
            logger.info(f"Creating vectorstore from {len(documents)} documents")

        # Debug: confirm documents is non-empty right before the failing call
        logger.info("Before self.vectorstore_class.from_documents")
        print(documents)

        self.vectorstore = self.vectorstore_class.from_documents(documents, self.embedding_model)
        logger.info("AFTER self.vectorstore_class.from_documents")

        if self.verbose:
            logger.info("Vectorstore created")
        return self.vectorstore

Contents of the documents variable (attached as document_output.txt; excerpt below):

[Document(page_content='ImageNet Classification with Deep Convolutional\nNeural Networks\nAlex Krizhevsky\nUniversity of Toronto\nkriz@cs.utoronto.caIlya Sutskever\nUniversity of Toronto\nilya@cs.utoronto.caGeoffrey E. Hinton\nUniversity of Toronto\nhinton@cs.utoronto.ca\nAbstract\nWe trained a large, deep convolutional neural network to classify the 1.2 million\nhigh-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif-\nferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5%\nand 17.0% which is considerably better than the previous state-of-the-art. The\nneural network, which has 60 million parameters and 650,000 neurons, consists\nof five convolutional layers, some of which are followed by max-pooling layers,\nand three fully-connected layers with a final 1000-way softmax. To make train-\ning faster, we used non-saturating neurons and a very efficient GPU implemen-\ntation of the convolution operation. To reduce overfitting in the fully-connected', metadata={'source': 'pdf', 'page_number': 1}),
...
Document(page_content='22(2):511–538, 2010.\n9', metadata={'source': 'pdf', 'page_number': 9})]
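
Another thing worth ruling out is the embedding call itself: if the model returns fewer vectors than texts, Chroma's indexing could fail with exactly this error. A quick check, assuming a LangChain-style embedding model:

    # Sanity check: the number of embeddings returned should match the number
    # of chunk texts passed in; a mismatch would explain the IndexError.
    texts = [d.page_content for d in documents]
    vectors = self.embedding_model.embed_documents(texts)
    print(len(texts), len(vectors))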

Dim314159 commented 2 weeks ago

I found that it only works with 25 documents or fewer, so I use:

    self.vectorstore = self.vectorstore_class.from_documents(documents[:25], self.embedding_model)
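
If the cap is real, batching avoids silently dropping everything past the 25th chunk (a sketch assuming a LangChain-style vectorstore with add_documents):

    # Hypothetical batched variant of the workaround: index all chunks,
    # 25 at a time, instead of truncating to the first 25.
    BATCH = 25
    self.vectorstore = self.vectorstore_class.from_documents(documents[:BATCH], self.embedding_model)
    for i in range(BATCH, len(documents), BATCH):
        self.vectorstore.add_documents(documents[i:i + BATCH])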

DanielDaCosta commented 2 weeks ago

@Dim314159 Thanks for that. It seems the issue is related to the number of pages, which might be a library limitation.

akashvenus commented 1 week ago

Hey @DanielDaCosta, based on this issue https://github.com/chroma-core/chroma/issues/405, I assumed it was related to the number of images and vector graphics (tables, etc.) present in the document, but it turns out that isn't the issue. Below is my PyMuPDF-based multi-file loader; loading a PDF with it reproduces the same error.

# Imports added for completeness; the Document import path and the logger
# setup are my assumptions (the repo presumably uses services.logger).
import logging
from io import BytesIO
from typing import List, Tuple

import pymupdf  # PyMuPDF
from langchain_core.documents import Document

logger = logging.getLogger(__name__)


class BytesFilePDFLoader:
    def __init__(self, files: List[Tuple[BytesIO, str]]):
        self.files = files

    def load(self) -> List[Document]:
        documents = []

        for file, file_type in self.files:
            logger.debug(file_type)
            if file_type.lower() == "pdf":
                logger.info(file)
                # Open the PDF from the in-memory byte stream
                pdf_reader = pymupdf.open(stream=file, filetype=file_type)
                # One Document per page, with the page number in the metadata
                for pages in range(pdf_reader.page_count):
                    page = pdf_reader.load_page(page_id=pages)
                    metadata = {"source": file_type, "page_number": pages + 1}
                    doc = Document(page_content=page.get_text(), metadata=metadata)
                    documents.append(doc)
            else:
                raise ValueError(f"Unsupported file type: {file_type}")

        return documents
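
For reference, this is how I exercise the loader (the requests fetch is just for illustration; the URL is the paper from the original report):

    # Hypothetical usage: fetch a PDF over HTTP and run it through the loader.
    import requests

    resp = requests.get("https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf")
    docs = BytesFilePDFLoader([(BytesIO(resp.content), "pdf")]).load()
    print(f"Loaded {len(docs)} pages")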

DanielDaCosta commented 1 week ago

Hi @akashvenus

Have you tried your implementation with a smaller file (fewer than 20 pages)? Does it also return an error?

akashvenus commented 1 week ago

@DanielDaCosta Small files don't seem to be an issue.

akashvenus commented 1 week ago

I noticed that files I could previously open without this error now throw it for some reason. Is there something cache-related that we have to clear?
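
If a persisted Chroma collection is being reused between runs, clearing it is one thing to try (a sketch; the persist path and client type are assumptions, since I don't know how Kai configures Chroma):

    # Hypothetical reset of a persisted Chroma store; only relevant if the app
    # uses a PersistentClient rather than a fresh in-memory client per request.
    import chromadb
    from chromadb.config import Settings

    client = chromadb.PersistentClient(path="./chroma", settings=Settings(allow_reset=True))
    client.reset()  # wipes all collections in that persist directory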