DanielDaCosta opened 3 months ago
Here is also a screenshot of the API call:
I was able to replicate the above issue and located the cause of the error in `/app/features/quizzify/core.py`:

```python
self.vectorstore = self.vectorstore_class.from_documents(documents, self.embedding_model)
```
Found a discussion of the same issue: https://github.com/chroma-core/chroma/issues/405. Couldn't find a solid workaround.
Will keep looking into this! The link @TekuriSaiAkhil referenced says that the collection is actually empty on creation. So possibly the PDF file consumed for loading into the Chroma vector store produced no documents, and trying to store empty values resulted in the "list index out of range"?
I verified the `documents` variable to ensure it isn't empty before passing it to `self.vectorstore_class.from_documents(documents, self.embedding_model)`. After printing its contents, I confirmed that the list is not empty, so the issue doesn't seem to be related to `documents` being empty. Here is the code with the print statement:
```python
def create_vectorstore(self, documents: List[Document]):
    if self.verbose:
        logger.info(f"Creating vectorstore from {len(documents)} documents")
    logger.info("Before self.vectorstore_class.from_documents")
    print(documents)
    self.vectorstore = self.vectorstore_class.from_documents(documents, self.embedding_model)
    logger.info("AFTER self.vectorstore_class.from_documents")
    if self.verbose:
        logger.info("Vectorstore created")
    return self.vectorstore
```
Contents of the `documents` variable: document_output.txt
```python
[Document(page_content='ImageNet Classification with Deep Convolutional\nNeural Networks\nAlex Krizhevsky\nUniversity of Toronto\nkriz@cs.utoronto.caIlya Sutskever\nUniversity of Toronto\nilya@cs.utoronto.caGeoffrey E. Hinton\nUniversity of Toronto\nhinton@cs.utoronto.ca\nAbstract\nWe trained a large, deep convolutional neural network to classify the 1.2 million\nhigh-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif-\nferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5%\nand 17.0% which is considerably better than the previous state-of-the-art. The\nneural network, which has 60 million parameters and 650,000 neurons, consists\nof five convolutional layers, some of which are followed by max-pooling layers,\nand three fully-connected layers with a final 1000-way softmax. To make train-\ning faster, we used non-saturating neurons and a very efficient GPU implemen-\ntation of the convolution operation. To reduce overfitting in the fully-connected', metadata={'source': 'pdf', 'page_number': 1}),
...
Document(page_content='22(2):511–538, 2010.\n9', metadata={'source': 'pdf', 'page_number': 9})]
```
I found that it works with 25 documents or fewer, so I use: `self.vectorstore = self.vectorstore_class.from_documents(documents[:25], self.embedding_model)`
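Building on that observation, one possible workaround is to seed the store with the first batch and then add the remaining documents incrementally. This is a hypothetical sketch, not a confirmed fix: `create_vectorstore_batched` is my own name, `BATCH_SIZE = 25` is only the empirical limit reported above, and it assumes the vectorstore class exposes the LangChain-style `from_documents` and `add_documents` methods.

```python
BATCH_SIZE = 25  # empirical limit reported in this thread, not a documented Chroma constant

def create_vectorstore_batched(vectorstore_class, documents, embedding_model):
    # Seed the store with the first batch...
    vectorstore = vectorstore_class.from_documents(documents[:BATCH_SIZE], embedding_model)
    # ...then add the rest in chunks of BATCH_SIZE instead of one large call.
    for start in range(BATCH_SIZE, len(documents), BATCH_SIZE):
        vectorstore.add_documents(documents[start:start + BATCH_SIZE])
    return vectorstore
```

If the error really is size-related, this should ingest the full document set; if it still fails on a particular batch, that narrows the problem down to specific pages.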
@Dim314159 Thanks for that. It seems the issue is related to the number of pages, which might be a library limitation.
Hey @DanielDaCosta, based on this issue https://github.com/chroma-core/chroma/issues/405, I assumed it was related to the number of images and vector graphics (tables, etc.) present in the document, but it turns out that isn't the issue. Below is the code for a multi-file loader using PyMuPDF; loading a PDF with it reproduces the same issue.
```python
import logging
from io import BytesIO
from typing import List, Tuple

import pymupdf
from langchain_core.documents import Document  # adjust import to your LangChain version

logger = logging.getLogger(__name__)

class BytesFilePDFLoader:
    def __init__(self, files: List[Tuple[BytesIO, str]]):
        self.files = files

    def load(self) -> List[Document]:
        documents = []
        for file, file_type in self.files:
            logger.debug(file_type)
            if file_type.lower() == "pdf":
                logger.info(file)
                pdf_reader = pymupdf.open(stream=file, filetype=file_type)
                for page_number in range(pdf_reader.page_count):
                    page = pdf_reader.load_page(page_id=page_number)
                    metadata = {"source": file_type, "page_number": page_number + 1}
                    documents.append(Document(page_content=page.get_text(), metadata=metadata))
            else:
                raise ValueError(f"Unsupported file type: {file_type}")
        return documents
```
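One thing worth noting about the loader above: it appends a `Document` for every page, even when `get_text()` returns an empty string (image-only or blank pages). Since the Chroma issue linked earlier suggests empty inputs may be involved, it might be worth filtering such pages out before ingestion. A hypothetical sketch, not a confirmed fix; `filter_empty_pages` is my own name, and it assumes LangChain-style `Document` objects with a `page_content` attribute:

```python
import logging

logger = logging.getLogger(__name__)

def filter_empty_pages(documents):
    # Keep only pages whose extracted text is non-empty after stripping whitespace.
    kept = [doc for doc in documents if doc.page_content.strip()]
    dropped = len(documents) - len(kept)
    if dropped:
        logger.warning("Dropped %d page(s) with no extractable text", dropped)
    return kept
```

Running the failing PDF through this before `from_documents` would at least rule empty pages in or out as the cause.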
Hi @akashvenus
Have you tried your implementation with a smaller file (fewer than 20 pages)? Does it also return an error?
@DanielDaCosta Small files don't seem to be an issue.
I noticed that files I could previously open without this error now throw it for some reason. Is there something cache-related we have to clear?
Hi @mikhailocampo
I'm trying to locally debug and reproduce the Kai application, but I've encountered an issue with the Quizzify feature. While everything runs successfully in Docker, I receive the following 500 error when attempting to use Quizzify:
Inputs
Output Logs:
Analysis
From my debugging, I discovered that the error occurs during the call `db = pipeline(files)` in the file `/app/features/quizzify/core.py`.

Solution

I'm still trying to pinpoint the exact cause of the error. Let me know if you have an idea.