mit-submit / A2rchi

An AI Augmented Research Chat Intelligence for MIT's subMIT project in the physics department
MIT License

HuggingFace embeddings unexpected behavior #134

Closed ludomori99 closed 1 year ago

ludomori99 commented 1 year ago

Two different error logs showed up. I could not work out what differs between the circumstances in which each one arises.

Error 1:

name='dev_collection_with_HuggingFaceEmbeddings' id=UUID('a7b56b9e-7cf9-4cbc-b529-4c09c11fdc6f') metadata=None
Vectorstore needs to be updated
Files to remove: []
Files to add: {'391446804850.html': '/root/data/websites/391446804850.html', '311908320295.html': '/root/data/websites/311908320295.html', '299931406278.html': '/root/data/websites/299931406278.html', '948537108847.html': '/root/data/websites/948537108847.html', '213679292299.html': '/root/data/websites/213679292299.html', '198006956929.html': '/root/data/websites/198006956929.html', '247039305341.html': '/root/data/websites/247039305341.html', '136901283757.html': '/root/data/websites/136901283757.html', '649811655156.html': '/root/data/websites/649811655156.html', '178898586411.html': '/root/data/websites/178898586411.html', '305701533622.html': '/root/data/websites/305701533622.html', '243114892764.html': '/root/data/websites/243114892764.html', '181001073821.html': '/root/data/websites/181001073821.html', '147197139266.html': '/root/data/websites/147197139266.html', '149446187468.html': '/root/data/websites/149446187468.html', '191440252181.html': '/root/data/websites/191440252181.html', '100971305273.html': '/root/data/websites/100971305273.html'}
Ids: ['391446804850159290159290', '391446804850159440159440', '391446804850493563493563', '391446804850333332333332', '391446804850874070874070']
Created a chunk of size 1383, which is longer than the specified 1000
Created a chunk of size 1229, which is longer than the specified 1000
Created a chunk of size 1266, which is longer than the specified 1000
Ids: ['311908320295287408287408', '311908320295166536166536', '311908320295300922300922', '311908320295462437462437', '311908320295235417235417', '311908320295272625272625', '311908320295284394284394', '311908320295223642223642', '311908320295150568150568', '311908320295337206337206', '311908320295185710185710', '311908320295149215149215', '311908320295374140374140', '311908320295352770352770', '311908320295384027384027', '311908320295161701161701', '311908320295186736186736', '311908320295164042164042', '311908320295270457270457', '311908320295804874804874', '311908320295343918343918', '311908320295154588154588', '311908320295289953289953']
Traceback (most recent call last):
  File "/root/A2rchi/A2rchi/bin/service_mailbox.py", line 19, in <module>
    cleo = cleo.Cleo('Cleo_Helpdesk')
  File "/usr/local/lib/python3.10/site-packages/A2rchi/interfaces/cleo.py", line 76, in __init__
    self.ai_wrapper = CleoAIWrapper()
  File "/usr/local/lib/python3.10/site-packages/A2rchi/interfaces/cleo.py", line 27, in __init__
    self.data_manager.update_vectorstore()
  File "/usr/local/lib/python3.10/site-packages/A2rchi/utils/data_manager.py", line 155, in update_vectorstore
    collection = self._add_to_vectorstore(collection, files_to_add, sources)
  File "/usr/local/lib/python3.10/site-packages/A2rchi/utils/data_manager.py", line 242, in _add_to_vectorstore
    collection.add(embeddings=embeddings, ids=ids, documents=chunks, metadatas=metadatas)
  File "/usr/local/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 99, in add
    self._client._add(ids, self.id, embeddings, metadatas, documents)
  File "/usr/local/lib/python3.10/site-packages/chromadb/api/fastapi.py", line 340, in _add
    raise_chroma_error(resp)
  File "/usr/local/lib/python3.10/site-packages/chromadb/api/fastapi.py", line 465, in raise_chroma_error
    raise chroma_error
chromadb.errors.InvalidCollectionException: Collection a7b56b9e-7cf9-4cbc-b529-4c09c11fdc6f does not exist.

Error 2:

Traceback (most recent call last):
  File "/root/A2rchi/A2rchi/bin/service_chat.py", line 37, in <module>
    app = FlaskAppWrapper(Flask(
  File "/usr/local/lib/python3.10/site-packages/A2rchi/interfaces/chat_app/app.py", line 200, in __init__
    self.chat = ChatWrapper()
  File "/usr/local/lib/python3.10/site-packages/A2rchi/interfaces/chat_app/app.py", line 40, in __init__
    self.data_manager.update_vectorstore()
  File "/usr/local/lib/python3.10/site-packages/A2rchi/utils/data_manager.py", line 154, in update_vectorstore
    collection = self._add_to_vectorstore(collection, files_to_add, sources)
  File "/usr/local/lib/python3.10/site-packages/A2rchi/utils/data_manager.py", line 208, in _add_to_vectorstore
    docs = loader.load()
  File "/usr/local/lib/python3.10/site-packages/langchain/document_loaders/html_bs.py", line 48, in load
    with open(self.file_path, "r", encoding=self.open_encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/root/data/websites/100971305273.html'
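One way to narrow down Error 2 would be to check which of the files queued for addition actually exist on the data volume before any loader opens them. The snippet below is only a sketch under that assumption: `split_existing` is a hypothetical helper, not part of A2rchi, and it takes a `{filename: path}` mapping shaped like the "Files to add" dictionary printed in the log.

```python
import os


def split_existing(files_to_add):
    """Partition a {filename: path} mapping into (present, missing).

    Hypothetical helper, not part of A2rchi: a path counts as present
    only if it points to a regular file on disk right now.
    """
    present, missing = {}, {}
    for name, path in files_to_add.items():
        target = present if os.path.isfile(path) else missing
        target[name] = path
    return present, missing
```

Logging the `missing` mapping just before the files are loaded would show whether the scraped HTML files were never written or were removed between the directory listing and the load.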

It is hard to inspect the container during execution to see what is happening on the data volume, because the Docker container is restarted every few seconds as a consequence of the failure. My impression is that we somehow hardcoded the OpenAI embeddings in a PR from the past couple of weeks.
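To get a window for inspecting the data volume despite the restart loop, the service entry point could be wrapped so that a startup failure is logged and the process then waits before exiting. This is a minimal sketch, not existing A2rchi behavior: `run_with_debug_hold` and the `A2RCHI_DEBUG_HOLD` environment variable are made-up names for illustration.

```python
import os
import time
import traceback


def run_with_debug_hold(start, hold_seconds=3600, sleep=time.sleep, env=os.environ):
    """Run start(); on failure, optionally keep the process alive for a while.

    A2RCHI_DEBUG_HOLD is a hypothetical variable, assumed here only for
    illustration. When it is set to "1", the process sleeps after printing
    the traceback, so the container stays up long enough to be inspected.
    The original exception is re-raised afterwards in every case.
    """
    try:
        return start()
    except Exception:
        traceback.print_exc()
        if env.get("A2RCHI_DEBUG_HOLD") == "1":
            sleep(hold_seconds)
        raise
```

With the variable unset, the wrapper behaves like a plain call to `start()`, so the normal restart behavior is unchanged.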