techleadhd / chatgpt-retrieval


Vector/Element Issues #16

Closed southparkkids closed 12 months ago

southparkkids commented 12 months ago

Hello, I keep having issues when trying to run this.

I am trying to train the model on 23 manuals that I have converted to txt files.

```
Traceback (most recent call last):
  File "c:\Users\rschmidt\Desktop\ChatGPT Retrieval\chatgpt.py", line 33, in <module>
    index = VectorstoreIndexCreator(vectorstore_kwargs={"persist_directory":"persist"}).from_loaders([loader])
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\indexes\vectorstore.py", line 73, in from_loaders
    return self.from_documents(docs)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\indexes\vectorstore.py", line 78, in from_documents
    vectorstore = self.vectorstore_cls.from_documents(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\vectorstores\chroma.py", line 462, in from_documents
    return cls.from_texts(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\vectorstores\chroma.py", line 430, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\vectorstores\chroma.py", line 150, in add_texts
    self._collection.upsert(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\chromadb\api\models\Collection.py", line 299, in upsert
    self._client._upsert(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\chromadb\api\local.py", line 318, in _upsert
    self._add(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\chromadb\api\local.py", line 260, in _add
    self._db.add_incremental(collection_id, added_uuids, embeddings)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\chromadb\db\clickhouse.py", line 639, in add_incremental
    index.add(ids, embeddings)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\chromadb\db\index\hnswlib.py", line 177, in add
    self._index.add_items(embeddings, labels)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4634,) + inhomogeneous part.
```
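The `ValueError` at the bottom of that traceback is raised by NumPy (1.24.0 per the `pip freeze` later in this thread), not by the repo itself: hnswlib's `add_items` stacks all the embedding vectors into one 2-D float array, which fails if any vector has a different length. One plausible cause is a chunk that produced a malformed or empty embedding. A minimal sketch that reproduces the same error message:

```python
import numpy as np

# Two "embeddings" of unequal length cannot be stacked into a 2-D float
# array; NumPy raises the same "inhomogeneous shape" ValueError seen above.
ragged = [[0.1, 0.2, 0.3], [0.4, 0.5]]

try:
    np.asarray(ragged, dtype=np.float32)
except ValueError as exc:
    print(type(exc).__name__, "-", exc)
```

If one of the 4634 chunks embedded to a different length than the rest, this is exactly the failure mode you would see.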

southparkkids commented 12 months ago

If I reuse an Index I get the error below as well.

```
Reusing index...

Prompt: Please tell me a little about CSD (Cloud Suite Distribution)
Traceback (most recent call last):
  File "c:\Users\rschmidt\Desktop\ChatGPT Retrieval\chatgpt.py", line 48, in <module>
    result = chain({"question": query, "chat_history": chat_history})
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\base.py", line 166, in __call__
    raise e
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\base.py", line 160, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\conversational_retrieval\base.py", line 121, in _call
    docs = self._get_docs(new_question, inputs, run_manager=_run_manager)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\chains\conversational_retrieval\base.py", line 224, in _get_docs
    docs = self.retriever.get_relevant_documents(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\schema\retriever.py", line 139, in get_relevant_documents
    raise e
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\schema\retriever.py", line 132, in get_relevant_documents
    result = self._get_relevant_documents(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\vectorstores\base.py", line 413, in _get_relevant_documents
    docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\vectorstores\chroma.py", line 172, in similarity_search
    docs_and_scores = self.similarity_search_with_score(query, k, filter=filter)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\vectorstores\chroma.py", line 220, in similarity_search_with_score
    results = self.__query_collection(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\utils.py", line 53, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\vectorstores\chroma.py", line 119, in __query_collection
    return self._collection.query(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\chromadb\api\models\Collection.py", line 223, in query
    return self._client._query(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\chromadb\api\local.py", line 457, in _query
    uuids, distances = self._db.get_nearest_neighbors(
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\chromadb\db\clickhouse.py", line 613, in get_nearest_neighbors
    uuids, distances = index.get_nearest_neighbors(embeddings, n_results, ids)
  File "C:\Users\rschmidt\AppData\Local\Programs\Python\Python310\lib\site-packages\chromadb\db\index\hnswlib.py", line 296, in get_nearest_neighbors
    database_labels, distances = self._index.knn_query(
RuntimeError: Cannot return the results in a contigious 2D array. Probably ef or M is too small
```
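This `RuntimeError` comes from hnswlib's `knn_query`: it is raised when the index cannot return as many neighbours as were requested, typically because the persisted index holds fewer vectors than `k` (plausible here, since the earlier crash likely left a partially written `persist` directory) or because the search parameter `ef` is smaller than `k`. One hedged workaround sketch, assuming the chain is built roughly as in `chatgpt.py` (the variable names and the `k` value here are illustrative, not taken from the repo), is to request fewer documents via `search_kwargs`:

```python
# Sketch: ask Chroma/hnswlib for fewer neighbours so knn_query can satisfy k.
retriever = index.vectorstore.as_retriever(search_kwargs={"k": 1})
chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=retriever,
)
```

The other obvious fix is to delete the stale `persist` directory and rebuild the index, since an index left behind by the earlier `ValueError` crash would also trigger this.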

southparkkids commented 12 months ago

From `pip freeze`:

```
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.7.0
argilla==1.12.0
async-timeout==4.0.2
attrs==23.1.0
backoff==2.2.1
certifi==2023.5.7
cffi==1.15.1
chardet==5.1.0
charset-normalizer==3.1.0
chromadb==0.3.26
click==8.1.3
clickhouse-connect==0.6.4
colorama==0.4.6
coloredlogs==15.0.1
commonmark==0.9.1
cryptography==41.0.1
dataclasses-json==0.5.9
Deprecated==1.2.14
duckdb==0.8.1
et-xmlfile==1.1.0
exceptiongroup==1.1.2
fastapi==0.99.1
filetype==1.2.0
flatbuffers==23.5.26
frozenlist==1.3.3
greenlet==2.0.2
h11==0.14.0
hnswlib==0.7.0
httpcore==0.16.3
httptools==0.5.0
httpx==0.23.3
humanfriendly==10.0
idna==3.4
importlib-metadata==6.7.0
joblib==1.3.1
langchain==0.0.222
langchainplus-sdk==0.0.19
lxml==4.9.2
lz4==4.3.2
Markdown==3.4.3
marshmallow==3.19.0
marshmallow-enum==1.5.1
monotonic==1.6
mpmath==1.3.0
msg-parser==1.2.0
multidict==6.0.4
mypy-extensions==1.0.0
nltk==3.8.1
numexpr==2.8.4
numpy==1.24.0
olefile==0.46
onnxruntime==1.15.1
openai==0.27.8
openapi-schema-pydantic==1.2.4
openpyxl==3.1.2
overrides==7.3.1
packaging==23.1
pandas==1.5.3
pdf2image==1.16.3
pdfminer.six==20221105
Pillow==10.0.0
posthog==3.0.1
protobuf==4.23.3
pulsar-client==3.2.0
pycparser==2.21
pydantic==1.10.10
Pygments==2.15.1
pypandoc==1.11
pyreadline3==3.4.1
python-dateutil==2.8.2
python-docx==0.8.11
python-dotenv==1.0.0
python-magic==0.4.27
python-pptx==0.6.21
pytz==2023.3
PyYAML==6.0
regex==2023.6.3
requests==2.31.0
rfc3986==1.5.0
rich==13.0.1
six==1.16.0
sniffio==1.3.0
SQLAlchemy==2.0.17
starlette==0.27.0
sympy==1.12
tabulate==0.9.0
tenacity==8.2.2
tiktoken==0.4.0
tokenizers==0.13.3
tqdm==4.65.0
typer==0.7.0
typing-inspect==0.9.0
typing_extensions==4.7.1
unstructured==0.7.12
urllib3==2.0.3
uvicorn==0.22.0
watchfiles==0.19.0
websockets==11.0.3
wrapt==1.14.1
xlrd==2.0.1
XlsxWriter==3.1.2
yarl==1.9.2
zipp==3.15.0
zstandard==0.21.0
```

southparkkids commented 12 months ago

I believe it may simply have been too much data. Here are all of the files I am trying to train on:

```
sxe_2023.x_csdso__en-us.txt
sxe_2023.x_icolh__en-us.txt
sxe_2023.x_glolh__en-us.txt
sxe_2023.x_kpolh__en-us.txt
sxe_2023.x_oeolh__en-us.txt
sxe_2023.x_pdolh__en-us.txt
sxe_2023.x_poolh__en-us.txt
sxe_2023.x_saolh__en-us.txt
sxe_2023.x_sxe_industry_content__en-us.txt
sxe_2023.x_sxgdprag__en-us.txt
sxe_2023.x_sxmmug__en-us.txt
sxe_2023.x_sxug__en-us.txt
sxe_2023.x_twlhandug__en-us.txt
sxe_2023.x_twlmgrug__en-us.txt
sxe_2023.x_twlpickug__en-us.txt
sxe_2023.x_twlrecvug__en-us.txt
sxe_2023.x_twlsag__en-us.txt
sxe_2023.x_vaolh__en-us.txt
sxe_2023.x_wmmug__en-us.txt
sxe_2023.x_wtolh__en-us.txt
sxe_2023.x_apolh__en-us.txt
sxe_2023.x_arolh__en-us.txt
sxe_2023.x_edolh__en-us.txt
```

southparkkids commented 12 months ago

I am going to try pinning down which files work and which don't. I believe it is either too much data, or a character (or string of characters) that isn't playing well.

I found that it was able to ingest all of the text files after I reconverted them as plaintext, instead of the format I had exported previously. I hope this helps someone.
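For anyone hitting the same wall, one way to pin down a bad export without re-ingesting everything is to check which files are not clean UTF-8 plain text. This is a sketch; the `find_non_utf8` helper and the `data` folder name are illustrative, not part of the repo:

```python
from pathlib import Path


def find_non_utf8(folder: str) -> list[str]:
    """Return the names of .txt files in folder that fail to decode as UTF-8."""
    bad = []
    for path in Path(folder).glob("*.txt"):
        try:
            path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad


if __name__ == "__main__":
    print(find_non_utf8("data"))
```

Any file this flags is a candidate for reconversion to plaintext before loading it into the index.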