snexus / llm-search

Querying local documents, powered by LLM

CSV data parsing #103

Closed · mohammad-yousuf closed this 3 months ago

mohammad-yousuf commented 3 months ago

Hi @snexus,

Is it possible to work with CSV/SQL data? Since you mention that the supported unstructured formats include CSV as well, I am trying to parse a CSV, but I am getting errors:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/langchain_community/vectorstores/chroma.py", line 297, in add_texts
    self._collection.upsert(
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py", line 477, in upsert
    ) = self._validate_embedding_set(
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py", line 554, in _validate_embedding_set
    validate_metadatas(maybe_cast_one_to_many_metadata(metadatas))
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/types.py", line 310, in validate_metadatas
    validate_metadata(metadata)
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/types.py", line 278, in validate_metadata
    raise ValueError(
ValueError: Expected metadata value to be a str, int, float or bool, got None which is a <class 'NoneType'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/llmsearch", line 8, in <module>
    sys.exit(main_cli())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llmsearch/cli.py", line 44, in generate_index
    create_embeddings(config, vs)
  File "/usr/local/lib/python3.10/dist-packages/llmsearch/embeddings.py", line 80, in create_embeddings
    vs.create_index_from_documents(all_docs=all_docs)
  File "/usr/local/lib/python3.10/dist-packages/llmsearch/chroma.py", line 66, in create_index_from_documents
    vectordb = Chroma.from_documents(
  File "/usr/local/lib/python3.10/dist-packages/langchain_community/vectorstores/chroma.py", line 778, in from_documents
    return cls.from_texts(
  File "/usr/local/lib/python3.10/dist-packages/langchain_community/vectorstores/chroma.py", line 736, in from_texts
    chroma_collection.add_texts(
  File "/usr/local/lib/python3.10/dist-packages/langchain_community/vectorstores/chroma.py", line 309, in add_texts
    raise ValueError(e.args[0] + "\n\n" + msg)
ValueError: Expected metadata value to be a str, int, float or bool, got None which is a <class 'NoneType'>

Try filtering complex metadata from the document using langchain_community.vectorstores.utils.filter_complex_metadata.
```
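For reference, the workaround that the error message points to looks roughly like this (a minimal sketch on plain LangChain `Document` objects, not llm-search's own code; the content and metadata below are made up):

```python
# Sketch of the workaround suggested by the traceback.
# filter_complex_metadata() keeps only str, int, float and bool metadata values,
# so the None value that Chroma rejects is dropped before indexing.
from langchain_core.documents import Document
from langchain_community.vectorstores.utils import filter_complex_metadata

docs = [
    Document(
        page_content="row 1 of the CSV",
        metadata={"source": "data.csv", "row": 1, "extra": None},  # None triggers the ValueError
    )
]

clean_docs = filter_complex_metadata(docs)
# clean_docs[0].metadata no longer contains the None-valued "extra" key,
# so Chroma.from_documents(clean_docs, embeddings) will not raise.
```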

snexus commented 3 months ago

Thanks for reporting, fixed in https://github.com/snexus/llm-search/pull/104. To be honest, I don't think you will get great results with CSVs - it is not the best format for RAG.

mohammad-yousuf commented 3 months ago

Hi @snexus. Thank you for the fix. I converted the CSV data to a DOCX table and used a custom parser. The data is converted to JSON correctly, as I can see after the re-ranking step. After that, though, it doesn't work well for closely related data points.

Any idea how I should approach this?
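To make the setup above concrete, the per-row JSON conversion I am describing is roughly this (illustrative only; the file name and columns are made up, and this is not the actual custom parser):

```python
# Rough sketch of converting each CSV row into its own JSON record,
# so that every row becomes an independent chunk of text for retrieval.
import csv
import json

records = []
with open("products.csv", newline="") as f:
    for row in csv.DictReader(f):
        # One JSON object per CSV row.
        records.append(json.dumps(row, ensure_ascii=False))

print(records[0])  # e.g. {"name": "...", "price": "...", ...}
```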

snexus commented 3 months ago

The most likely reason is that the LLM can't interpret it correctly - it is a limitation of the LLM rather than of the RAG system as a whole. LLMs are not very good with tabular data. Maybe a specialised LLM exists for that.

Another (more complicated) approach is to store the data in a database, provide the schema and other metadata to the LLM, and let it generate SQL that produces the necessary aggregations, etc.
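A very rough sketch of that direction (not part of llm-search; `ask_llm` is a placeholder for whatever model call you use, and the database and question are made up):

```python
# Sketch of the "let the LLM write SQL" approach: give the model the schema,
# ask it for a query, and run the query so aggregations happen in the database.
import sqlite3


def ask_llm(prompt: str) -> str:
    # Placeholder - plug in your actual LLM client here.
    raise NotImplementedError("plug in your LLM call here")


conn = sqlite3.connect("sales.db")

# Collect the CREATE TABLE statements so the model knows the schema.
schema = "\n".join(
    row[0]
    for row in conn.execute("SELECT sql FROM sqlite_master WHERE type='table'")
    if row[0]
)

question = "What was the total revenue per region last month?"
prompt = (
    "You are given this SQLite schema:\n"
    f"{schema}\n\n"
    f"Write a single SQL query that answers: {question}\n"
    "Return only the SQL."
)

sql = ask_llm(prompt)
rows = conn.execute(sql).fetchall()  # the database does the aggregation, not the LLM
```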