sangee2004 opened 1 month ago
Confirmed, will check it out in more detail
It's pretty slow because the current implementation of the CSV documentloader splits the file into one document (chunk) per row and then calls the embeddings API once per document. #25 should improve embeddings speed. Additionally, I'll put it on my list to create a new variant of the CSV documentloader that allows reading the whole CSV as a single document, or as a set of documents with a pre-defined max size.
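To illustrate the idea, here's a rough sketch (in Python, not the actual implementation) of what a max-size variant could do: group rows into documents up to a size cap instead of emitting one document per row, so the number of embeddings calls drops roughly by the average rows-per-chunk. The function name `load_csv_chunked` and the `max_chunk_chars` parameter are made up for the example:

```python
import csv
import io


def load_csv_chunked(text: str, max_chunk_chars: int = 2000) -> list[str]:
    """Group CSV rows into documents of at most max_chunk_chars characters,
    instead of emitting one document (and one embeddings call) per row."""
    docs: list[str] = []
    current: list[str] = []
    size = 0
    for row in csv.reader(io.StringIO(text)):
        line = ",".join(row)
        # Flush the current chunk when adding this row would exceed the cap.
        if current and size + len(line) > max_chunk_chars:
            docs.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 for the joining newline
    if current:
        docs.append("\n".join(current))
    return docs
```

With `max_chunk_chars=0` degenerating to per-row chunks and a very large value giving one document for the whole file, this covers both ends of the spectrum mentioned above.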
Also, since a CSV is usually a structured dataset, it might be better to use a tool like https://github.com/gptscript-ai/structured-data-querier. It feels like RAG might not be a good fit if there are millions of rows in the CSV file.
When unsupported file formats are ignored in the directory that gets ingested, can we provide the user with a message indicating which files were not ingested?
I agree with @StrongMonkey that the knowledge tool may not be the best tool for very structured data like CSV, at least when it comes to factual searches with specific answers. It may work, though, if it's only about finding a single row with some content, or for more exploratory searches.
@sangee2004 I think we have warning/debug logs indicating that files are being ignored. However, those are not shown to the LLM, so they may be hidden from the user. I'm not sure what the best approach is here, but to be transparent I guess we could log this information to stdout as well, so that the LLM can tell the user which files have been ignored :thinking: WDYT @StrongMonkey ?
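Something along these lines, sketched in Python for illustration (the actual tool isn't Python, and the `SUPPORTED` extension set and `collect_files` helper are hypothetical): ignored files get printed to stdout during collection, so the message ends up in the tool output the LLM sees.

```python
import sys
from pathlib import Path

# Hypothetical set of supported extensions for the sake of the example.
SUPPORTED = {".txt", ".md", ".pdf", ".csv"}


def collect_files(directory: str) -> list[Path]:
    """Walk the directory, returning ingestable files and printing a
    notice to stdout for every file skipped due to its extension."""
    ingested: list[Path] = []
    ignored: list[Path] = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() in SUPPORTED:
            ingested.append(path)
        else:
            ignored.append(path)
    for path in ignored:
        # Printed to stdout (not just debug logs) so the LLM can relay
        # the skipped files back to the user.
        print(f"ignored unsupported file: {path}", file=sys.stdout)
    return ingested
```

The key difference from the current behavior would just be the output channel: stdout instead of (or in addition to) the warning/debug log.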
Ingestion of CSV files takes a very long time. This CSV file https://github.com/gptscript-ai/csv-reader/blob/main/examples/Electric_Vehicle_Population_Data.csv, which is 42 MB, has not finished ingesting even after 6 minutes.
Even ingestion of a relatively small file, industry_sic.csv (36 kB), takes about 15 seconds.