gptscript-ai / knowledge

knowledge for GPTScript

Ingesting CSV files takes a very long time. #9

Open sangee2004 opened 1 month ago

sangee2004 commented 1 month ago

Ingestion of CSV files takes a very long time. This CSV file https://github.com/gptscript-ai/csv-reader/blob/main/examples/Electric_Vehicle_Population_Data.csv, which is 42 MB, has not finished ingesting even after 6 minutes:

%/usr/local/bin/knowledge ingest -d  testnewcvs /Users/sangeethahariharan/Downloads/Electric_Vehicle_Population_Data.csv
2024/05/03 13:24:28 INFO IngestOpts opts="{Filename:0x1400709a000 FileMetadata:0x1400705e000 IsDuplicateFuncName: IsDuplicateFunc:0x105378920}"
^C2024/05/03 13:30:28 ERROR Failed to add documents error="couldn't add document '10a2c04c-7a9a-43fd-9c3b-be85d1e226b8': couldn't create embedding of document: couldn't send request: Post \"https://api.openai.com/v1/embeddings\": context canceled"

Even ingesting a relatively small file, industry_sic.csv (36 kB), takes about 15 seconds:

% /usr/local/bin/knowledge ingest -d  testnewcvs /Users/sangeethahariharan/Downloads/industry_sic.csv                    
2024/05/03 13:31:00 INFO IngestOpts opts="{Filename:0x1400e4da780 FileMetadata:0x1400c69c340 IsDuplicateFuncName: IsDuplicateFunc:0x101ce8920}"
2024/05/03 13:31:15 INFO Ingested document filename=industry_sic.csv count=731 absolute_path=/Users/sangeethahariharan/Downloads/industry_sic.csv
iwilltry42 commented 1 month ago

Confirmed, will check it out in more detail

iwilltry42 commented 1 week ago

It's pretty slow because the current implementation of the CSV documentloader splits the file into one document (chunk) per row and then calls the embeddings API once per document. #25 should improve embedding speed. Additionally, I'll put it on my list to create a new variant of the CSV documentloader that allows reading the whole CSV as a single document, or as a set of documents with a pre-defined max size.
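Roughly, the max-size variant could look something like this. This is just a minimal Go sketch of the idea, not the actual documentloader code; the file name and the 4096-byte chunk size are example values only:

```go
// Minimal sketch: read a CSV and group rows into documents of at most
// maxChunkSize bytes instead of emitting one document per row, so the
// embeddings API is called far less often.
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"strings"
)

func loadCSVChunked(path string, maxChunkSize int) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	r := csv.NewReader(f)
	var docs []string
	var buf strings.Builder

	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		row := strings.Join(record, ", ")
		// Start a new document once adding this row would exceed the limit.
		if buf.Len() > 0 && buf.Len()+len(row)+1 > maxChunkSize {
			docs = append(docs, buf.String())
			buf.Reset()
		}
		buf.WriteString(row)
		buf.WriteString("\n")
	}
	if buf.Len() > 0 {
		docs = append(docs, buf.String())
	}
	return docs, nil
}

func main() {
	// Example values only.
	docs, err := loadCSVChunked("Electric_Vehicle_Population_Data.csv", 4096)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("produced %d documents instead of one per row\n", len(docs))
}
```

With row batching like this, a 42 MB file would produce on the order of thousands of embedding requests instead of one per row, which is where most of the time currently goes.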

StrongMonkey commented 1 week ago

Also, since a CSV is usually a structured dataset, it might be better to use a tool like https://github.com/gptscript-ai/structured-data-querier. It feels like RAG might not be a good fit when there are millions of rows in a CSV file.

sangee2004 commented 1 week ago

When files with unsupported formats are ignored in a directory that gets ingested, can we provide the user with a message indicating which files were not ingested?

iwilltry42 commented 1 week ago

I agree with @StrongMonkey that the knowledge tool may not be the best fit for very structured data like CSV, at least when it comes to factual searches with specific answers. It may still work if the goal is only to find a single row with certain content, or for more exploratory searches.

@sangee2004 I think we have warning/debug logs indicating that files are being ignored. However, those are not shown to the LLM, so they may be hidden from the user. I'm not sure what the best approach is here, but for transparency I guess we could log this information to stdout as well, so that the LLM can tell the user which files were ignored :thinking: WDYT @StrongMonkey ?
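Something along these lines (a rough sketch of the idea, not existing code; the helper name and the supported-extension set are made up):

```go
// Rough sketch (hypothetical helper): keep the warning log for skipped
// files, but also print them to stdout so they show up in the tool output
// that the LLM (and therefore the user) can see.
package main

import (
	"fmt"
	"log/slog"
	"path/filepath"
)

// Stand-in for the real set of supported file extensions.
var supported = map[string]bool{".txt": true, ".md": true, ".pdf": true, ".csv": true}

func reportIgnored(files []string) {
	for _, f := range files {
		if !supported[filepath.Ext(f)] {
			slog.Warn("ignoring unsupported file", "file", f) // existing-style log
			fmt.Printf("ignored unsupported file: %s\n", f)   // visible in tool output
		}
	}
}

func main() {
	reportIgnored([]string{"notes.md", "photo.heic", "data.csv"})
}
```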