gptscript-ai / knowledge

Knowledge for GPTScript
https://gptscript-ai.github.io/knowledge/
Apache License 2.0

Ingestion call for csv file fails with "No documents found" error and still returns 200. #6

Closed sangee2004 closed 6 months ago

sangee2004 commented 6 months ago

Steps to reproduce the problem:

  1. make run from https://github.com/gptscript-ai/knowledge to launch knowledge in server mode.
  2. Copy bin/knowledge to path (/usr/local/bin)
  3. Run the following script to ingest and search a csv file - https://github.com/gptscript-ai/csv-reader/blob/main/examples/Electric_Vehicle_Population_Data.csv
    
    # Set KNOW_SERVER_URL=standalone to use the client in standalone mode (i.e. without a server running)

    tools: Create Dataset, sys.find, Ingest, Retrieve

    Create a new Knowledge Base Dataset with ID ${id}
    Then, ingest ${filepath} into the dataset.
    Then, figure out ${query} from the previously ingested files.

    ---
    name: create dataset
    description: Create a new Dataset in the Knowledge Base
    args: id: ID of the Dataset

    #!knowledge client create-dataset ${id}

    ---
    name: ingest
    description: Ingest a file or all files from a directory into a Knowledge Base Dataset
    args: id: ID of the Dataset
    args: filepath: Path to the file or directory to be ingested

    #!knowledge client ingest ${id} ${filepath}

    ---
    name: retrieve
    description: Retrieve information from a Knowledge Base Dataset
    args: id: ID of the Dataset
    args: query: Query to be executed against the Knowledge Base Dataset

    #!knowledge client retrieve ${id} ${query}

  4. The ingestion call for the CSV file fails with a "No documents found" error but still returns 200. Retrieval calls to knowledge then fail with a 500 error because there are no documents to search.

gptscript --disable-cache --debug testknowledge.gpt --id 66664 --filepath /Users/sangeethahariharan/Downloads/Electric_Vehicle_Population_Data.csv --query "Tell me what are VIN number from vehicles are made By TESLA in AZ"

Last few lines from the debug log, showing the Retrieval tool call failure:

    { "completionID": "3", "id": "call_rYL6JtFkWek5R5u7k9FL8lSk", "level": "debug", "logger": "/pkg/monitor", "msg": "debug", "parentID": "1", "request": { "command": [ "/usr/local/bin/knowledge", "client", "retrieve", "66664", "Tell me what are VIN number from vehicles are made By TESLA in AZ" ], "input": "{\"id\":\"66664\",\"query\":\"Tell me what are VIN number from vehicles are made By TESLA in AZ\"}" }, "time": "2024-04-30T12:00:57-07:00", "toolID": "testknowledge.gpt:25" }
    2024/04/30 12:00:57 API request failed: 500 Internal Server Error
    { "level": "error", "logger": "/pkg/engine", "msg": "failed to run tool [retrieve] cmd [/usr/local/bin/knowledge client retrieve 66664 Tell me what are VIN number from vehicles are made By TESLA in AZ]: exit status 1", "time": "2024-04-30T12:00:57-07:00" }
    { "cached": false, "completionID": "3", "id": "call_rYL6JtFkWek5R5u7k9FL8lSk", "level": "debug", "logger": "/pkg/monitor", "msg": "debug", "parentID": "1", "response": { "err": {}, "output": "" }, "time": "2024-04-30T12:00:57-07:00", "toolID": "testknowledge.gpt:25" }
    { "err": "ERROR: 2024/04/30 12:00:57 API request failed: 500 Internal Server Error\n: exit status 1", "level": "debug", "logger": "/pkg/monitor", "msg": "Run stopped", "output": "", "runID": "1", "time": "2024-04-30T12:00:57-07:00" }
    2024/04/30 12:00:57 ERROR: 2024/04/30 12:00:57 API request failed: 500 Internal Server Error : exit status 1
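The 500 here comes from running a retrieval against a dataset that holds no documents. As a rough illustration only, the sketch below shows one way a retrieval path could short-circuit on an empty dataset and return an empty result instead of failing. The names `retrieve` and `naive_search` are hypothetical stand-ins, not functions from the knowledge tool, and the keyword matching is a placeholder for a real vector search.

```python
def naive_search(documents, query, n_results):
    """Hypothetical stand-in for a vector search that rejects n_results <= 0."""
    if n_results <= 0:
        # Mirrors the kind of validation that produces an error on empty datasets.
        raise ValueError("nResults must be > 0")
    terms = query.lower().split()
    # Rank documents by how many query terms they contain (most matches first).
    ranked = sorted(documents, key=lambda d: -sum(t in d.lower() for t in terms))
    return ranked[:n_results]


def retrieve(documents, query, top_k=5):
    """Cap top_k at the dataset size and skip the search when nothing is stored."""
    n_results = min(top_k, len(documents))
    if n_results == 0:
        return []  # empty dataset: nothing to search, so nothing to fail on
    return naive_search(documents, query, n_results)


empty_result = retrieve([], "vehicles made by TESLA in AZ")
hit = retrieve(["TESLA MODEL 3 AZ", "FORD WA"], "tesla")[0]
```

Whether an empty result or an explicit client error is the better response for an empty dataset is a design choice this sketch does not settle.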


**Knowledge logs:**

    2024/04/30 12:00:47 INFO Creating dataset id=66664
    [GIN] 2024/04/30 - 12:00:47 | 200 | 1.043792ms | ::1 | POST "/v1/datasets/create"
    2024/04/30 12:00:49 INFO Ingesting content into dataset dataset=66664
    2024/04/30 12:00:53 DEBUG Received ingest request content_size=74720880 metadata="&{Name:Electric_Vehicle_Population_Data.csv AbsolutePath:/Users/sangeethahariharan/Downloads/Electric_Vehicle_Population_Data.csv Size:42030494 ModifiedAt:2024-04-29 16:46:18.893437515 -0700 PDT}"
    2024/04/30 12:00:53 INFO IngestOpts opts="{Filename:0x14000996060 FileMetadata:0x140037f21c0 IsDuplicateFuncName: IsDuplicateFunc:}"
    2024/04/30 12:00:53 DEBUG Loading data type=.csv filename=Electric_Vehicle_Population_Data.csv
    2024/04/30 12:00:53 ERROR No documents found
    [GIN] 2024/04/30 - 12:00:53 | 200 | 3.797541083s | ::1 | POST "/v1/datasets/66664/ingest"
    2024/04/30 12:00:57 INFO Retrieving content from dataset dataset=66664
    2024/04/30 12:00:57 DEBUG Retrieving content from dataset dataset=66664 query="{Prompt:Tell me what are VIN number from vehicles are made By TESLA in AZ TopK:0x140086f80d8}"
    2024/04/30 12:00:57 DEBUG Reduced number of documents to search for numDocuments=0
    2024/04/30 12:00:57 ERROR Failed to retrieve documents error="nResults must be > 0"
    [GIN] 2024/04/30 - 12:00:57 | 500 | 160.57725ms | ::1 | POST "/v1/datasets/66664/retrieve"
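The mismatch reported here (an ERROR line followed by a 200 for the same request) can be spotted mechanically. The sketch below is one possible way to scan such server logs for it. The function name `find_masked_errors` and the regular expression are assumptions made for illustration, and the approach assumes requests are logged sequentially, so each `[GIN]` access line closes the request whose messages precede it.

```python
import re

# Matches access lines such as:
# [GIN] 2024/04/30 - 12:00:53 | 200 | 3.797541083s | ::1 | POST "/v1/datasets/66664/ingest"
GIN_LINE = re.compile(r'^\[GIN\] .*?\| (\d{3}) \|.*?\| (\w+)\s+"([^"]+)"$')


def find_masked_errors(lines):
    """Return (path, status) for requests that logged an ERROR but answered 2xx."""
    masked = []
    saw_error = False
    for raw in lines:
        line = raw.strip()
        match = GIN_LINE.match(line)
        if match:
            status, path = int(match.group(1)), match.group(3)
            if saw_error and 200 <= status < 300:
                masked.append((path, status))
            saw_error = False  # the access line ends the current request
        elif " ERROR " in line:
            saw_error = True
    return masked


sample = [
    '2024/04/30 12:00:53 ERROR No documents found',
    '[GIN] 2024/04/30 - 12:00:53 | 200 | 3.797541083s | ::1 | POST "/v1/datasets/66664/ingest"',
    '2024/04/30 12:00:57 ERROR Failed to retrieve documents error="nResults must be > 0"',
    '[GIN] 2024/04/30 - 12:00:57 | 500 | 160.57725ms | ::1 | POST "/v1/datasets/66664/retrieve"',
]
masked = find_masked_errors(sample)
print(masked)  # [('/v1/datasets/66664/ingest', 200)]
```

On the excerpt from this issue, only the ingest request is flagged; the retrieve request logged an error and also returned 500, so its status is consistent.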

sangee2004 commented 6 months ago

This issue no longer occurs. I can ingest a smaller CSV file successfully and retrieve data from it.

Ingesting CSV files seems to take a very long time, though. The CSV file used in this issue is 42 MB, and its ingestion had not completed even after 6 minutes:

/usr/local/bin/knowledge ingest -d  testnewcvs /Users/sangeethahariharan/Downloads/Electric_Vehicle_Population_Data.csv
2024/05/03 13:24:28 INFO IngestOpts opts="{Filename:0x1400709a000 FileMetadata:0x1400705e000 IsDuplicateFuncName: IsDuplicateFunc:0x105378920}"
^C2024/05/03 13:30:28 ERROR Failed to add documents error="couldn't add document '10a2c04c-7a9a-43fd-9c3b-be85d1e226b8': couldn't create embedding of document: couldn't send request: Post \"https://api.openai.com/v1/embeddings\": context canceled"

Even ingesting a relatively small file (36 kB) takes about 15 seconds:

/usr/local/bin/knowledge ingest -d  testnewcvs /Users/sangeethahariharan/Downloads/industry_sic.csv
2024/05/03 13:31:00 INFO IngestOpts opts="{Filename:0x1400e4da780 FileMetadata:0x1400c69c340 IsDuplicateFuncName: IsDuplicateFunc:0x101ce8920}"
2024/05/03 13:31:15 INFO Ingested document filename=industry_sic.csv count=731 absolute_path=/Users/sangeethahariharan/Downloads/industry_sic.csv
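For a sense of scale, the timings above can be turned into a rough estimate. The sketch below derives the elapsed time and document rate for the small file from its two log timestamps, then extrapolates to the 42 MB file. It assumes ingestion time scales linearly with file size, which this issue does not establish, and it treats "36 kB" as 36,000 bytes, so the result is only an order-of-magnitude guess.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def elapsed_seconds(start, end):
    """Seconds between two timestamps in the knowledge log format."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds()


# Small file: timestamps and document count taken from the log lines above.
small_seconds = elapsed_seconds("2024/05/03 13:31:00", "2024/05/03 13:31:15")
docs_per_second = 731 / small_seconds

# Assumption: "36 kB" is about 36,000 bytes; the large size is the Size field
# from the earlier ingest request metadata.
small_bytes = 36_000
large_bytes = 42_030_494

# Assumption: linear scaling with file size (not confirmed by the issue).
estimated_seconds = small_seconds * large_bytes / small_bytes
estimated_hours = estimated_seconds / 3600

print(f"{docs_per_second:.1f} documents/s, about {estimated_hours:.1f} hours for the large file")
```

Under these assumptions the large file would need hours rather than minutes, which would be consistent with the run that was still unfinished when it was interrupted after 6 minutes.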

industry_sic.csv