run-llama / LlamaIndexTS

LlamaIndex is a data framework for your LLM applications
https://ts.llamaindex.ai
MIT License

[Bug]: PapaCSVReader concatRows=true fails for some .csv files #836

Open reidperyam opened 1 month ago

reidperyam commented 1 month ago

Bug Description

The following code fails to generate a VectorStoreIndex from a simple .csv file (see attached MOCK_DATA.csv).

import { SimpleDirectoryReader, PapaCSVReader } from "llamaindex";

export const DATA_DIR = "./data";

export async function getDocuments() {
  return await new SimpleDirectoryReader().loadData({
    directoryPath: DATA_DIR,
    fileExtToReader: {
      csv: new PapaCSVReader()
    }
  });
}

Version

0.2.10

Steps to Reproduce

Clone the following GitHub repo:

https://github.com/reidperyam/practical-star/tree/master

See README.md for the repro steps, which I will copy here:

First, populate .env with OPENAI_API_KEY

Install the dependencies:

npm install

Verify that all contents of the /cache directory are removed!

Second, generate the embeddings of the documents in the ./data directory:

npm run generate

EXPECTED RESULT:

generated cache/doc_store.json
generated cache/index_store.json
generated cache/vector_store.json

ACTUAL RESULT:

Error generating text embeddings: [see Relevant Logs/Tracebacks below]

Relevant Logs/Tracebacks

BadRequestError: 400 This model's maximum context length is 8192 tokens, however you requested 19956 tokens (19956 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
    at Function.generate (C:\Code\canoe\node_modules\openai\src\error.ts:70:14)
    at OpenAI.makeStatusError (C:\Code\canoe\node_modules\openai\src\core.ts:383:21)
    at OpenAI.makeRequest (C:\Code\canoe\node_modules\openai\src\core.ts:446:24)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async OpenAIEmbedding.getOpenAIEmbedding (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\OpenAIEmbedding.js:100:26)
    at async OpenAIEmbedding.getTextEmbeddings (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\OpenAIEmbedding.js:111:16)
    at async batchEmbeddings (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\types.js:61:32)
    at async OpenAIEmbedding.getTextEmbeddingsBatch (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\types.js:43:16)
    at async OpenAIEmbedding.transform (C:\Code\canoe\node_modules\llamaindex\dist\cjs\embeddings\types.js:47:28)
    at async VectorStoreIndex.getNodeEmbeddingResults (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:487:9)
    at async VectorStoreIndex.insertNodes (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:572:17)
    at async VectorStoreIndex.buildIndexFromNodes (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:497:9)
    at async VectorStoreIndex.init (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:445:13)
    at async VectorStoreIndex.fromDocuments (C:\Code\canoe\node_modules\llamaindex\dist\cjs\indices\vectorStore\index.js:523:16)
reidperyam commented 1 month ago

After further investigation, this seems to be an issue loading certain .csv files with PapaCSVReader.

For example, the .csv used in the LlamaIndexTS example documentation loads as expected with the code referenced above (this is the file: titanic_train.csv).

However, other valid .csv files that were tested are not parsed as expected (see the earlier MOCK_DATA.csv).

There is a workaround that allows the other .csv files to be loaded: the default constructor of PapaCSVReader sets concatRows=true; if this is set to false, the document store can be loaded as expected, but this creates a document for each line in the .csv.
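
For reference, a minimal sketch of that workaround, assuming the first constructor argument of PapaCSVReader is concatRows (inferred from the default-constructor behavior described above):

import { SimpleDirectoryReader, PapaCSVReader } from "llamaindex";

export const DATA_DIR = "./data";

export async function getDocuments() {
  return await new SimpleDirectoryReader().loadData({
    directoryPath: DATA_DIR,
    fileExtToReader: {
      // concatRows=false emits one Document per CSV row instead of one
      // huge Document per file, keeping each embedding request small
      csv: new PapaCSVReader(false)
    }
  });
}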

marcusschiesser commented 1 month ago

@reidperyam the error says that the text is too long to generate an embedding. My guess is that the CSV parser generates very long sentences; can you try:

import { Settings, SimpleNodeParser } from "llamaindex";

Settings.nodeParser = new SimpleNodeParser({
  chunkSize: 512,
  chunkOverlap: 20,
  splitLongSentences: true,
});
reidperyam commented 1 month ago

I added this code (screenshots attached), and the issue remains (error screenshot attached).

I pushed these changes up to the GitHub repro repo, @marcusschiesser.

himself65 commented 1 month ago

Sorry, but I cannot reproduce this bug:

~/Code/practical-star git:[master]
npm run generate

> nextjs@0.1.0 generate
> tsx app/generate.ts

Using 'openai' model provider
CHUNK_SIZE 512
CHUNK_OVERLAP 20
EMBEDDING_DIM 1024
Generating generateDatasource...
STORAGE_CACHE_DIR ./cache
Generating serviceContextFromDefaults...
Generating storageContextFromDefaults...
No valid data found at path: cache/index_store.json starting new store.
No valid data found at path: ./cache/vector_store.json starting new store.
getting docutments...
document ac67672e-d880-4808-8c22-97071ee2f947 loaded
Generating VectorStoreIndex.fromDocuments...
Storage context successfully generated in 0.006s.
Finished generating storage.
ghost commented 1 month ago

@himself65 But does it actually create a vector_store.json and index_store.json?

I could reproduce the token-limit bug, and it only created a doc_store.json. If you then run it again with a doc_store.json in /cache, it will show the output as successful, like in your snippet above, but it's still not doing anything.

himself65 commented 1 month ago

Will check this; I think there are some bugs in the node parser.

himself65 commented 1 month ago

I think we shouldn't modify the sentence splitter, since there is no grammar for a CSV result. So I think it's better to split the CSV into different documents.
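
For illustration, a minimal sketch of that approach. csvToDocuments is a hypothetical helper (not part of the library), and the batch size of 50 rows is an arbitrary assumption:

import fs from "node:fs/promises";
import { Document } from "llamaindex";

// Hypothetical helper: split one CSV file into several Documents, each
// holding a batch of rows, so that no single document exceeds the
// embedding model's 8192-token context window.
export async function csvToDocuments(path: string, rowsPerDoc = 50) {
  const text = await fs.readFile(path, "utf-8");
  const [header, ...rows] = text.split(/\r?\n/).filter((l) => l.trim() !== "");
  const docs: Document[] = [];
  for (let i = 0; i < rows.length; i += rowsPerDoc) {
    // repeat the header in every batch so column names stay in context
    const batch = [header, ...rows.slice(i, i + rowsPerDoc)].join("\n");
    docs.push(new Document({ text: batch }));
  }
  return docs;
}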

ghost commented 1 month ago

@himself65 Additional information.

If you triple the size of titanic_train.csv by just copy-pasting the content twice, it works as it should and creates an index_store.json and vector_store.json that contain the same results thrice, even though it has more rows and columns.

I could reproduce the error with multiple mock files from different file generators, but both titanic_train.csv and movie_reviews.csv worked even when increasing the size. So is it something about the underlying CSV structure?

ghost commented 1 month ago

@himself65 I isolated the issue down to the defaultregex used in TextSplitter.ts.

If we extend it to match one or more whitespace characters instead of just one, it fixes the issue:

const defaultregex = /[.?!][])'"`’”]*(?:\s+|$|)/g;

I don't really know enough about document structures to judge whether that change would break a lot of stuff or not. Alternatively, add a customregex param to SentenceSplitter?

ghost commented 1 month ago


the defaultregex above is missing the \ in [ ]; it should read:

const defaultregex = /[.?!][\])'"`’”]*(?:\s+|$|)/g;
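
For illustration, a standalone sketch applying the corrected pattern to CSV-like text (the sample string is an assumption, not data from the repro):

// corrected pattern: a sentence ender, optional closing punctuation, then
// one or more whitespace characters (or end of input)
const defaultregex = /[.?!][\])'"`’”]*(?:\s+|$|)/g;

// CSV-derived text often has runs of spaces after a period; with \s+ the
// whole run is consumed as part of the sentence boundary
const sample = '1,"Shipped on time."    2,"Arrived broken."';
console.log(sample.split(defaultregex)); // [ '1,"Shipped on time', '2,"Arrived broken', '' ]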

ghost commented 1 month ago

I think we shouldn't modify the sentence splitter since there is no grammar for a CSV result. So I think it's better to split the CSV to different documents

After further testing, the above "fix" only works with the mock file, but still struggles with other files. So true, there is no proper grammar in CSV, so it is probably quite hard to find the right regex. So either:

-> Trim the CSV to a pre-determined structure?
-> Split the document before generating nodes?