reidperyam opened this issue 6 months ago
After further investigation, this seems to be an issue loading certain .csv files with PapaCSVReader.
For example, the .csv used in the LlamaIndexTS example documentation loads as expected with the referenced code above (this is the file: titanic_train.csv).
However, other valid .csv files that were tested are not parsed as expected (see the previous MOCK_DATA.csv).
There is a workaround that allows other .csv files to be loaded: the default constructor of PapaCSVReader sets concatRows=true; if this is set to false, the document store can be loaded as expected, but this creates a document for each line in the .csv.
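The effect of the concatRows flag can be sketched in isolation (toDocuments is illustrative, not the library's internals; in the real code you would pass false to the PapaCSVReader constructor):

```typescript
// Hedged sketch of what PapaCSVReader's concatRows flag effectively does:
// concatRows=true  -> one document containing every row joined together
// concatRows=false -> one document per row (the workaround described above)
function toDocuments(rows: string[], concatRows: boolean): string[] {
  return concatRows ? [rows.join("\n")] : rows;
}

const rows = ["1,alice,10", "2,bob,12"];
console.log(toDocuments(rows, true).length);  // 1 (one huge document)
console.log(toDocuments(rows, false).length); // 2 (one per row)
```

The single concatenated document is what later overflows the embedding model's input limit, while per-row documents stay small.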
@reidperyam the error says that the text is too long to generate an embedding. My guess is that the CSV parser generates very long sentences, can you try:
import { Settings, SimpleNodeParser } from "llamaindex";

Settings.nodeParser = new SimpleNodeParser({
  chunkSize: 512,
  chunkOverlap: 20,
  splitLongSentences: true,
});
I added this code and the issue remains:
I pushed these changes up to the GitHub repro repo @marcusschiesser
Sorry but I cannot reproduce this bug
~/Code/practical-star git:[master]
npm run generate
> nextjs@0.1.0 generate
> tsx app/generate.ts
Using 'openai' model provider
CHUNK_SIZE 512
CHUNK_OVERLAP 20
EMBEDDING_DIM 1024
Generating generateDatasource...
STORAGE_CACHE_DIR ./cache
Generating serviceContextFromDefaults...
Generating storageContextFromDefaults...
No valid data found at path: cache/index_store.json starting new store.
No valid data found at path: ./cache/vector_store.json starting new store.
getting docutments...
document ac67672e-d880-4808-8c22-97071ee2f947 loaded
Generating VectorStoreIndex.fromDocuments...
Storage context successfully generated in 0.006s.
Finished generating storage.
@himself65 But does it actually create a vector_store.json and index_store.json?
I could reproduce the bug with the tokens, and it only created a doc_store.json. If you then run it again with a doc_store.json in /cache, it will show the output as successful like in your snippet above, but it's still not doing anything.
will check this, I think there are some bugs in node parser
I think we shouldn't modify the sentence splitter since there is no grammar for a CSV result. So I think it's better to split the CSV to different documents
@himself65 Additional information.
If you triple the size of titanic_train.csv by copy-pasting the content twice, it works as it should and creates an index_store.json and vector_store.json that contain the same results three times, even though it has more rows and columns.
I could reproduce the error with multiple mock files from different file generators, but both titanic_train.csv and movie_reviews.csv worked even when increasing the size. So it must be something about the underlying CSV structure?
@himself65 I isolated the issue down to the defaultregex used in TextSplitter.ts.
If we extend it to match one or more whitespace characters instead of just one, it fixes the issue:
const defaultregex = /[.?!][])'"`’”]*(?:\s+|$|)/g;
I don't really know enough about document structures to judge if that change would break a lot of stuff or not.
Alternatively, add a customregex param to SentenceSplitter?
The defaultregex above is missing the \ before ] in the character class.
> I think we shouldn't modify the sentence splitter since there is no grammar for a CSV result. So I think it's better to split the CSV to different documents
After further testing, the above "fix" only works with the mock file and still struggles with other files. So true, there is no proper grammar in CSV, so it's probably quite hard to find the right regex. So either:
-> Trim the CSV to a pre-determined structure?
-> Split the document before generating nodes?
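The second option, splitting the document before generating nodes, can be sketched like this (splitCsv is a hypothetical helper, not library code; PapaCSVReader with concatRows=false achieves a similar per-row split):

```typescript
// Split one CSV string into per-row documents so no single node can exceed
// the embedding model's token limit, regardless of punctuation in the data.
function splitCsv(csv: string): string[] {
  const [header, ...rows] = csv.trim().split("\n");
  // Prepend the header to each row so every document keeps its column names.
  return rows.map((row) => `${header}\n${row}`);
}

const docs = splitCsv("id,name\n1,alice\n2,bob\n");
console.log(docs.length); // 2
console.log(docs[0]);     // header plus the first data row
```

Keeping the header on each document is a design choice: it preserves the column semantics that a purely line-based split would lose.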
Bug Description
The following code fails to generate a VectorStoreIndex from a simple .csv file (see the attached MOCK_DATA.csv).
Version
0.2.10
Steps to Reproduce
clone the following github repo:
https://github.com/reidperyam/practical-star/tree/master
see README.md for the repro, which I copy here:
First, populate .env with OPENAI_API_KEY
install the dependencies:
npm install
verify that all contents of /cache directory are removed!
Second, generate the embeddings of the documents in the ./data directory:
npm run generate
EXPECTED RESULT:
generated cache/doc_store.json
generated cache/index_store.json
generated cache/vector_store.json
ACTUAL RESULT:
Error generating text embeddings: (see Relevant Logs/Tracebacks below)
Relevant Logs/Tracebacks