run-llama / LlamaIndexTS

Data framework for your LLM applications. Focus on server side solution
https://ts.llamaindex.ai
MIT License

Parsing error when attempting to generate vector store #1407

Open george-wb opened 2 days ago

george-wb commented 2 days ago

Describe the bug

I have a Node/Express server hosted on AWS that runs all of my RAG code. All 45 of my documents are parsed with LlamaParse and then downloaded as Markdown files. I host those docs in a CMS, request the data from my Node/Express server, create a new Document instance for each one, and then use that array of Documents to create a vector store. This works for roughly 85% of my documents, but for a handful of them I get one of these parsing errors:

peg$SyntaxError: Expected "http://", "https://", [([{"'`‘], [0-9], [^ \t\n\r!?([}"`)\]}"`0-9@], [^ \t\n\r!?.([})\]}`"0-9@], [a-z0-9], or [a-z] but "\n" found.

or

peg$SyntaxError: Expected "http://", "https://", [0-9], [^ \t\n\r!?([}"`)\]}"`0-9@], [^ \t\n\r!?.([})\]}`"0-9@], [a-z0-9], or [a-z] but "`" found.

Is this something that happens when certain documents get too long? Some of the rejected docs are pretty lengthy, but I didn't think there'd be a limit to how many lines a document can have.

To Reproduce

It seems to depend on the Markdown file. As I said, I've parsed all of my files using LlamaParse and saved the resulting Markdown files to my CMS. I'm not sure why the characters "\n" and "`" are throwing an error; in my other files those characters pass the parser without any issue.

Code to reproduce the behavior:

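import { Document, VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

// Fetch the documents from the CMS
const docs = [];
const response = await fetch(...);
// Assumes the CMS responds with a JSON array of { name, text } objects
const docsFromCms = await response.json();

for (const doc of docsFromCms) {
    try {
        docs.push(
            new Document({
                name: doc.name,
                text: doc.text,
                metadata: {...}
            })
        );
    } catch (error) {
        console.log("something went wrong", error);
    }
}

// Persist the index under ./cache/storage
const storageContext = await storageContextFromDefaults({
    persistDir: "./cache/storage",
});
console.log("storage context created");

// Build the vector index from the documents
const indexStore = await VectorStoreIndex.fromDocuments(docs, {
    storageContext,
});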
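
Since the failure depends on the individual Markdown file, it may help to isolate the offending documents before building the index. The sketch below is a generic helper and not part of LlamaIndexTS: `findFailingDocs`, `NamedDoc`, and the `parse` callback are hypothetical names, and the callback stands in for whatever parsing step throws the `peg$SyntaxError`.

```typescript
// Hypothetical helper: run a parsing step on each document separately and
// collect the ones that throw, instead of failing on the whole batch.
interface NamedDoc {
  name: string;
  text: string;
}

interface ParseFailure {
  name: string;
  message: string;
}

function findFailingDocs(
  docs: NamedDoc[],
  parse: (text: string) => void, // stand-in for the real parsing step
): ParseFailure[] {
  const failures: ParseFailure[] = [];
  for (const doc of docs) {
    try {
      parse(doc.text);
    } catch (error) {
      // Keep the document name and the error text so the file can be inspected
      const message = error instanceof Error ? error.message : String(error);
      failures.push({ name: doc.name, message });
    }
  }
  return failures;
}
```

Running this over the full set should narrow the problem down to the handful of files that fail, which can then be reduced by hand to the smallest snippet that still triggers the error.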

Expected behavior

I expect the Markdown files I parsed with LlamaParse to be usable by my RAG system so that I can query their contents. Instead, I get an error about characters that should be allowed.


george-wb commented 2 days ago

Here's a screenshot of an example of the error I'm getting: [image: Screen Shot 2024-10-28 at 5 04 54 PM]

george-wb commented 7 hours ago

@himself65 I want to add some more info as I've been trying to figure out what's tripping up the parser:

The files that throw these errors all contain JSON examples. Each example is marked up in standard Markdown, with three backticks plus "json" at the beginning and three backticks at the end of the JSON block, like this:

{
   "hello": "world",
   "moreObjects":
      [
         { "nested": "inside" },
         { "like": "this" }
      ]
}

I've verified that all of the JSON is valid. I've also tried removing the backticks and leaving the raw JSON in place. Neither variant gets past the parser. Only removing the JSON examples altogether lets the file pass, which is not ideal since the examples are integral to my RAG server.

With backticks around the JSON examples, the parser complains about the "`" character; without them, it complains about the "\n" character.
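
To rule out malformed JSON as a cause, the fenced examples can be checked programmatically rather than by eye. The following is a minimal sketch, assuming the examples are fenced with three backticks plus "json" as described above; `checkJsonBlocks` is a hypothetical helper name, not a LlamaIndexTS function.

```typescript
// Extract every fenced "json" block from a Markdown string and report
// whether its body parses as JSON.
interface JsonBlockCheck {
  body: string;
  valid: boolean;
}

function checkJsonBlocks(markdown: string): JsonBlockCheck[] {
  // Build the fence marker at runtime so this snippet can itself be fenced
  const marker = "`".repeat(3);
  const fence = new RegExp(
    marker + "json[ \\t]*\\r?\\n([\\s\\S]*?)\\r?\\n[ \\t]*" + marker,
    "g",
  );
  const results: JsonBlockCheck[] = [];
  let match: RegExpExecArray | null = fence.exec(markdown);
  while (match !== null) {
    const body = match[1] ?? "";
    let valid = true;
    try {
      JSON.parse(body);
    } catch {
      valid = false;
    }
    results.push({ body, valid });
    match = fence.exec(markdown);
  }
  return results;
}
```

If every block comes back with `valid: true` and the file still fails, that supports the observation that the tokenizer, not the JSON itself, is the source of the error.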

Information gathered from node-parser index.js file:

While it was failing, I added some logging to ./node_modules/@llamaindex/core/dist/node-parser/index.js that could be useful.

Conditions: the file has JSON examples but no backticks.

I noticed that the test in peg$parseOpenSymbol logs true for quotation marks, square brackets, and backticks, but logs false for the string with the curly bracket. Inside that function I'm logging the following:

console.log("hit peg$parseOpenSymbol", peg$currPos, input.charAt(peg$currPos), peg$c37.test(input.charAt(peg$currPos)))

[image: Screen Shot 2024-10-30 at 5 38 44 PM]

As you can see in the screenshot, there is a "{" open symbol, but the peg$parseOpenSymbol function is registering the "\n" character instead.
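
The first error message lists the class [([{"'`‘] among the expected inputs, which looks like the set of open symbols. Assuming that this is roughly what peg$c37 tests (an assumption; the generated constant is not shown in this thread), the sketch below reproduces the logged check outside the parser. `isOpenSymbol` and `describePosition` are hypothetical names.

```typescript
// Assumption: this character class is copied from the "Expected ..." list in
// the error message; whether peg$c37 holds exactly this class is unconfirmed.
const openSymbol = /^[(\[{"'`‘]$/;

function isOpenSymbol(ch: string): boolean {
  return openSymbol.test(ch);
}

// Report the character at a position together with the class test,
// mirroring the console.log added inside peg$parseOpenSymbol.
function describePosition(
  input: string,
  pos: number,
): { char: string; open: boolean } {
  const char = input.charAt(pos);
  return { char, open: isOpenSymbol(char) };
}
```

Under that assumption, "{" passes the class test and "\n" does not, so a log line showing "\n" means the function was called at a position other than the brace itself, which is consistent with the screenshot.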

Finally, at the end of the peg$parse function, there's an if/else statement that checks whether peg$result is a failure. When the else (failed) branch runs, this is what console.log(peg$result) prints:

[
 'brandedGraphics: false\n' +
   'creativeConcept: false\n' +
   'images: false\n' +
   'pageGeneration: true\n' +
   'profileCreation: false\n' +
   'sectionGeneration: false\n' +
   'sitemapCreation: true\n' +
   'visualDesign: false\n' +
   '\n' +
   '# AI Sitemap Examples\n' +
   '\n' +
   '## AI Startup 1\n' +
   '\n' +
   '### Customer Profile\n' +
   '\n' +
   '**Company Name:**\n' +
   '\n' +
   'ForeSight AI\n' +
   '\n' +
   '**Industry:**\n' +
   '\n' +
   'AI - Predictive Analytics\n' +
   '\n' +
   '**Company Description:**\n' +
   '\n' +
   'ForeSight AI leverages advanced machine learning algorithms to transform historical data into actionable future insights.',
 'By analyzing trends, behaviors, and patterns, ForeSight AI enables businesses to anticipate customer needs, optimize inventory, and make proactive decisions.',
 "ForeSight AI's platform adapts to various industries, including retail, finance, and healthcare, providing customized predictions that empower organizations to stay ahead of market shifts and demand fluctuations.",
 '**Value Proposition:**\n' +
   '\n' +
   'ForeSight AI provides a powerful predictive analytics platform that helps organizations turn data into future-focused insights.',
 'With scalable algorithms and tailored forecasting solutions, ForeSight AI reduces uncertainty, optimizes operational decisions, and enhances strategic planning.',
 '**Specialization:**\n' +
   '\n' +
   'ForeSight AI specializes in demand forecasting, customer behavior prediction, and trend analysis, offering customizable, industry-specific algorithms that fit a variety of business models.',
 '**Website Audience:**\n' +
   '\n' +
   'The website targets business leaders, data scientists, and operations managers across industries such as retail, finance, and healthcare who seek to improve decision-making with predictive analytics.',
 '**Location:**\n' +
   '\n' +
   'New York, NY\n' +
   '\n' +
   '**Differentiators:**\n' +
   '\n' +
   'ForeSight AI stands out with its flexible, adaptive algorithms and industry-specific prediction models that offer businesses unique insights tailored to their operational needs and growth strategies.',
 '**Website Goals:**\n' +
   '\n' +
   'The website aims to attract executives and data professionals, encourage demo requests, and position ForeSight AI as a leading predictive analytics solution that empowers proactive business decisions.',
 '**Brand Personality:**\n' +
   '\n' +
   'ForeSight AI is intelligent, forward-thinking, and precise, focused on delivering clear, data-driven insights that inspire confident decisions.',
 '**Visual Styles:**\n' +
   '\n' +
   'The design is sleek and data-centric, with modern, structured layouts, sophisticated visualizations, and a minimalist color scheme that reflects innovation and accuracy.',
 '### Sitemap\n\n{\n"pages":'
]

himself65 commented 7 hours ago

Can you upgrade to the latest version of LITS? I just debugged this, and we have a try/catch here to ignore the error. Our upstream is somehow buggy.
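
For readers stuck on an older version, the same try/catch idea can be sketched on the caller's side. This is not LlamaIndexTS's actual code: `splitWithFallback` and `strictTokenizer` are hypothetical names, and the fallback is a deliberately naive sentence split.

```typescript
// Hypothetical wrapper: try the strict tokenizer first and fall back to a
// naive split if it throws, so one bad document does not abort the batch.
function splitWithFallback(
  text: string,
  strictTokenizer: (text: string) => string[],
): string[] {
  try {
    return strictTokenizer(text);
  } catch {
    // Naive fallback: a run of non-terminator characters, then any
    // terminators (., !, ?) and trailing whitespace.
    const parts = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
    return parts
      .map((part) => part.trim())
      .filter((part) => part.length > 0);
  }
}
```

The fallback produces coarser chunks than a real sentence tokenizer would, so it is best treated as a stopgap until the upgrade is in place.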