nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

Incomplete Document Loading #9

Open KostyaVoronin opened 8 months ago

KostyaVoronin commented 8 months ago

Loading this document (from a local machine) returns only 3 incomplete sections.

Is there any way to load the entire document?

ansukla commented 8 months ago

I checked the document. In this case, the parser is doing an OK job. It is not recognizing the intermediate sections because they lack any visual indication that they are headings.

It is parsing the document fully, though. You can see this by using: HTML(doc.sections()[1].to_html(include_children=True, recurse=True)).

You can use the raw doc.json format to re-process the output so that it better fits your document. For example, you can mark any paragraph that begins with "Section" as a "header" and then increase the level of the following items by 1 until you hit the next section.
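A minimal sketch of that re-processing step, assuming doc.json is a list of block dicts with "tag", "level", and "sentences" keys. Those field names, the helper name promote_section_headers, and the choice to keep nesting every block that follows a promoted header are assumptions for illustration, so check them against the JSON your parser actually returns.

```python
import copy


def promote_section_headers(blocks, prefix="Section"):
    """Return a copy of `blocks` in which paragraphs starting with `prefix`
    are retagged as headers and the blocks after them sit one level deeper.

    Assumes each block is a dict with "tag", "level" and "sentences" keys.
    """
    result = []
    in_section = False  # True once the first "Section ..." paragraph is seen
    for block in blocks:
        block = copy.deepcopy(block)  # leave the caller's data untouched
        sentences = block.get("sentences") or []
        first = sentences[0] if sentences else ""
        text = first.strip() if isinstance(first, str) else ""

        if block.get("tag") == "para" and text.startswith(prefix):
            # Promote the paragraph to a header; its own level stays as is.
            block["tag"] = "header"
            in_section = True
        elif in_section:
            # Nest everything under the most recent promoted header.
            block["level"] = block.get("level", 0) + 1
        result.append(block)
    return result
```

The adjusted list can then be inspected directly or fed back into whatever builds the document tree on your side. Blocks that appear before the first "Section" paragraph are left unchanged.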

We will add this document to our list of things to refine in subsequent releases.