nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

Not able to get all the subsection names inside a section #36

Open Amy-raj opened 6 months ago

Amy-raj commented 6 months ago

Hi, I am using the attached PDF for testing. There is no whitespace between the subsection titles and the subsection content, and the parser is not able to extract all the subsection titles within a section. I tried a different PDF that does have whitespace there, and it worked well. Could you please advise how to extract a specific subsection title along with its corresponding content? (Attachment: RWXcE3.pdf.pdf)

ansukla commented 6 months ago

Hi Amy-raj,

The sections seem to parse quite well. You can get the first-level sections by traversing the children of the root, and then get the next level of sections by traversing the children of each section. Hope this helps. (Screenshot: 2023-12-20_08-26-58)
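The two-level traversal described in this reply could look roughly like the sketch below. It is only an illustration under stated assumptions: `Node` is a hypothetical stand-in for a parsed block (not an llmsherpa class), the tree and its titles are made up, and the sketch assumes that section nodes carry the tag "header" and expose `title` and `children`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed block; not an llmsherpa class."""
    tag: str                      # assumption: "header" marks a section
    title: str = ""
    children: List["Node"] = field(default_factory=list)


def child_sections(node):
    """Return the direct children of `node` that are sections."""
    return [c for c in node.children if c.tag == "header"]


def two_level_outline(root):
    """Map each first-level section title to its direct subsection titles."""
    return {
        section.title: [sub.title for sub in child_sections(section)]
        for section in child_sections(root)
    }


# Toy tree for illustration only; the titles are invented.
root = Node("root", children=[
    Node("header", "1. DEFINITIONS", [Node("para")]),
    Node("header", "2. TERM", [
        Node("header", "2.1 Term"),
        Node("para"),
        Node("header", "2.2 Renewal"),
    ]),
])
outline = two_level_outline(root)
```

With a real parse, the same function would presumably be called on the document's root node (to my knowledge `doc.root_node` on the object returned by `LayoutPDFReader.read_pdf`), but that should be checked against the installed llmsherpa version.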

Amy-raj commented 6 months ago

Hi, I am using the code below to extract subsections, and it is not able to extract all of them. For example, for the “TERM AND TERMINATION” section, it extracts only 4 subsections although 6 are present. I am seeing this issue with many sections in the PDF. You can check the code and output in the image below. (Image: IMG_2093)
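One way to narrow down a mismatch like this is to dump every block under the section together with its tag and depth, so that a subsection title that was merged into a paragraph or nested one level deeper becomes visible. The sketch below is not the code from the image; `Node`, `find_section`, and `dump_descendants` are hypothetical names, the toy subsections are invented, and it assumes sections are tagged "header".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed block; not an llmsherpa class."""
    tag: str                      # assumption: "header" marks a section
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def find_section(node, title_fragment):
    """Depth-first search for the first header whose text contains the fragment."""
    for child in node.children:
        if child.tag == "header" and title_fragment.lower() in child.text.lower():
            return child
        found = find_section(child, title_fragment)
        if found is not None:
            return found
    return None


def dump_descendants(node, depth=0):
    """List (depth, tag, text) for every block below `node`, in document order."""
    rows = []
    for child in node.children:
        rows.append((depth, child.tag, child.text))
        rows.extend(dump_descendants(child, depth + 1))
    return rows


# Toy tree for illustration only: the second subsection title is merged
# into a paragraph, so it never shows up as a "header" child.
root = Node("root", children=[
    Node("header", "TERM AND TERMINATION", [
        Node("header", "Term", [Node("para", "This agreement starts ...")]),
        Node("para", "Renewal. The term renews ..."),
    ]),
])
section = find_section(root, "term and termination")
rows = dump_descendants(section)
for depth, tag, text in rows:
    print("  " * depth + f"[{tag}] {text}")
```

In this toy tree only "Term" appears as a header, while "Renewal." sits at the start of a paragraph. If the real output shows the same pattern, the missing titles were likely parsed as paragraph text rather than dropped. With actual llmsherpa blocks the text would have to be read through the library's own accessors (for example `to_text()`), which is an assumption to verify.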