nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License
1.15k stars, 113 forks

Parse nodes on a para-point level #55

Open irash03 opened 4 months ago

irash03 commented 4 months ago

Hi,

I'm trying to parse a document that contains many points, each of which has sub-points. The goal is to split the text point-wise and parse each point as a llama-index node. For example, I would like to have this as a single node:

Screenshot 2024-02-15 123256

However, when I parse the document and iterate through the chunks (doc.chunks()), the hierarchy between points and sub-points is not assigned.

The chunks are all independent: each one is related only to the section heading, not to the other chunks:

Screenshot 2024-02-15 1232562

Based on my understanding, we can probably try the following:

  1. Manually assign the parent node (the para) to its 4 sub-points (the list items).
  2. Parse the document into nodes at the section level, then split them with the llama-index sentence splitters (might not be optimal).
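
A minimal sketch of option 1, under stated assumptions: each chunk exposes a tag such as "para" or "list_item", and sub-points directly follow their parent paragraph in reading order. The helper name group_points and the sample texts are hypothetical. The function works on plain (tag, text) pairs, so it does not call llmsherpa itself.

```python
from typing import Iterable, List, Tuple


def group_points(chunks: Iterable[Tuple[str, str]]) -> List[str]:
    """Merge each chunk with the list items that immediately follow it.

    `chunks` is a sequence of (tag, text) pairs in reading order, for
    example built as [(c.tag, c.to_text()) for c in doc.chunks()].
    The tag name "list_item" is an assumption about the parser output.
    """
    groups: List[List[str]] = []
    for tag, text in chunks:
        if tag == "list_item" and groups:
            # Treat the list item as a sub-point of the most recent group.
            groups[-1].append(text)
        else:
            # Any other chunk (or a leading list item) starts a new group.
            groups.append([text])
    return ["\n".join(parts) for parts in groups]


# Made-up sample input, standing in for real parser output.
sample = [
    ("para", "1. Eligibility criteria:"),
    ("list_item", "a. First condition"),
    ("list_item", "b. Second condition"),
    ("para", "2. Exclusions:"),
    ("list_item", "a. Only exclusion"),
]
for node_text in group_points(sample):
    print(node_text)
    print("---")
```

Each returned string could then be wrapped in a llama-index node. Note that this attaches a list item to whatever chunk precedes it, so a list that follows a table or heading would be merged into that chunk instead.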

Kindly let me know if there are any alternatives.

Thanks!

Avinash-Raj commented 2 months ago

@irash03 As you suggested in point 2, you could build one whole chunk per section by concatenating the text of that section's children, like this:

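# Import path for recent llama-index releases; adjust it for your installed version.
from llama_index.readers.smart_pdf_loader import SmartPDFLoader

# llmsherpa_api_url and pdf_url are assumed to be defined already.
pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
doc = pdf_loader.pdf_reader.read_pdf(pdf_url)
for section in doc.sections():
    # Start each chunk with the section heading and its parent context.
    chunk = f"{section.to_context_text()}\n\n"
    for child in section.children:
        # Append each direct child without repeating the section info.
        chunk += child.to_context_text(include_section_info=False) + "\n"
    print(chunk)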