Parse nodes on a para-point level

nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects

MIT License

1.15k stars 113 forks

Hi,

I'm trying to parse a document that has many points, each of which has sub-points. The goal is to split the text point by point and parse each point as a llama-index node. For example, I would like to have this as a single node:

[Image: Screenshot 2024-02-15 123256]

However, when I parse the document and iterate through the chunks (doc.chunks()), the hierarchy of points and sub-points isn't assigned.

All these chunks are independent; their only relationship is with the section heading:

[Image: Screenshot 2024-02-15 1232562]

Based on my understanding, we can probably try the following:

Manually assign the parent node (para) to the 4 sub-points (lists).
Parse the document into nodes at the section level, then split them with a sentence splitter from the llama-index API (might not be optimal).
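The first option can be sketched roughly as follows. This is a minimal sketch, not llmsherpa's own method: it assumes each parsed block exposes a `tag` attribute (such as "para" or "list_item") and a `to_text()` method, and it attaches every list item to the block that precedes it. `FakeBlock` and `group_points` are hypothetical names used only so the example runs without the parsing service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeBlock:
    """Hypothetical stand-in for a parsed block; only `tag` and `to_text()` are used."""
    tag: str
    text: str

    def to_text(self) -> str:
        return self.text


def group_points(blocks) -> List[str]:
    """Merge each block with the list items that directly follow it."""
    groups: List[List[str]] = []
    for block in blocks:
        text = block.to_text()
        if block.tag == "list_item" and groups:
            # Assumption: a list item belongs to the most recent non-list block.
            groups[-1].append(text)
        else:
            # Anything else (para, header, table) starts a new group.
            groups.append([text])
    return ["\n".join(group) for group in groups]


blocks = [
    FakeBlock("para", "1. Main point"),
    FakeBlock("list_item", "a) first sub-point"),
    FakeBlock("list_item", "b) second sub-point"),
    FakeBlock("para", "2. Next point"),
]
merged = group_points(blocks)
```

Each string in `merged` could then be wrapped in a llama-index node (for example a `TextNode`). Note that this heuristic also attaches a list item to a preceding header or table, which may or may not be what you want.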

Kindly let me know if there are any alternatives.

Thanks!

```python
pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
doc = pdf_loader.pdf_reader.read_pdf(pdf_url)
for section in doc.sections():
    chunk = f"{section.to_context_text()}\n\n"
    for child in section.children:
        chunk += child.to_context_text(include_section_info=False)
    print(chunk)
```
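The loop above produces one chunk per section. If one chunk per point is wanted instead, a variation might look like the sketch below. It assumes that children carry a `tag` attribute and that a non-list child marks the start of a new point. `StubBlock`, `StubSection`, and `point_chunks` are hypothetical names that only mimic the two methods used above, so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class StubBlock:
    """Hypothetical stand-in for a paragraph or list item."""
    tag: str
    text: str

    def to_context_text(self, include_section_info: bool = True) -> str:
        return self.text


@dataclass
class StubSection:
    """Hypothetical stand-in for a section with child blocks."""
    title: str
    children: List[StubBlock] = field(default_factory=list)

    def to_context_text(self) -> str:
        return self.title


def point_chunks(sections) -> Iterator[str]:
    """Yield one chunk per point, each prefixed with its section context."""
    for section in sections:
        heading = section.to_context_text()
        current: List[str] = []
        for child in section.children:
            # A non-list block closes the current point and starts a new one.
            if child.tag != "list_item" and current:
                yield "\n".join([heading, *current])
                current = []
            current.append(child.to_context_text(include_section_info=False))
        if current:
            yield "\n".join([heading, *current])


sections = [
    StubSection("3 Terms", [
        StubBlock("para", "3.1 Scope"),
        StubBlock("list_item", "(a) one"),
        StubBlock("list_item", "(b) two"),
        StubBlock("para", "3.2 Fees"),
    ])
]
chunks = list(point_chunks(sections))
```

With a real document, `doc.sections()` would take the place of `sections`. Repeating the heading in every chunk keeps each node self-describing at the cost of some duplicated text.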

Parse nodes on a para-point level #55