nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

`to_text()` returns empty text when the document doesn't have sections #73

Open livelxw opened 2 months ago

livelxw commented 2 months ago

I found that `to_text()` only reads sections:

    def to_text(self):
        """
        Returns text of a document by iterating through all the sections '\n'
        """
        text = ""
        for section in self.sections():
            text = text + section.to_text(include_children=True, recurse=True) + "\n"
        return text

and `self.sections()` reads child nodes of `root_node` with tag `header`:

    def sections(self):
        """
        Returns all the sections in the block. This is useful for getting all the sections in a document.
        """
        sections = []
        def chunk_collector(node):
            if node.tag in ['header']:
                sections.append(node)
        self.iter_children(self, 0, chunk_collector)
        return sections

When the response from the nlm-ingestor server doesn't contain any sections, the function returns an empty string. Should it get the text from all children of `root_node` instead?
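The empty result can be reproduced without a running server. The sketch below is only an illustration: `Node`, `iter_children`, `sections`, and `sections_to_text` are hypothetical toy stand-ins that mimic the logic quoted above, not the real llmsherpa classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a layout block (hypothetical, not the llmsherpa class)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def to_text(self, include_children=False, recurse=False):
        parts = [self.text] if self.text else []
        if include_children:
            for child in self.children:
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


def iter_children(node, visitor):
    """Depth-first walk over every descendant of node."""
    for child in node.children:
        visitor(child)
        iter_children(child, visitor)


def sections(root):
    """Collect every descendant tagged 'header', like the quoted sections()."""
    found = []

    def collect(node):
        if node.tag == "header":
            found.append(node)

    iter_children(root, collect)
    return found


def sections_to_text(root):
    """Same loop as the quoted to_text(): only header nodes contribute text."""
    text = ""
    for section in sections(root):
        text = text + section.to_text(include_children=True, recurse=True) + "\n"
    return text


# A document with paragraphs but no header nodes at all.
root = Node("root", children=[
    Node("para", "First paragraph."),
    Node("para", "Second paragraph."),
])
result = sections_to_text(root)
print(repr(result))  # ''
```

Because no node carries the `header` tag, the loop body never runs and both paragraphs are dropped.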

thomastiotto commented 2 months ago

I'd also like to add that `Document.to_text()` outputs duplicated text, because it calls `to_text()` on every section and sections can be children of other sections.

In this example, `self.sections()[0]` (`block_idx=1`) is the parent of `self.sections()[1]` (`block_idx=2`), so obviously calling `to_text()` on both will result in duplicated text.

[Image: Screenshot 2024-04-25 at 11 16 35]
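The duplication can also be shown with a small self-contained sketch. `Node` and `collect_sections` are hypothetical toy stand-ins that assume the same traversal as the quoted `sections()`; they are not the real llmsherpa classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a layout block (hypothetical, not the llmsherpa class)."""
    tag: str
    text: str
    children: List["Node"] = field(default_factory=list)

    def to_text(self, include_children=False, recurse=False):
        parts = [self.text]
        if include_children:
            for child in self.children:
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


def collect_sections(node, found):
    """Depth-first collection of every descendant tagged 'header'."""
    for child in node.children:
        if child.tag == "header":
            found.append(child)
        collect_sections(child, found)
    return found


# A section that contains a subsection, as in the example above.
subsection = Node("header", "1.1 Background", [Node("para", "Background text.")])
section = Node("header", "1 Intro", [Node("para", "Intro text."), subsection])
root = Node("root", "", [section])

sections = collect_sections(root, [])
text = ""
for s in sections:
    text = text + s.to_text(include_children=True, recurse=True) + "\n"

print(text)
print(text.count("Background text."))  # 2
```

The subsection is emitted once as part of its parent and once on its own, so its paragraph appears twice.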

I also think it would make more sense to implement `to_text()` on `Document.root_node`, which is the behaviour I expected before looking through the documentation. Something like this seems much more logical, and it seems to work on a simple PDF:

    def to_text(self):
        text = ""
        for n in self.root_node.children:
            text = text + n.to_text(include_children=True, recurse=True) + "\n"
        return text
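A quick check of this approach against both problems in this thread, again as a sketch only: `Node` and `document_to_text` are hypothetical toy stand-ins, so this says nothing about how the real llmsherpa blocks behave on complex PDFs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a layout block (hypothetical, not the llmsherpa class)."""
    tag: str
    text: str
    children: List["Node"] = field(default_factory=list)

    def to_text(self, include_children=False, recurse=False):
        parts = [self.text]
        if include_children:
            for child in self.children:
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


def document_to_text(root):
    """Proposed approach: visit only the root's direct children and recurse from there."""
    text = ""
    for n in root.children:
        text = text + n.to_text(include_children=True, recurse=True) + "\n"
    return text


# Case 1: a document without any header nodes.
flat = Node("root", "", [
    Node("para", "First paragraph."),
    Node("para", "Second paragraph."),
])

# Case 2: a section containing a subsection.
subsection = Node("header", "1.1 Background", [Node("para", "Background text.")])
section = Node("header", "1 Intro", [Node("para", "Intro text."), subsection])
nested = Node("root", "", [section])

flat_text = document_to_text(flat)
nested_text = document_to_text(nested)

print(repr(flat_text))
print(nested_text.count("Background text."))  # 1
```

Each node is reached through exactly one path from the root, so header-less documents still produce text and nested sections are emitted only once.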