`to_text()` returns empty text when the document doesn't have sections

nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects

MIT License


def to_text(self):
    """
    Returns text of a document by iterating through all the sections '\n'
    """
    text = ""
    for section in self.sections():
        text = text + section.to_text(include_children=True, recurse=True) + "\n"
    return text

def sections(self):
    """
    Returns all the sections in the block. This is useful for getting all the sections in a document.
    """
    sections = []
    def chunk_collector(node):
        if node.tag in ['header']:
            sections.append(node)
    self.iter_children(self, 0, chunk_collector)
    return sections

I'd also like to add that Document.to_text() outputs duplicated text: it calls to_text() on every section, and sections can be children of other sections.

In this example, self.sections()[0] (block_idx=1) is the parent of self.sections()[1] (block_idx=2), so calling to_text() on both emits the child section's text twice.

[Image: Screenshot 2024-04-25 at 11 16 35]
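Both symptoms can be reproduced without a PDF. The following is a minimal sketch, assuming a toy tree: `Node`, `collect_sections` and `sections_to_text` are hypothetical stand-ins written for this illustration, not llmsherpa's real classes. They only mimic the logic quoted above (collect every `header` node at any depth, then concatenate each one's recursive text).

```python
class Node:
    """Hypothetical stand-in for a layout block; not llmsherpa's real class."""

    def __init__(self, tag, text, children=None):
        self.tag = tag
        self.text = text
        self.children = children or []

    def to_text(self, include_children=False, recurse=False):
        # Own text first, then (optionally) the text of the children.
        parts = [self.text]
        if include_children:
            for child in self.children:
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


def collect_sections(root):
    """Collect every 'header' node at any depth, roughly as sections() does."""
    found = []

    def visit(node):
        for child in node.children:
            if child.tag == "header":
                found.append(child)
            visit(child)

    visit(root)
    return found


def sections_to_text(root):
    """Same loop as the quoted to_text(): one recursive dump per section."""
    text = ""
    for section in collect_sections(root):
        text = text + section.to_text(include_children=True, recurse=True) + "\n"
    return text


# A header nested inside another header: the inner text is emitted twice.
nested = Node("root", "", [
    Node("header", "1 Intro", [
        Node("header", "1.1 Scope", [Node("para", "Scope text.")]),
    ]),
])
# A document with no headers at all: nothing is collected, so nothing is emitted.
flat = Node("root", "", [Node("para", "Only a paragraph.")])

nested_text = sections_to_text(nested)
flat_text = sections_to_text(flat)
print(nested_text.count("Scope text."))  # 2
print(repr(flat_text))                   # ''
```

In this toy model the nested paragraph appears once under its parent section and once under its own section, and the header-less document yields an empty string, which matches the two behaviours reported in this issue.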

I also think it would make more sense to implement to_text() on top of Document.root_node, which is the behaviour I expected before looking through the documentation. Wouldn't it be more logical to simply do the following? It seems to work on a simple PDF:

def to_text(self):
    text = ""
    for n in self.root_node.children:
        text = text + n.to_text(include_children=True, recurse=True) + "\n"
    return text
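To check the idea outside the library, here is a small sketch under the same kind of assumption: `Node` and `root_to_text` are hypothetical names made up for this illustration, and the toy `to_text()` only approximates how the real blocks render themselves. It applies the root-children loop to a nested document and to a header-less one.

```python
class Node:
    """Hypothetical minimal block; stands in for the library's node type."""

    def __init__(self, text, children=None):
        self.text = text
        self.children = children or []

    def to_text(self, include_children=False, recurse=False):
        parts = [self.text]
        if include_children:
            for child in self.children:
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


def root_to_text(root):
    """Walk only the root's direct children; recursion covers the rest."""
    text = ""
    for n in root.children:
        text = text + n.to_text(include_children=True, recurse=True) + "\n"
    return text


# Nested sections: each block is reached exactly once through its parent.
nested_root = Node("", [
    Node("1 Intro", [
        Node("1.1 Scope", [Node("Scope text.")]),
    ]),
])
# No sections at all: the paragraph is still a child of the root.
flat_root = Node("", [Node("Only a paragraph.")])

nested_out = root_to_text(nested_root)
flat_out = root_to_text(flat_root)
print(nested_out.count("Scope text."))  # 1
print(repr(flat_out))                   # 'Only a paragraph.\n'
```

Because every block is a descendant of exactly one top-level child, starting from the root avoids both the duplication and the empty result, at least in this simplified model. Whether the real block types need special handling (tables, list items) is not covered here.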

`to_text()` returns empty text when the document doesn't have sections #73