Open livelxw opened 2 months ago
I'd also like to add that calling `Document.to_text()` outputs duplicated text, as it's being called on each section and sections can be children of other sections.
In this example, `self.sections()[0]` (`block_idx=1`) is the parent of `self.sections()[1]` (`block_idx=2`), so obviously calling `to_text()` on both will result in duplicated text.
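To make the duplication concrete, here is a minimal sketch on a toy tree. The `Node` class and `iter_sections` helper are hypothetical stand-ins rather than the library's real classes, and I'm assuming `to_text(include_children, recurse)` descends into children the way the flags suggest:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a layout node, not the library's real class."""
    text: str
    tag: str = "para"
    children: List["Node"] = field(default_factory=list)

    def to_text(self, include_children=False, recurse=False):
        # Assumed semantics: own text first, then children if requested.
        parts = [self.text]
        if include_children:
            for child in self.children:
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


def iter_sections(node):
    """Yield every header node in the tree, nested ones included."""
    for child in node.children:
        if child.tag == "header":
            yield child
        yield from iter_sections(child)


# A top-level section that contains a subsection.
sub = Node("1.1 Details", "header", [Node("Detail paragraph.")])
top = Node("1 Intro", "header", [Node("Intro paragraph."), sub])
root = Node("", "root", [top])

# Calling to_text() on every section visits the subsection twice:
# once through its parent and once on its own.
per_section = "\n".join(
    s.to_text(include_children=True, recurse=True) for s in iter_sections(root)
)

# Walking only the root's direct children visits each node once.
from_root = "\n".join(
    c.to_text(include_children=True, recurse=True) for c in root.children
)

print(per_section.count("1.1 Details"), from_root.count("1.1 Details"))
```

In this toy tree the subsection heading appears twice in the per-section output and once when starting from the root's children.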
I also think it would make more sense to have `to_text()` implemented on the `Document.root_node`, which was the behaviour I was expecting before looking through the documentation.
It seems much more logical to simply do this? It seems to work on a simple PDF:
def to_text(self):
    text = ""
    for n in self.root_node.children:
        text = text + n.to_text(include_children=True, recurse=True) + "\n"
    return text
I found that `to_text()` reads sections, and `self.sections()` reads the child nodes of `root_node` with tag `header`. When the response from the nlm-ingestor server doesn't contain sections, the function will return an empty string. Should it get text from all children of `root_node`?
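A rough sketch of what that fallback could look like, again on a toy tree. `Node` and `document_text` are hypothetical names for illustration only, and the `to_text` semantics are assumed rather than taken from the library:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a layout node, not the library's real class."""
    text: str
    tag: str = "para"
    children: List["Node"] = field(default_factory=list)

    def to_text(self, include_children=False, recurse=False):
        # Assumed semantics: own text first, then children if requested.
        parts = [self.text]
        if include_children:
            for child in self.children:
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


def document_text(root):
    """Use header children when present, otherwise all children of the root."""
    sections = [c for c in root.children if c.tag == "header"]
    nodes = sections if sections else root.children
    return "\n".join(
        n.to_text(include_children=True, recurse=True) for n in nodes
    )


# A response with no header nodes at all, only paragraphs.
no_sections = Node("", "root", [Node("First paragraph."), Node("Second paragraph.")])
fallback_text = document_text(no_sections)
print(fallback_text)
```

One caveat with this variant: if a document has both headers and top-level paragraphs outside any section, those paragraphs are still skipped. Iterating over all children of the root unconditionally would avoid that.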