Closed jinkjonks closed 3 weeks ago
I wrote this workaround which extends Document
I also found the same behaviour. I was trying to parse this PDF and noticed that some of the text appears twice in the output of the Document.to_text() method.
The issue occurs when the document tree has the following (or similar) structure:
Section-1
├── Section-2
├── Section-3
Here is the current implementation of the to_text method:
    def to_text(self):
        """
        Returns text of a document by iterating through all the sections '\n'
        """
        text = ""
        for section in self.sections():
            text = text + section.to_text(include_children=True, recurse=True) + "\n"
        return text
Therefore, the text of Section-2 is included in the output when the to_text method is called (recursively) for Section-1 as well as for Section-2. Similarly, the text of Section-3 is also duplicated in the output.
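To make the duplication concrete, here is a minimal, self-contained sketch that reproduces it on the tree above. The Section and Document classes below are simplified stand-ins written for illustration, not the library's real classes, and to_text_duplicating / to_text_deduplicated are hypothetical names. It assumes that sections() returns every section in the tree (not only the top-level ones), which is what the behaviour described above suggests. The deduplicated variant is only one possible fix; it is not necessarily what the pull request does.

```python
class Section:
    """Simplified stand-in for a section node (hypothetical, for illustration)."""

    def __init__(self, title):
        self.title = title
        self.children = []
        self.parent = None

    def add(self, child):
        # Attach a child section and remember its parent.
        child.parent = self
        self.children.append(child)
        return child

    def to_text(self, recurse=True):
        # Own title first, then (optionally) the text of all descendants.
        lines = [self.title]
        if recurse:
            for child in self.children:
                lines.append(child.to_text(recurse=True))
        return "\n".join(lines)


class Document:
    """Simplified stand-in for the document (hypothetical, for illustration)."""

    def __init__(self, roots):
        self.roots = roots

    def sections(self):
        # Assumption: every section in the tree is returned, depth first.
        found = []

        def walk(node):
            found.append(node)
            for child in node.children:
                walk(child)

        for root in self.roots:
            walk(root)
        return found

    def to_text_duplicating(self):
        # Same shape as the quoted to_text: recurse from *every* section.
        text = ""
        for section in self.sections():
            text = text + section.to_text(recurse=True) + "\n"
        return text

    def to_text_deduplicated(self):
        # One possible fix: recurse only from sections that have no parent.
        text = ""
        for section in self.sections():
            if section.parent is None:
                text = text + section.to_text(recurse=True) + "\n"
        return text


# Build the tree from the description: Section-1 with two child sections.
section_1 = Section("Section-1")
section_1.add(Section("Section-2"))
section_1.add(Section("Section-3"))
doc = Document([section_1])

print(doc.to_text_duplicating())   # Section-2 and Section-3 each appear twice
print(doc.to_text_deduplicated())  # each section appears once
```

Recursing only from top-level sections keeps the recursive call intact and changes just the set of starting points. An alternative would be to keep iterating over all sections and call to_text without recursion on each one.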
I have created a pull request #83 which fixes this issue.
Block json:
Expected:
However, the actual output has the h2 and all its children repeated.
Similar output for to_text().
Debug from LayoutReader, which does not have the same problem:
Additional notes:
chunks()
.sections()