nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License
1.15k stars 113 forks source link

[BUG] Duplicated nodes in output from `to_html()` and `to_text()` #79

Closed jinkjonks closed 3 weeks ago

jinkjonks commented 2 months ago

Block json:

[
  {
    "bbox": [],
    "block_class": "nlm-text-header",
    "block_idx": 0,
    "level": 0,
    "page_idx": 0,
    "sentences": ["History"],
    "tag": "header"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-header",
    "block_idx": 1,
    "level": 1,
    "page_idx": 0,
    "sentences": ["Jenny"],
    "tag": "header"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-header",
    "block_idx": 2,
    "level": 2,
    "page_idx": 0,
    "sentences": ["Birth"],
    "tag": "header"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-body",
    "block_idx": 3,
    "level": 3,
    "page_idx": 0,
    "sentences": [
      "Ship of Fools is a satirical allegory in German verse published in 1949 in Basel, Switzerland by the humanist and theologian Segbastian Brant.",
      "An infant Jenny was discovered on board in a wicker basket and tossed overboard to make room for Lord Nelson’s barrel.",
      "She miraculously floated to an island which was later named Australia."
    ],
    "tag": "para"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-header",
    "block_idx": 4,
    "level": 1,
    "page_idx": 0,
    "sentences": ["Cat in the Hat "],
    "tag": "header"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-header",
    "block_idx": 5,
    "level": 2,
    "page_idx": 0,
    "sentences": ["Dr Seuss Novel"],
    "tag": "header"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-body",
    "block_idx": 6,
    "level": 3,
    "page_idx": 0,
    "sentences": [
      "“The Cat in the Hat” was well-received however, future historians agree that this was not Dr Seuss’s best work.",
      "Seuss tried again a year later to garner better affections from his audience by publishing its sequel “The Cat in the Hat Comes Back” to no avail.",
      "Seuss would go on to publish “Green Eggs and Ham” three years later in 1960 which garnered critical acclaim."
    ],
    "tag": "para"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-header",
    "block_idx": 7,
    "level": 2,
    "page_idx": 0,
    "sentences": ["2003 Live Movie Adaptation"],
    "tag": "header"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-body",
    "block_idx": 8,
    "level": 3,
    "page_idx": 0,
    "sentences": ["Widely regarded as a cult classic."],
    "tag": "para"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-header",
    "block_idx": 9,
    "level": 1,
    "page_idx": 0,
    "sentences": ["Hoy’s Self-Defenestration of 1993"],
    "tag": "header"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-body",
    "block_idx": 10,
    "level": 2,
    "page_idx": 0,
    "sentences": ["This is an event."],
    "tag": "para"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-header",
    "block_idx": 11,
    "level": 1,
    "page_idx": 0,
    "sentences": ["Other Notable Events"],
    "tag": "header"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-body",
    "block_idx": 12,
    "level": 2,
    "page_idx": 0,
    "sentences": ["1. The Hyatt Regency Walkway Collapse"],
    "tag": "para"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-body",
    "block_idx": 13,
    "level": 2,
    "page_idx": 0,
    "sentences": ["2. George Miller’s Happy Feet"],
    "tag": "para"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-body",
    "block_idx": 14,
    "level": 2,
    "page_idx": 0,
    "sentences": ["3. Birth of Shadow Wong circa 2012"],
    "tag": "para"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-header",
    "block_idx": 15,
    "level": 1,
    "page_idx": 0,
    "sentences": ["Conclusion"],
    "tag": "header"
  },
  {
    "bbox": [],
    "block_class": "nlm-text-body",
    "block_idx": 16,
    "level": 2,
    "page_idx": 0,
    "sentences": ["Good times."],
    "tag": "para"
  }
]

Expected:

<html>
  <body>
    <h1>History</h1>
    <h2>Jenny</h2>
    <h3>Birth</h3>
    <p>
      Ship of Fools is a satirical allegory in German verse published in 1949 in
      Basel, Switzerland by the humanist and theologian Segbastian Brant. An
      infant Jenny was discovered on board in a wicker basket and tossed
      overboard to make room for Lord Nelson&rsquo;s barrel. She miraculously
      floated to an island which was later named Australia.
    </p>
    <h2>Cat in the Hat</h2>
    <h3>Dr Seuss Novel</h3>
    <p>
      &ldquo;The Cat in the Hat&rdquo; was well-received however, future
      historians agree that this was not Dr Seuss&rsquo;s best work. Seuss tried
      again a year later to garner better affections from his audience by
      publishing its sequel &ldquo;The Cat in the Hat Comes Back&rdquo; to no
      avail. Seuss would go on to publish &ldquo;Green Eggs and Ham&rdquo; three
      years later in 1960 which garnered critical acclaim.
    </p>
    <h3>2003 Live Movie Adaptation</h3>
    <p>Widely regarded as a cult classic.</p>
    <h2>Hoy&rsquo;s Self-Defenestration of 1993</h2>
    <p>This is an event.</p>
    <h2>Other Notable Events</h2>
    <p>1. The Hyatt Regency Walkway Collapse</p>
    <p>2. George Miller&rsquo;s Happy Feet</p>
    <p>3. Birth of Shadow Wong circa 2012</p>
    <h2>Conclusion</h2>
    <p>Good times.</p>
   </body>
</html>

However actual has h2 and all its children repeated

<html>
  <body>
    <h1>History</h1>
    <h2>Jenny</h2>
    <h3>Birth</h3>
    <p>
      Ship of Fools is a satirical allegory in German verse published in 1949 in
      Basel, Switzerland by the humanist and theologian Segbastian Brant. An
      infant Jenny was discovered on board in a wicker basket and tossed
      overboard to make room for Lord Nelson&rsquo;s barrel. She miraculously
      floated to an island which was later named Australia.
    </p>
    <h2>Cat in the Hat</h2>
    <h3>Dr Seuss Novel</h3>
    <p>
      &ldquo;The Cat in the Hat&rdquo; was well-received however, future
      historians agree that this was not Dr Seuss&rsquo;s best work. Seuss tried
      again a year later to garner better affections from his audience by
      publishing its sequel &ldquo;The Cat in the Hat Comes Back&rdquo; to no
      avail. Seuss would go on to publish &ldquo;Green Eggs and Ham&rdquo; three
      years later in 1960 which garnered critical acclaim.
    </p>
    <h3>2003 Live Movie Adaptation</h3>
    <p>Widely regarded as a cult classic.</p>
    <h2>Hoy&rsquo;s Self-Defenestration of 1993</h2>
    <p>This is an event.</p>
    <h2>Other Notable Events</h2>
    <p>1. The Hyatt Regency Walkway Collapse</p>
    <p>2. George Miller&rsquo;s Happy Feet</p>
    <p>3. Birth of Shadow Wong circa 2012</p>
    <h2>Conclusion</h2>
    <p>Good times.</p>
    <h2>Jenny</h2>
    <h3>Birth</h3>
    <p>
      Ship of Fools is a satirical allegory in German verse published in 1949 in
      Basel, Switzerland by the humanist and theologian Segbastian Brant. An
      infant Jenny was discovered on board in a wicker basket and tossed
      overboard to make room for Lord Nelson&rsquo;s barrel. She miraculously
      floated to an island which was later named Australia.
    </p>
    <h3>Birth</h3>
    <p>
      Ship of Fools is a satirical allegory in German verse published in 1949 in
      Basel, Switzerland by the humanist and theologian Segbastian Brant. An
      infant Jenny was discovered on board in a wicker basket and tossed
      overboard to make room for Lord Nelson&rsquo;s barrel. She miraculously
      floated to an island which was later named Australia.
    </p>
    <h2>Cat in the Hat</h2>
    <h3>Dr Seuss Novel</h3>
    <p>
      &ldquo;The Cat in the Hat&rdquo; was well-received however, future
      historians agree that this was not Dr Seuss&rsquo;s best work. Seuss tried
      again a year later to garner better affections from his audience by
      publishing its sequel &ldquo;The Cat in the Hat Comes Back&rdquo; to no
      avail. Seuss would go on to publish &ldquo;Green Eggs and Ham&rdquo; three
      years later in 1960 which garnered critical acclaim.
    </p>
    <h3>2003 Live Movie Adaptation</h3>
    <p>Widely regarded as a cult classic.</p>
    <h3>Dr Seuss Novel</h3>
    <p>
      &ldquo;The Cat in the Hat&rdquo; was well-received however, future
      historians agree that this was not Dr Seuss&rsquo;s best work. Seuss tried
      again a year later to garner better affections from his audience by
      publishing its sequel &ldquo;The Cat in the Hat Comes Back&rdquo; to no
      avail. Seuss would go on to publish &ldquo;Green Eggs and Ham&rdquo; three
      years later in 1960 which garnered critical acclaim.
    </p>
    <h3>2003 Live Movie Adaptation</h3>
    <p>Widely regarded as a cult classic.</p>
    <h2>Hoy&rsquo;s Self-Defenestration of 1993</h2>
    <p>This is an event.</p>
    <h2>Other Notable Events</h2>
    <p>1. The Hyatt Regency Walkway Collapse</p>
    <p>2. George Miller&rsquo;s Happy Feet</p>
    <p>3. Birth of Shadow Wong circa 2012</p>
    <h2>Conclusion</h2>
    <p>Good times.</p>
  </body>
</html>

Similar output for to_text()

Debug from LayoutReader which does not have the same problem:

 header (5) History
- header (1) Jenny
-- header (1) Birth
--- para (0) Ship of Fools is a satirical allegory in German verse published in 1949 in Basel, Switzerland by the humanist and theologian Segbastian Brant.
An infant Jenny was discovered on board in a wicker basket and tossed overboard to make room for Lord Nelson’s barrel.
She miraculously floated to an island which was later named Australia.
- header (2) Cat in the Hat
-- header (1) Dr Seuss Novel
--- para (0) “The Cat in the Hat” was well-received however, future historians agree that this was not Dr Seuss’s best work.
Seuss tried again a year later to garner better affections from his audience by publishing its sequel “The Cat in the Hat Comes Back” to no avail.
Seuss would go on to publish “Green Eggs and Ham” three years later in 1960 which garnered critical acclaim.
-- header (1) 2003 Live Movie Adaptation
--- para (0) Widely regarded as a cult classic.
- header (1) Hoy’s Self-Defenestration of 1993
-- para (0) This is an event.
- header (3) Other Notable Events
-- para (0) 1. The Hyatt Regency Walkway Collapse
-- para (0) 2. George Miller’s Happy Feet
-- para (0) 3. Birth of Shadow Wong circa 2012
- header (1) Conclusion
-- para (0) Good times.

Additional notes:

jinkjonks commented 1 month ago

I wrote this workaround which extends Document

AaryanTR commented 1 month ago

I also found the same behaviour. I was trying to parse this pdf and noticed that some of the text appears twice in the output of Document.to_text() method.

The issue occurs when the document tree has the following (or similar) structure:

Section-1
├── Section-2
├── Section-3

Here is the current implementation of the to_text method:

def to_text(self):
    """
    Returns text of a document by iterating through all the sections '\n'
    """
    text = ""
    for section in self.sections():
        text = text + section.to_text(include_children=True, recurse=True) + "\n"
    return text

Therefore, the text of Section-2 is included in the output when the to_text method is called (recursively) for Section-1 as well as forSection-2. Similarly, the text of Section-3 is also duplicated in the output.

AaryanTR commented 1 month ago

I have created a pull request #83 which fixes this issue.