nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

bug: missing text from parsed pdf #75

Open fede-bello opened 1 month ago

I’ve encountered an issue where some text is missing after parsing certain PDFs. In the attached example, the text USD 700 disappears during the parsing process.

In the following code, the pages still contain all the content:

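from bs4 import BeautifulSoup

# Parse the HTML produced by Tika and dump it for inspection.
soup = BeautifulSoup(str(tika_html_doc), "html.parser")
print("Soup", soup)

# Look for the document title among the <meta> tags.
meta_tags = soup.find_all("meta")
title = None
for tag in meta_tags:
    # Use .get() so <meta> tags without a "name" attribute don't raise KeyError.
    if tag.get("name", "").endswith(":title"):
        title = tag["content"]
        break

# Collect the per-page <div class="page"> elements; all text is still present here.
pages = soup.find_all("div", class_=lambda x: x in ["page"])
print("Pages", pages)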

However, the blocks inside the parsed document are missing some content:

parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
print("Parsed_doc", parsed_doc.blocks)

I wasn’t able to debug it completely, but I believe the problem lies within the parse function. I’m not certain whether this is a bug or whether BeautifulSoup is misinterpreting USD 700 as a header when it clearly isn’t. The main problem is that, in this example, the dropped text was a fairly important title, and it did not resemble a header at all.
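To narrow down exactly which text goes missing between the two steps, one option is to compare the tokens in the page text with the tokens in the block text. The following is only a minimal sketch: `tokenize` and `missing_tokens` are hypothetical helpers, not part of nlm-ingestor, and the code assumes you can get one plain string per page (for example via `page.get_text()`) and one plain string per parsed block (the exact field that holds a block's text is an assumption to check against `parsed_doc.blocks`).

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase, whitespace-delimited tokens."""
    return re.findall(r"\S+", text.lower())


def missing_tokens(page_texts, block_texts):
    """Return tokens that occur more often in the pages than in the blocks.

    page_texts:  list of strings, one per page of the source HTML.
    block_texts: list of strings, one per parsed block (assumed to be
                 extractable from the parsed document).
    """
    page_counts = Counter(tok for text in page_texts for tok in tokenize(text))
    block_counts = Counter(tok for text in block_texts for tok in tokenize(text))
    # Counter subtraction keeps only positive counts, i.e. tokens that
    # were present in the pages but lost by the time blocks were built.
    return page_counts - block_counts


if __name__ == "__main__":
    # Made-up strings that mimic the situation described in this issue.
    pages = ["Invoice total USD 700 Payment due in 30 days"]
    blocks = ["Invoice total", "Payment due in 30 days"]
    print(sorted(missing_tokens(pages, blocks)))  # ['700', 'usd']
```

Running a comparison like this on the real document should show whether USD 700 is the only text that gets dropped or whether other short standalone lines are affected as well.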

Any help is appreciated.

Example pdf:

Here is the PDF that has been causing me problems. It's not complete for privacy reasons, but it's the minimal example I found that causes this problem. If I edit it a little, for example by adding text next to the USD, the problem no longer occurs.

problematic-pdf.pdf