I’ve encountered an issue where some text is missing after parsing certain PDFs. In the attached example, the text USD 700 disappears during the parsing process.
At this point in the code, the pages still contain all of the content:
from bs4 import BeautifulSoup

soup = BeautifulSoup(str(tika_html_doc), "html.parser")
print("Soup", soup)
# Look for the document title among the <meta> tags
meta_tags = soup.find_all("meta")
title = None
for tag in meta_tags:
    if tag["name"].endswith(":title"):
        title = tag["content"]
        break
# Collect every <div class="page"> element
pages = soup.find_all("div", class_=lambda x: x in ["page"])
print("Pages", pages)
However, the blocks inside the parsed document are missing some content:
I wasn’t able to debug it completely, but I believe the problem lies within the parse function. I’m not certain whether this is a bug or whether BeautifulSoup is misinterpreting USD 700 as a header when it clearly isn’t one. The main problem is that, in this example, the dropped text is a fairly important title and does not resemble a header at all.
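To narrow down where the text gets lost, one option is to compare the text of the pages against the text of the parsed blocks and list whatever is absent from the blocks. The following is only a minimal sketch under assumptions: `find_missing_lines`, `page_text`, and `block_texts` are hypothetical names and not part of any library, and the sample strings stand in for the real document.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()


def find_missing_lines(page_text: str, block_texts: list[str]) -> list[str]:
    """Return the non-empty lines of page_text that appear in none of the parsed blocks."""
    parsed = normalize(" ".join(block_texts))
    missing = []
    for line in page_text.splitlines():
        candidate = normalize(line)
        # Plain substring check: enough to spot text that was dropped entirely
        if candidate and candidate not in parsed:
            missing.append(candidate)
    return missing


# Hypothetical data standing in for the real page text and parsed blocks
page_text = "Invoice summary\nUSD 700\nPayment due in 30 days"
block_texts = ["Invoice summary", "Payment due in 30 days"]
print(find_missing_lines(page_text, block_texts))  # ['USD 700']
```

With the code above, `page_text` could be built from the `pages` list, for example with `"\n".join(p.get_text("\n") for p in pages)`, while `block_texts` would come from however the parsed document exposes its blocks. Because the check is a substring match, a very short line can be reported as present when it only occurs inside a longer string.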
Any help is appreciated.
Example pdf:
Here is the PDF that has been causing me problems. It's not complete, for privacy reasons, but it's the smallest example I could find that still reproduces the problem. If I edit it even slightly, for example by adding text next to the USD, the problem goes away.
problematic-pdf.pdf