pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Detection of ligatures - fi problem #942

Open P0L3 opened 7 months ago

P0L3 commented 7 months ago

fi problem

The bug occurs when letter sequences such as "fi", "ffi", "fl", or "ff" are present in the text, e.g. in "efficient", "final", or "stiff". Example with "fi" in the word "findings":

# Previous span
<span style="font-family: AdvOT46dcae81; font-size:8px">Some climate groups have employed disruptive but non-violent tactics to draw public attention to the slow progress in reducing
<br/>greenhouse gas emissions. In 2022, a new disruptive tactic emerged: vandalizing art and museums. In this Brief Communication, we
<br/>describe the key </span>

# Current span
<span style="font-family: fb; font-size:8px">ﬁ</span>

# Next span
<span style="font-family: AdvOT46dcae81; font-size:8px">ndings which are based on an exhaustive search of media articles and social media postings. We </span>

This PDF file (downloaded pdf) was processed with the extract_text_to_fp() function using default parameters.

I suggest detecting the symbols listed in ligatures_list.txt and heuristically assigning them the font style of their neighbours. The issue can be handled later in preprocessing with other libraries, but that opens a new set of problems.
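A minimal sketch of what such a heuristic could look like, applied after extraction rather than inside pdfminer.six. It assumes the spans have already been parsed into (font name, text) tuples in reading order. The function name merge_ligature_spans and the LIGATURES set are hypothetical; the set is only a stand-in for ligatures_list.txt, whose contents are not shown in this thread.

```python
# Hypothetical stand-in for ligatures_list.txt: the Latin ligatures
# ff, fi, fl, ffi, ffl (U+FB00 to U+FB04).
LIGATURES = {"\ufb00", "\ufb01", "\ufb02", "\ufb03", "\ufb04"}


def merge_ligature_spans(spans):
    """Fold ligature-only spans into the preceding span.

    `spans` is a list of (font_name, text) tuples in reading order.
    A span made up solely of ligature characters takes the font of the
    span before it; adjacent spans with the same font are then joined.
    A ligature-only span with no predecessor is kept unchanged.
    """
    merged = []
    for font, text in spans:
        only_ligatures = bool(text) and all(ch in LIGATURES for ch in text)
        if only_ligatures and merged:
            # Inherit the neighbour's font by appending to the previous span.
            prev_font, prev_text = merged[-1]
            merged[-1] = (prev_font, prev_text + text)
        elif merged and merged[-1][0] == font:
            # Same font as the previous span: join them back together.
            merged[-1] = (font, merged[-1][1] + text)
        else:
            merged.append((font, text))
    return merged


spans = [
    ("AdvOT46dcae81", "describe the key "),
    ("fb", "\ufb01"),
    ("AdvOT46dcae81", "ndings which are based on"),
]
print(merge_ligature_spans(spans))
```

With the three spans from the example above, this yields a single span in the AdvOT46dcae81 font. The ligature character itself is left in the text; expanding it to plain letters is a separate step.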

pietermarsman commented 2 months ago

I can reproduce this with s44168-023-00054-5.pdf:

PYTHONPATH=. python tools/pdf2txt.py ~/Downloads/s44168-023-00054-5.pdf -p 1
...
describe the key ﬁndings which are based on an exhaustive search of media articles and social media postings. We ﬁnd that almost
...

I'm not sure whether this is a bug or not. By default, pdf2txt.py outputs a string of UTF-8 characters, and "ﬁ" (the single ligature character U+FB01) is a perfectly fine Unicode character. However, the output is the same with --codec ascii, so that might be considered a bug.

With Python it is easy to normalize these to normal characters.
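One way to do this, as a sketch using only the standard library: Unicode compatibility normalization (NFKC) replaces the ligature characters with their plain-letter equivalents. The helper name expand_ligatures is just for illustration.

```python
import unicodedata


def expand_ligatures(text):
    """Replace compatibility characters such as U+FB01 with plain letters."""
    return unicodedata.normalize("NFKC", text)


# \ufb01 is the "fi" ligature that appears in the extracted text.
sample = "describe the key \ufb01ndings ... We \ufb01nd that almost"
print(expand_ligatures(sample))
```

Note that NFKC is not limited to ligatures: it also rewrites other compatibility characters, for example superscript digits and non-breaking spaces, so it may change more of the text than intended.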

Since this is the expected output for --codec utf-8, and it's easy to fix with Python, I'm unsure whether this should be changed. Leaving the issue open to collect more opinions.