Open P0L3 opened 7 months ago
I can reproduce this with s44168-023-00054-5.pdf:
PYTHONPATH=. python tools/pdf2txt.py ~/Downloads/s44168-023-00054-5.pdf -p 1
...
describe the key findings which are based on an exhaustive search of media articles and social media postings. We find that almost
...
I'm not sure whether this is a bug. By default, pdf2txt.py outputs a string of UTF-8 characters, and the "ﬁ" ligature (U+FB01) is a perfectly valid Unicode character. However, the output is the same with --codec ascii, so that might be considered a bug.
With Python it is easy to normalize these to normal characters.
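A minimal sketch of that normalization, using only the standard library. The helper name `expand_ligatures` and the sample string are made up for illustration; they are not part of pdfminer.six.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters (e.g. U+FB01) into plain letters.

    NFKC applies Unicode compatibility decomposition, so the ligature
    U+FB01 becomes the two letters "f" + "i".
    """
    return unicodedata.normalize("NFKC", text)


# Hypothetical sample containing the ligatures U+FB01, U+FB03, U+FB00.
sample = "We \ufb01nd an e\ufb03cient, sti\ufb00 solution"
print(expand_ligatures(sample))
```

Note that NFKC rewrites every compatibility character, not only ligatures (for example superscript digits and full-width forms), so it may change more than intended on some documents.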
Since this is the expected output for --codec utf-8, and it's easy to fix with Python, I'm unsure whether this should be changed. Leaving the issue open to collect more opinions.
fi problem
The bug occurs when letter sequences such as "fi", "ffi", "fl", or "ff" are present in the text, e.g. "efficient", "final", "stiff". Example with the word "find":
This PDF file (downloaded pdf) was processed with the extract_text_to_fp() function with default parameters. I suggest detecting such symbols, as listed in ligatures_list.txt, and heuristically assigning them the font style of their neighbours. The issue can be dealt with later in preprocessing with other libraries, but that opens a new set of problems!
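As a rough sketch of the detection step suggested above: the contents of ligatures_list.txt are not shown here, so this assumes the Latin ligatures in the Unicode range U+FB00 to U+FB06. The name `find_ligatures` is hypothetical, and the font-style heuristic is not covered.

```python
import re

# Assumption: the symbols of interest are the Latin ligatures U+FB00..U+FB06
# (ff, fi, fl, ffi, ffl, long s-t, st) from Alphabetic Presentation Forms.
LIGATURE_RE = re.compile("[\ufb00-\ufb06]")


def find_ligatures(text: str) -> list[tuple[int, str]]:
    """Return (index, character) for every ligature found in the text."""
    return [(m.start(), m.group()) for m in LIGATURE_RE.finditer(text)]


# Hypothetical extracted text containing two ligatures.
for index, char in find_ligatures("e\ufb03cient \ufb01nd"):
    print(index, f"U+{ord(char):04X}")
```

Reporting positions rather than replacing in place leaves room for a later step to look at the neighbouring characters.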