smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Ligatures [coefcient should be coefficient] #136

Open MonteShaffer opened 7 years ago

MonteShaffer commented 7 years ago

I gave your system a test run, and noticed this...

If I select the text within the PDF (Adobe Acrobat) and copy/paste, it performs as expected. When I use your parser, it does not.

tbolognese commented 7 years ago

I have the same problem in a PDF I am working with. The word "first" is output as "rst": the "fi" is being removed. In some cases the word is capitalized ("First"), in others it is lowercase ("first"). Both cases result in "rst" being output.

MonteShaffer commented 7 years ago

Here is an image showing the issue with the word efficient.

https://assets.mypatentideas.com/images/fiddle/efficient.png

The Unicode code point returned is \u0002, which is wrong: that is the control character [STX].

It should be \uFB01, the "fi" ligature.


MonteShaffer commented 7 years ago

Similarly, [ETX] (\u0003) should be the "fl" ligature, \uFB02.
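
Until the parser handles this, the two substitutions reported above could be patched in a post-processing step on the extracted text. The sketch below is a workaround, not a fix in the library itself. It assumes the affected font always emits \u0002 for "fi" and \u0003 for "fl", which may not hold for other PDFs. The names `LIGATURE_FIXES`, `LIGATURE_LETTERS`, and `repair_ligatures` are hypothetical and not part of PdfParser.

```python
# Hypothetical post-processing workaround for already-extracted text.
# Assumption: in the affected PDF, [STX] (U+0002) stands for the "fi"
# ligature and [ETX] (U+0003) stands for the "fl" ligature.
LIGATURE_FIXES = {
    "\u0002": "\uFB01",  # [STX] -> LATIN SMALL LIGATURE FI
    "\u0003": "\uFB02",  # [ETX] -> LATIN SMALL LIGATURE FL
}

# Optional second step: expand the ligature code points to plain letters,
# which is usually what you want for searching or comparing words.
LIGATURE_LETTERS = {
    "\uFB01": "fi",
    "\uFB02": "fl",
}


def repair_ligatures(text: str, expand: bool = True) -> str:
    """Replace the stray control characters with the intended ligatures.

    With expand=True the ligatures are further replaced by their plain
    letters ("fi", "fl"); with expand=False the ligature code points
    U+FB01 and U+FB02 are kept.
    """
    fixed = text.translate(str.maketrans(LIGATURE_FIXES))
    if expand:
        fixed = fixed.translate(str.maketrans(LIGATURE_LETTERS))
    return fixed


if __name__ == "__main__":
    print(repair_ligatures("coef\u0002cient"))
    print(repair_ligatures("\u0002rst"))
```

Because the mapping is hard-coded, it would silently corrupt any document in which \u0002 or \u0003 mean something else, so it is only reasonable for PDFs already known to show this symptom.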