Open MonteShaffer opened 7 years ago
I have the same problem in a PDF I am working with. The word "first" is output as "rst"; the "fi" is being removed. In some cases the word has a capital "F" ("First"), in others it is lowercase ("first"). Both cases result in "rst" being output.
Here is an image showing the issue with the word efficient.
https://assets.mypatentideas.com/images/fiddle/efficient.png
The Unicode value returned is \u0002, which is the control character [STX] and is incorrect.
It should be \uFB01 (the "fi" ligature).
Similarly, [ETX] (\u0003) should be "fl" ... \uFB02
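
As a stopgap until the parser handles this, the output could be post-processed. Below is a rough Python sketch, not the parser's own code. The function name `repair_ligatures` and the mapping table are my own assumptions, based only on the two cases I have seen in this one PDF (STX for "fi", ETX for "fl"). Another font could encode its ligatures differently, so treat the table as something to adjust per document.

```python
# Assumed mapping, taken only from the two cases reported above:
# STX (U+0002) stands in for the "fi" ligature, ETX (U+0003) for "fl".
TO_LIGATURE = str.maketrans({
    "\u0002": "\uFB01",  # LATIN SMALL LIGATURE FI
    "\u0003": "\uFB02",  # LATIN SMALL LIGATURE FL
})

# Same cases, but expanded to plain letters so that searching works.
TO_LETTERS = str.maketrans({
    "\u0002": "fi",
    "\u0003": "fl",
})


def repair_ligatures(text: str, expand: bool = True) -> str:
    """Replace the stray control characters in parser output.

    With expand=True the result uses plain "fi"/"fl"; with expand=False
    it uses the ligature code points U+FB01/U+FB02 instead.
    """
    return text.translate(TO_LETTERS if expand else TO_LIGATURE)


if __name__ == "__main__":
    print(repair_ligatures("\u0002rst"))  # the "rst" case from above
```

Expanding to plain letters is probably the more useful default, since that appears to be what copy/paste from Acrobat gives.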
I gave your system a test run, and noticed this...
If I select the text within the PDF (Adobe Acrobat) and copy/paste, it performs as expected. When I use your parser, it does not.