lublak / pdfdataextract

Extract data from a pdf with pure javascript
MIT License

Handling of ligatures #5

Status: Closed (florianbepunkt closed this issue 3 years ago)

florianbepunkt commented 3 years ago

Describe the bug: Ligatures are parsed incorrectly. I can't type ligatures in Markdown, but a PDF line containing the code &LBFY68Dd6fb884c (where fb is rendered as a ligature) is parsed as &LBFY68Dd6\"884c

[Screenshot: Bildschirmfoto 2021-07-31 um 11 44 24]

Expected behavior: Ligatures should be parsed into their component characters, so the fb ligature should be parsed as fb.

Additional context: Not sure whether all fonts encode ligatures the same way.
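
For readers who hit a similar problem: the sketch below shows one way to expand ligatures after text extraction, assuming the extracted string contains the standard Unicode ligature code points (U+FB00 to U+FB06: ff, fi, fl, ffi, ffl, and two st forms). The helper name expandLigatures is hypothetical and is not part of pdfdataextract. Unicode defines no code point for an fb ligature, so a font that draws one uses a private glyph, and what comes out of the PDF then depends on that font's own encoding. This approach would probably not have fixed the case reported here.

```javascript
// Hypothetical post-processing helper; NOT part of the pdfdataextract API.
// Assumes the extracted text already contains Unicode ligature code points.
function expandLigatures(text) {
  // Match only the Latin ligatures U+FB00..U+FB06 and replace each with its
  // Unicode compatibility decomposition (for example U+FB01 becomes "fi").
  // All other characters are left untouched.
  return text.replace(/[\uFB00-\uFB06]/g, (ch) => ch.normalize('NFKD'));
}

// Usage: a code containing the "fi" ligature (U+FB01).
console.log(expandLigatures('Dd6\uFB01884c'));
```

Restricting the replacement to the ligature range, rather than normalizing the whole string, avoids changing unrelated characters such as superscripts or full-width forms.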

lublak commented 3 years ago

@florianbepunkt Hi, and thank you for your bug report. Do you happen to have a PDF file that you can share with me?

lublak commented 3 years ago

@florianbepunkt Is this issue still relevant for you? At the moment it is difficult for me to understand the problem. So, again: do you have a PDF file for me, or can you tell me how I can reproduce the problem myself?

florianbepunkt commented 3 years ago

@lublak Sorry for the late reply. I investigated this further today. It seems that the problem is either with my PDF or the embedded font encoding, not with this library per se. Sorry for the fuss.