Closed by ChristophHanck 6 years ago
It might be platform-specific. I couldn't replicate the problem on Linux: all German characters were rendered correctly. The underlying libraries, namely PDFBox and Tabula, have also been upgraded, so that might have resolved the issue. If you could report whether the problem persists on the newer version, that would be great.
I aim to extract this table: https://www.dropbox.com/s/pqkbmiq4ulr5gkz/Spielestatistik%202017.pdf?dl=0 Sorry for bothering you with this specific file, but since the issue may lie with specific encodings, I could not quickly come up with a more obvious public reproducible example.
Running
leads to issues with German umlauts (ä, ö, ü) as well as the double s (ß).
The file seems to use an Identity-H encoding, which, according to a Google search, might be the culprit. I am still submitting an issue because
does work, suggesting there could be a way to also handle such cases in the approach of pdftools.
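For anyone who wants to check whether a given PDF is affected, here is a minimal sketch (in Python, standard library only) that scans the raw bytes of a file for an `/Encoding /Identity-H` entry. The function names `declares_identity_h` and `file_declares_identity_h` are made up for illustration and are not part of any package mentioned here. It is a rough heuristic under one assumption: that the font dictionaries are stored uncompressed. If they sit inside compressed object streams, the entry will not be visible this way, so a negative result is inconclusive.

```python
import re

# Matches an /Encoding entry naming Identity-H, with or without whitespace
# between the key and the name (both are valid PDF syntax).
IDENTITY_H = re.compile(rb"/Encoding\s*/Identity-H\b")


def declares_identity_h(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain an uncompressed
    '/Encoding /Identity-H' entry (hypothetical helper, heuristic only)."""
    return IDENTITY_H.search(pdf_bytes) is not None


def file_declares_identity_h(path: str) -> bool:
    """Read a PDF from disk and apply the byte-level check above."""
    with open(path, "rb") as fh:
        return declares_identity_h(fh.read())
```

Calling `file_declares_identity_h` on a local copy of the PDF would return `True` if such an entry is found. Even then, this only shows that the encoding is present, not that it is what causes the garbled characters.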