ropensci / tabulapdf

Bindings for Tabula PDF Table Extractor Library
https://docs.ropensci.org/tabulapdf/
Apache License 2.0

identity-H encoding German letters #74

Closed ChristophHanck closed 6 years ago

ChristophHanck commented 6 years ago

I aim to extract this table: https://www.dropbox.com/s/pqkbmiq4ulr5gkz/Spielestatistik%202017.pdf?dl=0 Sorry for bothering you with this specific file, but since the issue may be tied to a specific encoding, I could not quickly come up with a simpler public reproducible example.

Running

library(tabulizer)
library(tidyverse)
setwd("...")
"Spielestatistik 2017.pdf" %>% tabulizer::extract_text() -> rawtxt

leads to issues with German umlauts (ä, ö, ü) as well as the sharp s (ß).

The file seems to use Identity-H encoding, which, according to a Google search, might be the culprit. I am still submitting an issue because

library(pdftools)
library(stringr)
library(dplyr)
library(tidyr)

setwd("...")
Spiele1516 <- pdf_text("Spielestatistik 2017.pdf")
S1516 <- read.delim(textConnection(Spiele1516), strip.white = TRUE)

does work, suggesting that such cases could also be handled the way pdftools handles them.
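To make the difference between the two extractions concrete, a small base-R check can count how many German special characters survive in each result. This is only a sketch: `count_german_chars` is a hypothetical helper, not part of tabulizer or pdftools, and the sample strings are made up rather than taken from the PDF.

```r
# Hypothetical helper: count occurrences of ä, ö, ü and ß in extracted text.
count_german_chars <- function(txt) {
  chars <- c("\u00e4", "\u00f6", "\u00fc", "\u00df")  # ä, ö, ü, ß
  # Collapse pages/lines into one UTF-8 string before searching.
  txt <- enc2utf8(paste(txt, collapse = "\n"))
  vapply(chars, function(ch) {
    # gregexpr() returns -1 for "no match"; regmatches() turns that into
    # an empty vector, so lengths() gives 0 in that case.
    lengths(regmatches(txt, gregexpr(ch, txt, fixed = TRUE)))
  }, integer(1))
}

# Made-up sample strings standing in for a correct and a damaged extraction.
good <- "Spielst\u00e4tte M\u00fcnchen, Gr\u00f6\u00dfe"
bad  <- "Spielst?tte M?nchen, Gr??e"

print(count_german_chars(good))
print(count_german_chars(bad))
```

With the objects from the snippets above, one would call `count_german_chars(rawtxt)` and `count_german_chars(Spiele1516)` and compare the counts; a large gap would point at the characters lost during extraction.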

tpaskhalis commented 6 years ago

It might be platform-specific. I couldn't replicate the problem on Linux: all German characters were rendered correctly. The underlying libraries, PDFBox and Tabula, have also been upgraded, so that might have resolved the issue. If you could report whether the problem persists with the newer version, that would be great.
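If the problem does turn out to be platform-specific, it may help to attach a few encoding-related details of the R session to the report. The following is a minimal sketch using base R only; `encoding_report` is a hypothetical helper name, and the sample string is a placeholder for the text actually returned by the extractor.

```r
# Hypothetical helper: collect platform details relevant to text encoding.
encoding_report <- function(txt = character()) {
  list(
    os_type           = .Platform$OS.type,              # "unix" or "windows"
    locale            = Sys.getlocale("LC_CTYPE"),      # character-handling locale
    utf8_locale       = isTRUE(l10n_info()[["UTF-8"]]), # is the session locale UTF-8?
    declared_encoding = unique(Encoding(txt)),          # encoding marks on the strings
    valid_utf8        = all(validUTF8(txt))             # are the bytes valid UTF-8?
  )
}

# Placeholder input; in practice pass the extracted text instead.
str(encoding_report("M\u00fcnchen"))
```

Comparing this output between a machine where the characters render correctly and one where they do not would show whether the locale differs between the two.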