ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

pdftools::pdf_text not recognizing all spaces #100

Open JackoBill opened 2 years ago

JackoBill commented 2 years ago

I'm trying to extract biathlon results from PDF files, but a few of them have proven difficult: some spaces seem to disappear in the extracted text. Here's an example:

# install.packages("pdftools")
library(pdftools)

# Extract the text of every page; pdf_text() returns one string per page
test <- pdf_text("https://ibu.blob.core.windows.net/docs/2021/BT/SWRL/CP01/SMIN/C77A_v1.pdf")
# Print the first page
test[[1]]

Here's a piece of the original PDF:

[Image: stack_ibu]

For example, the value marked in green is extracted as "70 0", as it should be, but the one marked in red comes out as "70". Both contain a space in the PDF, which can be verified by copying and pasting them into a text editor. At a quick look, the problem seems to occur when the first number (the shooting-time rank) has only one digit.