ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Invalid Font Weight/Illegal character #75

Open mfreyrie opened 4 years ago

mfreyrie commented 4 years ago

Hi, I've been struggling to import multiple PDFs. I need to build a corpus with the tm package, using pdftools to extract the text, but I keep getting the same error. Importing a single PDF works, however. This is what I do:

library(tm)
library(pdftools)

files <- list.files(pattern = "pdf$")
opinions <- lapply(files, pdf_text)

This is what I get:

PDF error: Invalid Font Weight
PDF error: Invalid Font Weight
PDF error: Invalid Font Weight
PDF error: Invalid Font Weight
[...]
PDF error (218): Illegal character <2f> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure.

My session info:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252 
[2] LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] broom_0.5.2     tm_0.7-7        NLP_0.2-0      
 [4] pdftools_2.3    tidytext_0.2.2  forcats_0.4.0  
 [7] stringr_1.4.0   dplyr_0.8.3     purrr_0.3.3    
[10] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3   
[13] ggplot2_3.2.1   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] qpdf_1.1          tidyselect_0.2.5  slam_0.1-46      
 [4] haven_2.2.0       lattice_0.20-38   colorspace_1.4-1 
 [7] vctrs_0.2.0       generics_0.0.2    SnowballC_0.6.0  
[10] rlang_0.4.2       pillar_1.4.3      glue_1.3.1       
[13] withr_2.1.2       DBI_1.1.0         dbplyr_1.4.2     
[16] modelr_0.1.6      readxl_1.3.1      lifecycle_0.1.0  
[19] munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0 
[22] rvest_0.3.5       parallel_3.6.1    tokenizers_0.2.1 
[25] Rcpp_1.0.3        scales_1.1.0      backports_1.1.5  
[28] jsonlite_1.6      fs_1.3.1          askpass_1.1      
[31] hms_0.5.3         stringi_1.4.3     grid_3.6.1       
[34] cli_2.0.1         tools_3.6.1       magrittr_1.5     
[37] lazyeval_0.2.2    janeaustenr_0.1.5 crayon_1.3.4     
[40] pkgconfig_2.0.3   zeallot_0.1.0     Matrix_1.2-17    
[43] xml2_1.2.2        reprex_0.3.0      lubridate_1.7.4  
[46] assertthat_0.2.1  httr_1.4.1        rstudioapi_0.11  
[49] R6_2.4.1          nlme_3.1-140      compiler_3.6.1   

This is an example of the PDFs I'm using: 12.pdf. The entire batch fails, and the files come from different sources.

jeroen commented 4 years ago

The example PDF you posted works fine for me; I don't get an error:

txt <- pdf_text("~/Downloads/12.pdf")
cat(txt)

Are you sure you aren't accidentally feeding it non-PDF files? What does your files variable contain?
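
One way to check this is sketched below. It makes two assumptions. First, the pattern "pdf$" matches any file name that merely ends in "pdf", so the sketch uses a stricter pattern. Second, it assumes a genuine PDF starts with the bytes "%PDF-" at offset 0; some valid files carry a few junk bytes before the header, so a FALSE here is a hint, not proof. The helpers looks_like_pdf() and safe_pdf_text() are hypothetical names for illustration and are not part of pdftools.

```r
# Return TRUE if the file begins with the "%PDF-" signature.
# Assumption: the header sits at byte offset 0.
looks_like_pdf <- function(path) {
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Hypothetical wrapper: report which file fails instead of aborting the batch.
safe_pdf_text <- function(path) {
  tryCatch(
    pdftools::pdf_text(path),
    error = function(e) {
      message("Failed to parse ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Only pick up names that end in ".pdf" (case-insensitive).
files <- list.files(pattern = "\\.pdf$", ignore.case = TRUE)

# Flag files whose content does not look like a PDF.
is_pdf <- vapply(files, looks_like_pdf, logical(1))
print(files[!is_pdf])

# Extract text from the remaining files; failures come back as NULL.
opinions <- lapply(files[is_pdf], safe_pdf_text)
```

Any file listed by print(), or reported by message(), is a candidate for the one that triggers "PDF parsing failure".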