ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

filename encoding bug on Windows #69

Open msgoussi opened 5 years ago

msgoussi commented 5 years ago

I get an error when I use pdftools::pdf_combine:

Error in cpp_pdf_combine(input, output, password) : open D:\Databases_Files\IsDB_SS\Files\047. Côte d'Ivoire.pdf: No such file or directory

One of the file names is written as C\xf4te d'Ivoire.pdf. How can I solve this problem? Another package solved it by adding enc2native(path). Thanks.
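For context, the `\xf4` in the reported name is the single Latin-1 (Windows-1252) byte for "ô", whereas UTF-8 encodes the same character as two bytes. The following is a minimal base R sketch of how the two representations differ. It assumes nothing about how pdftools handles paths internally, and the variable names are only illustrative.

```r
# The file name from the report, written with the Latin-1 byte for "ô".
x <- "C\xf4te d'Ivoire.pdf"
Encoding(x) <- "latin1"   # declare how the bytes should be interpreted

# Re-encode the same text as UTF-8.
u <- enc2utf8(x)

# Same characters, different underlying bytes.
latin1_bytes <- charToRaw(x)
utf8_bytes   <- charToRaw(u)
print(latin1_bytes[1:4])
print(utf8_bytes[1:5])

# enc2native() converts to the encoding of the current locale, which is
# the conversion mentioned in the report.
n <- enc2native(x)
print(Encoding(c(x, u, n)))
```

A file lookup that compares raw bytes will only succeed if the bytes handed to the operating system match the encoding it expects, which is why the declared encoding of the path matters here.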

jeroen commented 5 years ago

Try using a plain ASCII filename without diacritics.
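One way to follow this suggestion without renaming the originals might be to copy them to temporary files with ASCII names first. This is a rough base R sketch: `ascii_copy()` is a hypothetical helper, not part of pdftools. It assumes that `file.copy()` can read the non-ASCII path and that the session's temporary directory is itself free of non-ASCII characters.

```r
# Hypothetical helper (not part of pdftools): copy files to temporary
# paths whose file names contain only ASCII characters.
ascii_copy <- function(paths) {
  not_found <- !file.exists(paths)
  if (any(not_found)) {
    stop("file not found: ", paste(paths[not_found], collapse = ", "))
  }
  # One temporary name per input, each starting with "input_001_", "input_002_", ...
  targets <- tempfile(
    pattern = sprintf("input_%03d_", seq_along(paths)),
    fileext = ".pdf"
  )
  copied <- file.copy(paths, targets, overwrite = TRUE)
  if (!all(copied)) {
    stop("could not copy: ", paste(paths[!copied], collapse = ", "))
  }
  targets
}

# Intended use (requires pdftools and the input files; not run here):
# inputs <- ascii_copy(c("C\xf4te.pdf", "subset.pdf"))
# pdftools::pdf_combine(inputs, output = "joined.pdf")
```

The copies keep the input order, so the pages of the combined document should come out in the same sequence as with the original paths.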

msgoussi commented 5 years ago

You solved this issue in magick: https://github.com/ropensci/magick/issues/168

jeroen commented 5 years ago

Yes, if you include your sessionInfo() and an example file plus code, I will look into it the next time I work on this package.

msgoussi commented 5 years ago

library(pdftools)

# Extract some pages
pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf', pages = 1:3, output = "subset.pdf")

# Should say 3
pdf_length("subset.pdf")

# Generate another pdf
pdf("C\xf4te.pdf")
plot(mtcars)
dev.off()

# Combine them with the other one
pdf_combine(c("C\xf4te.pdf", "subset.pdf"), output = "joined.pdf")

# Should say 4
pdf_length("joined.pdf")

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] tools tcltk stats4 grid parallel stats graphics grDevices datasets
[10] utils methods base

other attached packages:
[1] zip_2.0.3 yaml_2.2.0 XML_3.98-1.20
[4] WriteXLS_5.0.0 wkb_0.3-0 viridis_0.5.1
[7] viridisLite_0.3.0 varhandle_2.0.3 utf8_1.1.4
[10] Unicode_12.0.0-1 toOrdinal_1.1-0.0 tm.plugin.webmining_1.3
[13] tm_0.7-6 NLP_0.2-0 forcats_0.4.0
[16] dplyr_0.8.3 purrr_0.3.2 readr_1.3.1
[19] tidyr_0.8.3 tibble_2.1.3 tidyverse_1.2.1
[22] tcltk2_1.2-11 tau_0.0-21 tabulizer_0.2.2
[25] svMisc_1.1.0 svDialogs_1.0.0 stringi_1.4.3
[28] startup_0.12.0 staplr_2.9.0 slackr_1.4.2
[31] shinyWidgets_0.4.8 shinyTree_0.2.7 shinythemes_1.1.2
[34] RSelenium_1.7.5 rsvg_1.3 rvest_0.3.4
[37] xml2_1.2.2 rstudioapi_0.10 RSQLite_2.1.2
[40] Rserve_1.7-3.1 rsconnect_0.8.15 rvg_0.2.1
[43] RMySQL_0.10.17 rlist_0.4.6.1 rlang_0.4.0
[46] rJava_0.9-11 Rilostat_1.0.1 reshape2_1.4.3
[49] reprex_0.3.0 rebus_0.1-3 readxl_1.3.1
[52] readstata13_0.9.2 rdrop2_0.8.1 RCurl_1.95-4.12
[55] bitops_1.0-6 R2wd_1.5 R.utils_2.9.0
[58] R.oo_1.22.0 R.methodsS3_1.7.1 purrrlyr_0.0.5
[61] qdap_2.3.2 RColorBrewer_1.1-2 qdapTools_1.3.3
[64] qdapRegex_0.7.2 qdapDictionaries_1.0.7 PKI_0.1-5.1
[67] base64enc_0.1-3 packrat_0.5.0 plyr_1.8.4
[70] psych_1.8.12 profvis_0.3.6 pryr_0.1.4
[73] progress_1.2.2 pivottabler_1.2.1 pdftools_2.2
[76] pbapply_1.4-1 party_1.3-3 strucchange_1.5-1
[79] sandwich_2.5-1 zoo_1.8-6 modeltools_0.2-22
[82] mvtnorm_1.0-11 packcircles_0.3.3 openxlsx_4.1.0.1
[85] officer_0.3.5 miniUI_0.1.1.1 microbenchmark_1.4-6
[88] mailR_0.4.1 markdown_1.1 magrittr_1.5
[91] magick_2.2 lubridate_1.7.4 knitr_1.24
[94] jsonlite_1.6 installr_0.22.0 stringr_1.4.0
[97] iotools_0.2-5 httr_1.4.1 Hmisc_4.2-0
[100] ggplot2_3.2.1 Formula_1.2-3 survival_2.44-1.1
[103] lattice_0.20-38 gtools_3.8.1 googledrive_1.0.0
[106] githubinstall_0.2.2 ggvis_0.4.4 ggiraph_0.6.1
[109] gdata_2.18.0 formattable_0.2.1 ffbase_0.12.7
[112] ff_2.2-14 feather_0.3.3 esquisse_0.2.2
[115] easypackages_0.1.0 downloader_0.4 doSNOW_1.0.18
[118] snow_0.4-3 doParallel_1.0.15 iterators_1.0.12
[121] foreach_1.4.7 devtools_2.1.0 usethis_1.5.1
[124] devEMF_3.6-3 DBI_1.0.0 DataCombine_0.2.21
[127] curl_4.0 crayon_1.3.4 cowplot_1.0.0
[130] cellranger_1.1.0 bit64_0.9-7 bit_1.1-14
[133] benchmarkme_1.0.2 beepr_1.3 ARTIVA_1.2.3
[136] gplots_3.0.1.1 MASS_7.3-51.4 animation_2.6
[139] xmltools_1.0 r2excel_1.0.0 xlsx_0.6.1
[142] shiny.info_0.1 RDCOMClient_0.94-0 tclish_1.0.2
[145] directlabels_2018.05.22 flipAPI_0.1 flipChartBasics_2.0.1
[148] threejs_0.3.1 igraph_1.2.4.1 addinexamples_0.1.0
[151] rsdmx_0.5-13 rio_0.5.16 gmailr_1.0.0
[154] lobstr_1.1.1 jsonview_0.2.0 data.table_1.12.3
[157] ConvCalendar_1.2 archive_1.0.0

loaded via a namespace (and not attached):
[1] ps_1.3.0 rprojroot_1.3-2 nlme_3.1-141
[4] backports_1.1.4 extrafontdb_1.0 callr_3.3.1
[7] rebus.base_0.0-3 extrafont_0.17 glue_1.3.1
[10] rgeolocate_1.0.1 processx_3.4.1 haven_2.1.1
[13] tidyselect_0.2.5 flipTime_2.9.0 flipTransformations_1.6.9
[16] chron_2.3-54 xtable_1.8-4 evaluate_0.14
[19] gdtools_0.1.9 cli_1.1.0 sp_1.3-1
[22] rpart_4.1-15 fastmatch_1.1-0 wordcloud_2.6
[25] RJSONIO_1.3-1.2 wdman_0.2.4 shiny_1.3.2
[28] xfun_0.9 askpass_1.1 pkgbuild_1.0.5
[31] cluster_2.1.0 caTools_1.17.1.2 png_0.1-7
[34] xlsxjars_0.6.1 zeallot_0.1.0 withr_2.1.2
[37] slam_0.1-45 openNLP_0.2-6 pillar_1.4.2
[40] multcomp_1.4-10 fs_1.3.1 generics_0.0.2
[43] vctrs_0.2.0 qpdf_1.1 flipU_1.2.5
[46] foreign_0.8-72 munsell_0.5.0 compiler_3.6.1
[49] pkgload_1.0.2 httpuv_1.5.1 sessioninfo_1.1.1
[52] plotly_4.9.0 gridExtra_2.3 rebus.numbers_0.0-1
[55] later_0.8.0 sparkline_2.0 semver_0.2.0
[58] scales_1.0.0 hrbrthemes_0.6.0 lazyeval_0.2.2
[61] promises_1.0.1 latticeExtra_0.6-28 checkmate_1.9.4
[64] rmarkdown_1.15 plotrix_3.7-6 htmltools_0.3.6
[67] memoise_1.1.0 quadprog_1.5-7 digest_0.6.20
[70] assertthat_0.2.1 mime_0.7 Rttf2pt1_1.3.7
[73] remotes_2.1.0 blob_1.2.0 openNLPdata_1.5.3-4
[76] rhtmlMetro_0.1.1 splines_3.6.1 broom_0.5.2
[79] rebus.datetimes_0.0-1 hms_0.5.1 modelr_0.1.5
[82] colorspace_1.4-1 mnormt_1.5-5 libcoin_1.0-5
[85] reports_0.1.4 nnet_7.3-12 Rcpp_1.0.2
[88] coin_1.3-1 audio_0.1-6 svGUI_1.0.0
[91] R6_2.4.0 lifecycle_0.1.0 acepack_1.4.1
[94] formatR_1.7 testthat_2.2.1 benchmarkmeData_1.0.2
[97] venneuler_1.1-0 Matrix_1.2-17 tabulizerjars_1.0.1
[100] desc_1.2.0 TH.data_1.0-10 htmlwidgets_1.3
[103] boilerpipeR_1.3 crosstalk_1.0.0 openssl_1.4.1
[106] htmlTable_1.13.1 codetools_0.2-16 matrixStats_0.54.0
[109] binman_0.1.1 prettyunits_1.0.2 gtable_0.3.0
[112] git2r_0.26.1 KernSmooth_2.23-15 uuid_0.1-2
[115] ggthemes_4.2.0 DT_0.8 colorRamps_2.3
[118] gender_0.5.2 rebus.unicode_0.0-2 RGoogleAnalytics_0.1.6
[121] gargle_0.3.1 pkgconfig_2.0.2 flipFormat_1.3.3

msgoussi commented 4 years ago

Hi, any news on tackling this problem?

billy34 commented 3 years ago

I had a problem with filename encoding on Windows. pdf_length("file_with_diacritics.pdf") gave me an error (No such file or directory).

I tried pdf_length(enc2native("file_with_diacritics.pdf")), with no more luck.

I dug into the code and saw that this function is re-exported from qpdf, so I looked there and quickly found the culprit.

In pdf_length, a call is made to get_input, which in the end calls normalizePath. The problem is that normalizePath converts the path from the native encoding (latin1 for me) back to UTF-8, so the code still tries to open the file using a UTF-8 encoded name!

As a workaround, I call the internal function directly, qpdf:::cpp_pdf_length(enc2native("file_with_diacritics.pdf"), ""), and it works!

pdf_length <- function(input, password = ""){
  input <- get_input(input)
  cpp_pdf_length(input, password)
}

get_input <- function(path){
  ...
  normalizePath(path, mustWork = TRUE)
}
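
The behaviour described above can be checked, and the workaround wrapped, roughly as follows. This is only a sketch: `path_encoding_report()` and `pdf_length_native()` are hypothetical names introduced here, and `cpp_pdf_length` is an unexported qpdf internal taken from the comment above, so it may change or disappear between qpdf versions. What the report shows depends on the platform and locale, so no expected output is given.

```r
# Hypothetical diagnostic: show the declared encoding and raw bytes of a
# path before and after the conversions discussed above.
path_encoding_report <- function(path) {
  variants <- list(
    original   = path,
    native     = enc2native(path),
    utf8       = enc2utf8(path),
    normalized = normalizePath(path, mustWork = FALSE)
  )
  data.frame(
    step = names(variants),
    encoding = vapply(variants, Encoding, character(1)),
    bytes = vapply(
      variants,
      function(p) paste(charToRaw(p), collapse = " "),
      character(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

x <- "C\xf4te.pdf"
Encoding(x) <- "latin1"
report <- path_encoding_report(x)
print(report)

# Hypothetical wrapper around the workaround: skip get_input() and
# normalizePath(), and pass a natively encoded path to the internal
# qpdf function. Assumes qpdf is installed and still has this internal.
pdf_length_native <- function(path, password = "") {
  if (!requireNamespace("qpdf", quietly = TRUE)) {
    stop("the qpdf package is required")
  }
  cpp_pdf_length <- utils::getFromNamespace("cpp_pdf_length", "qpdf")
  cpp_pdf_length(enc2native(path.expand(path)), password)
}
```

If the `normalized` row differs in bytes from the `native` row on an affected Windows machine, that would be consistent with the explanation given in this comment. Because the wrapper bypasses `get_input()`, it skips whatever other input handling that function does, such as the URL input used in the earlier example.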