ChrisMuir closed 6 years ago
Just adding a small note to this: I tried this in R on Linux (Ubuntu) and got the same answer: empty strings.
@soodoku pointed out to me that the PDF URL from pibuzz.com is now dead, so here's the same info for the document from https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf:
identical(
pdftools::pdf_text("https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf"),
rep("", 1057)
)
#> TRUE
Output from pdf_info:
pdftools::pdf_info("https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf")
$version
[1] "1.4"
$pages
[1] 1057
$encrypted
[1] FALSE
$linearized
[1] TRUE
$keys
$keys$Producer
[1] "PDF Engine win32 - (10.2)"
$created
[1] "2018-01-20 04:21:34 CST"
$modified
[1] "2106-02-07 00:28:15 CST"
$metadata
[1] ""
$locked
[1] FALSE
$attachments
[1] FALSE
$layout
[1] "no_layout"
Can you test if this is fixed in the devel version?
devtools::install_github("ropensci/pdftools")
Ahh, cool, on Windows the jacked-up PDFs work great with the dev version (both from file and from URL). pdf_text is yielding the same number of pages as pdftotext.
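For anyone wanting to repeat that page-count comparison, here is a rough sketch. The helper names `pdftotext_lines` and `count_pdftotext_pages` are made up for illustration, and the counting assumes that `pdftotext` ends each page with a form feed character, which may not hold for every build or if page breaks are disabled.

```r
# Hypothetical helper: run the pdftotext command line tool on a local file
# and capture its output ("-" sends the text to stdout).
# Assumes pdftotext is installed and on the PATH.
pdftotext_lines <- function(path) {
  system2("pdftotext", c("-layout", shQuote(path), "-"), stdout = TRUE)
}

# Hypothetical helper: count pages in captured pdftotext output,
# assuming each page is terminated by a form feed ("\f").
count_pdftotext_pages <- function(lines) {
  txt <- paste(lines, collapse = "\n")
  lengths(regmatches(txt, gregexpr("\f", txt, fixed = TRUE)))
}

# Example use (not run here; needs a local PDF and the pdftools package):
# path <- "workforce.pdf"
# count_pdftotext_pages(pdftotext_lines(path)) == length(pdftools::pdf_text(path))
```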
I don't have my Mac atm, but will try to test there soon as well.
Thanks Jeroen!
Hi Jeroen,
I'm having an issue in which calling pdftools::pdf_text on a specific set of files returns nothing but a single empty string ("") per page. I'm able to read in other PDF documents without issue. What makes this especially weird is that as of a week ago, there were no issues while working with these files on my Mac, and then at some point about a week ago this issue started and is persisting on both Mac and PC. There have been no updates to R, pdftools, or any other software I can think of that would cause this on both machines.
The files are all similar; they're public data sets of Idaho payroll expenses. Here they are:
https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf
http://mediad.publicbroadcasting.net/p/kisu/files/workforce.pdf
https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf
For example, the first file contains 1012 pages:
Downloading first and then reading in the local file gives the same result.
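A minimal sketch of that download-then-read check, for anyone trying to reproduce this. The helper names `read_pdf_locally` and `all_pages_empty` are hypothetical (not part of pdftools), and treating whitespace-only pages as empty is an assumption on top of the exact `""` comparison.

```r
# Hypothetical helper: TRUE when every page string is empty
# (or, as an added assumption, whitespace only).
all_pages_empty <- function(pages) {
  length(pages) > 0 && all(!nzchar(trimws(pages)))
}

# Hypothetical helper: download a PDF to a temp file, then extract
# one string per page with pdftools::pdf_text.
read_pdf_locally <- function(url) {
  tmp <- tempfile(fileext = ".pdf")
  on.exit(unlink(tmp))
  # mode = "wb" keeps the binary file intact on Windows
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  pdftools::pdf_text(tmp)
}

# Example use (not run here; needs network access and pdftools installed):
# pages <- read_pdf_locally("https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf")
# length(pages)
# all_pages_empty(pages)
```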
Using pdftotext at the command line works great (works using either flag, -layout or -table).
Here's the output of pdf_info:
And here's session info from my PC:
Let me know if you have questions or need more info from me. Also, thanks for all your work on this package and other tools!