ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

pdf_text returns empty strings for specific files #24

Closed by ChrisMuir 6 years ago

ChrisMuir commented 6 years ago

Hi Jeroen,

I'm having an issue where calling pdftools::pdf_text on a specific set of files returns nothing but a single empty string ("") per page. I'm able to read other PDF documents without issue. What makes this especially weird is that until about a week ago these files worked fine on my Mac; then at some point the issue started, and it now persists on both Mac and PC. I haven't updated R, pdftools, or any other software I can think of that would cause this on both machines.

The files are all similar: they're public data sets of Idaho payroll expenses. Here they are:
https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf
http://mediad.publicbroadcasting.net/p/kisu/files/workforce.pdf
https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf

For example, the first file contains 1012 pages:

identical(
  pdftools::pdf_text("https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"), 
  rep("", 1012)
)
#> TRUE

Downloading first and then reading in the local file gives the same result.
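
A minimal sketch of that download-then-read check, for anyone who wants to reproduce it. The helper names `all_pages_empty` and `check_local_pdf` are made up for illustration and are not part of pdftools; it assumes pdftools is installed and the URL is still reachable.

```r
# Hypothetical helper: TRUE when every page came back as an empty
# (or whitespace-only) string. Returns FALSE for a zero-page result.
all_pages_empty <- function(pages) {
  length(pages) > 0 && all(!nzchar(trimws(pages)))
}

# Hypothetical helper: download the PDF once, then read the local copy.
# Requires the pdftools package and network access.
check_local_pdf <- function(url) {
  local_file <- tempfile(fileext = ".pdf")
  download.file(url, local_file, mode = "wb")  # binary mode matters on Windows
  pages <- pdftools::pdf_text(local_file)
  all_pages_empty(pages)
}
```

Calling `check_local_pdf()` on one of the URLs above should return TRUE while the bug is present, and FALSE once text extraction works.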

Using pdftotext at the command line works great (with either the -layout or the -table flag).
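
For a side-by-side comparison from within R, the command-line tool can be driven with `system2`. This is only a sketch: it assumes a `pdftotext` binary is on the PATH, `pdftotext_args` and `run_pdftotext` are hypothetical helper names, and only the `-layout` flag is used here.

```r
# Hypothetical helper: build the argument vector for pdftotext.
# shQuote protects paths that contain spaces.
pdftotext_args <- function(pdf_file, out_file, layout = TRUE) {
  c(if (layout) "-layout", shQuote(pdf_file), shQuote(out_file))
}

# Hypothetical helper: run pdftotext on a local PDF and return the
# extracted text as a character vector of lines.
run_pdftotext <- function(pdf_file, layout = TRUE) {
  exe <- Sys.which("pdftotext")
  if (!nzchar(exe)) stop("pdftotext not found on PATH")
  out_file <- tempfile(fileext = ".txt")
  status <- system2(exe, pdftotext_args(pdf_file, out_file, layout))
  if (status != 0) stop("pdftotext exited with status ", status)
  readLines(out_file, warn = FALSE)
}
```

If `run_pdftotext()` returns text for a file where `pdftools::pdf_text()` returns only empty strings, that points at the R package rather than the PDF itself.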

Here's the output of pdf_info:

pdftools::pdf_info("https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf")
$version
[1] "1.4"

$pages
[1] 1012

$encrypted
[1] FALSE

$linearized
[1] TRUE

$keys
$keys$Producer
[1] "PDF Engine win32 - (10.1)"

$created
[1] "2013-11-19 04:39:33 CST"

$modified
[1] "2013-11-19 20:45:26 CST"

$metadata
[1] "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n<x:xmpmeta xmlns:x=\"adobe:ns:meta/\" x:xmptk=\"Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04        \">\n   <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n      <rdf:Description rdf:about=\"\"\n            xmlns:xmp=\"http://ns.adobe.com/xap/1.0/\">\n         <xmp:CreateDate>2013-11-19T03:39:33-07:00</xmp:CreateDate>\n         <xmp:ModifyDate>2013-11-19T18:45:26-08:00</xmp:ModifyDate>\n         <xmp:MetadataDate>2013-11-19T18:45:26-08:00</xmp:MetadataDate>\n      </rdf:Description>\n      <rdf:Description rdf:about=\"\"\n            xmlns:pdf=\"http://ns.adobe.com/pdf/1.3/\">\n         <pdf:Producer>PDF Engine win32 - (10.1)</pdf:Producer>\n      </rdf:Description>\n      <rdf:Description rdf:about=\"\"\n            xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n         <dc:format>application/pdf</dc:format>\n      </rdf:Description>\n      <rdf:Description rdf:about=\"\"\n            xmlns:xmpMM=\"http://ns.adobe.com/xap/1.0/mm/\">\n         <xmpMM:DocumentID>uuid:b05f6f5f-4a32-4606-8cc0-4de378ad1853</xmpMM:DocumentID>\n         <xmpMM:InstanceID>uuid:b06b16b2-4fbc-4c99-8435-0fc05830c526</xmpMM:InstanceID>\n      </rdf:Description>\n   </rdf:RDF>\n</x:xmpmeta>\n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n        
                                                                                            \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                                                                                                    \n                           \n<?xpacket end=\"w\"?>"

$locked
[1] FALSE

$attachments
[1] FALSE

$layout
[1] "no_layout"

And here's session info from my PC:

devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, mingw32             
 ui       RStudio (1.1.383)           
 language (EN)                        
 collate  English_United States.1252  
 tz       America/Chicago             
 date     2018-01-18                  

Packages --------------------------------------------------------------------------------------------------------------------------------------------------
 package   * version date       source        
 base      * 3.4.3   2017-12-06 local         
 compiler    3.4.3   2017-12-06 local         
 datasets  * 3.4.3   2017-12-06 local         
 devtools    1.13.4  2017-11-09 CRAN (R 3.4.2)
 digest      0.6.14  2018-01-14 CRAN (R 3.4.3)
 graphics  * 3.4.3   2017-12-06 local         
 grDevices * 3.4.3   2017-12-06 local         
 memoise     1.1.0   2017-04-21 CRAN (R 3.4.3)
 methods   * 3.4.3   2017-12-06 local         
 pdftools    1.5     2017-11-05 CRAN (R 3.4.2)
 Rcpp        0.12.14 2017-11-23 CRAN (R 3.4.3)
 stats     * 3.4.3   2017-12-06 local         
 tools       3.4.3   2017-12-06 local         
 utils     * 3.4.3   2017-12-06 local         
 withr       2.1.1   2017-12-19 CRAN (R 3.4.3)
 yaml        2.1.16  2017-12-12 CRAN (R 3.4.3)

Let me know if you have questions or need more info from me. Also, thanks for all your work on this package and other tools!

soodoku commented 6 years ago

Just adding a small note to this: I tried this in R on Linux (Ubuntu) and got the same result, empty strings.

ChrisMuir commented 6 years ago

@soodoku pointed out to me that the PDF URL from pibuzz.com is now dead, so here's the same info on the document from https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf:

identical(
  pdftools::pdf_text("https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf"), 
  rep("", 1057)
)
#> TRUE

Output from pdf_info:

pdftools::pdf_info("https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf")
$version
[1] "1.4"

$pages
[1] 1057

$encrypted
[1] FALSE

$linearized
[1] TRUE

$keys
$keys$Producer
[1] "PDF Engine win32 - (10.2)"

$created
[1] "2018-01-20 04:21:34 CST"

$modified
[1] "2106-02-07 00:28:15 CST"

$metadata
[1] ""

$locked
[1] FALSE

$attachments
[1] FALSE

$layout
[1] "no_layout"

jeroen commented 6 years ago

Can you test if this is fixed in the devel version?

devtools::install_github("ropensci/pdftools")

ChrisMuir commented 6 years ago

Ahh, cool, on Windows the problematic PDFs work great with the dev version (both from file and from URL). pdf_text now yields the same number of pages as pdftotext.

I don't have my Mac atm, but will try to test there soon as well.

Thanks Jeroen!