hrbrmstr / docxtractr

:scissors: Extract Tables from Microsoft Word Documents with R

error when read_docx has url argument #10

Closed: markdly closed this issue 6 years ago

markdly commented 6 years ago

Thanks for making this package available - it's working great for me when I read existing local files. However, I'm currently encountering an issue when read_docx is given a URL argument. Minimal reprex:

library(docxtractr)
#> Warning: package 'docxtractr' was built under R version 3.4.3
read_docx("http://rud.is/dl/1.DOCX")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.

It looks like the call to download.file is causing this issue, since downloading the file with the default mode and then reading it reproduces the same error:

download.file("http://rud.is/dl/1.DOCX", "temp.docx")
read_docx("temp.docx")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.

To work around this, I can use mode = "wb":

download.file("http://rud.is/dl/1.DOCX", "wb.docx", mode = "wb")
read_docx("wb.docx")
#> Word document [wb.docx]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document
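
The mode = "wb" workaround above can be wrapped in a small helper so the binary mode is never forgotten. This is only a minimal sketch: download_docx() is a hypothetical name and is not part of docxtractr, and the tempfile destination is an assumption made for illustration.

```r
# Hypothetical helper (not part of docxtractr): fetch a .docx in binary mode.
download_docx <- function(url, dest = tempfile(fileext = ".docx")) {
  # mode = "wb" writes the bytes unchanged. A .docx is a zip archive, so a
  # text-mode write (the failing case in the reprex above) can leave a file
  # that unzip() cannot read.
  status <- utils::download.file(url, dest, mode = "wb", quiet = TRUE)
  if (status != 0) {
    stop("download failed with status ", status)
  }
  dest
}
```

With docxtractr loaded, usage would then look like read_docx(download_docx("http://rud.is/dl/1.DOCX")).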

An alternative workaround is to use the httr package:

library(httr)
#> Warning: package 'httr' was built under R version 3.4.3
r <- GET("http://rud.is/dl/1.DOCX")
bin <- content(r, "raw")
writeBin(bin, "myfile.docx")

read_docx("myfile.docx")
#> Word document [myfile.docx]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document
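
A variant of the httr workaround above writes the response straight to disk and fails early on an HTTP error. This is a sketch under assumptions, not the code used inside docxtractr: fetch_docx_httr() is a hypothetical name, and it relies only on httr::GET(), httr::write_disk() and httr::stop_for_status().

```r
# Hypothetical helper (not part of docxtractr): download via httr to a file.
fetch_docx_httr <- function(url, dest = tempfile(fileext = ".docx")) {
  # write_disk() saves the raw response body to dest, so no text-mode
  # conversion is applied to the zip-based .docx content.
  resp <- httr::GET(url, httr::write_disk(dest, overwrite = TRUE))
  # Turn HTTP errors (4xx/5xx) into an R error instead of saving an error page.
  httr::stop_for_status(resp)
  dest
}
```

With docxtractr loaded, usage would look like read_docx(fetch_docx_httr("http://rud.is/dl/1.DOCX")).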

I thought I should raise this in case any other users have the same problem...

hrbrmstr commented 6 years ago

As noted in the PR note, #ty for the issue filing! Now that there's better support for proxies under Windows for curl (and, hence, httr), I agree that it's a better way to go.

hrbrmstr commented 6 years ago

I just pushed up a change which swaps in httr ops for download.file(). Pls give it a go when you get a chance.

markdly commented 6 years ago

Looking good to me now!

# devtools::install_github("hrbrmstr/docxtractr")
library(docxtractr)
read_docx("http://rud.is/dl/1.DOCX")
#> Word document [http://rud.is/dl/1.DOCX]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document