r-lib / zip

Platform independent zip compression via miniz
https://r-lib.github.io/zip/

zipping MS office docs #31

Closed davidgohel closed 5 years ago

davidgohel commented 5 years ago

Hi

With the zipr() function, I am unable to reproduce the zip files I used to produce with zip(). I use it to zip MS Office documents.

The following is producing a directory containing the office XML structure in dir_extract:

library(zip)

zip_list("test_zip.docx")

dir_extract <- tempfile()
unzip(zipfile = "test_zip.docx", exdir = dir_extract)

test_zip.docx

The following code will not produce a valid file (but the structure looks ok):

zipr(zipfile = "test_out_0.docx", files = list.files(dir_extract, full.names = TRUE), recurse = TRUE)
zip_list("test_out_0.docx")
                       filename compressed_size uncompressed_size           timestamp permissions
1                        _rels/               0                 0 2019-03-26 17:09:14         700
2                   _rels/.rels             233               590 1979-12-31 23:00:00         600
3           [Content_Types].xml             340              1312 1979-12-31 23:00:00         600
4                     docProps/               0                 0 2019-03-26 17:09:14         700
5              docProps/app.xml             357               709 1979-12-31 23:00:00         600
6             docProps/core.xml             359               749 1979-12-31 23:00:00         600
7                         word/               0                 0 2019-03-26 17:09:14         700
8                   word/_rels/               0                 0 2019-03-26 17:09:14         700
9  word/_rels/document.xml.rels             237               817 1979-12-31 23:00:00         600
10            word/document.xml             661              2562 1979-12-31 23:00:00         600
11           word/fontTable.xml             445              1419 1979-12-31 23:00:00         600
12            word/settings.xml             920              2604 1979-12-31 23:00:00         600
13              word/styles.xml            2563             28902 1979-12-31 23:00:00         600
14                  word/theme/               0                 0 2019-03-26 17:09:14         700
15        word/theme/theme1.xml            1703              8394 1979-12-31 23:00:00         600
16         word/webSettings.xml             288               655 1979-12-31 23:00:00         600

The following code will produce a valid file:

oldwd <- getwd()
setwd(dir_extract)
utils::zip(zipfile = file.path(oldwd, "test_out_1.docx"), 
           files = list.files(path = ".", full.names = TRUE, all.files = FALSE, recursive = FALSE) )
setwd(oldwd)

zip_list("test_out_1.docx")
                       filename compressed_size uncompressed_size           timestamp permissions
1                        _rels/               0                 0 2019-03-26 17:09:16         700
2                   _rels/.rels             233               590 1979-12-31 23:00:00         600
3           [Content_Types].xml             340              1312 1979-12-31 23:00:00         600
4                     docProps/               0                 0 2019-03-26 17:09:16         700
5              docProps/app.xml             357               709 1979-12-31 23:00:00         600
6             docProps/core.xml             359               749 1979-12-31 23:00:00         600
7                         word/               0                 0 2019-03-26 17:09:16         700
8            word/fontTable.xml             445              1419 1979-12-31 23:00:00         600
9             word/document.xml             661              2562 1979-12-31 23:00:00         600
10            word/settings.xml             920              2604 1979-12-31 23:00:00         600
11         word/webSettings.xml             288               655 1979-12-31 23:00:00         600
12              word/styles.xml            2557             28902 1979-12-31 23:00:00         600
13                  word/theme/               0                 0 2019-03-26 17:09:16         700
14        word/theme/theme1.xml            1703              8394 1979-12-31 23:00:00         600
15                  word/_rels/               0                 0 2019-03-26 17:09:16         700
16 word/_rels/document.xml.rels             237               817 1979-12-31 23:00:00         600

The following code was producing a valid file:

oldwd <- getwd()
setwd(dir_extract)
files <- list.files(all.files = TRUE, recursive = TRUE)
zip::zip(zipfile = file.path(oldwd, "test_out_2.docx"), 
           files = files, recurse = TRUE )
setwd(oldwd)

zip_list("test_out_2.docx")
                       filename compressed_size uncompressed_size           timestamp permissions
1                        _rels/               0                 0 2019-03-26 17:09:16         700
2                   _rels/.rels             233               590 1979-12-31 23:00:00         600
3           [Content_Types].xml             340              1312 1979-12-31 23:00:00         600
4                     docProps/               0                 0 2019-03-26 17:09:16         700
5              docProps/app.xml             357               709 1979-12-31 23:00:00         600
6             docProps/core.xml             359               749 1979-12-31 23:00:00         600
7                         word/               0                 0 2019-03-26 17:09:16         700
8            word/fontTable.xml             445              1419 1979-12-31 23:00:00         600
9             word/document.xml             661              2562 1979-12-31 23:00:00         600
10            word/settings.xml             920              2604 1979-12-31 23:00:00         600
11         word/webSettings.xml             288               655 1979-12-31 23:00:00         600
12              word/styles.xml            2557             28902 1979-12-31 23:00:00         600
13                  word/theme/               0                 0 2019-03-26 17:09:16         700
14        word/theme/theme1.xml            1703              8394 1979-12-31 23:00:00         600
15                  word/_rels/               0                 0 2019-03-26 17:09:16         700
16 word/_rels/document.xml.rels             237               817 1979-12-31 23:00:00         600

test_out_0.docx test_out_1.docx test_out_2.docx

Do you have an idea of what is wrong with my zipr usage?

KR David

gaborcsardi commented 5 years ago

I'll fix this, but until then you can use zip() and you can also suppress the deprecation message like this:

❯ zip::zip("/tmp/x.zip", "~/works/zip/DESCRIPTION")
Note: zip::zip() is deprecated, please use zip::zipr() instead

❯ unlink("/tmp/x.zip")
❯ withCallingHandlers(zip::zip("/tmp/x.zip", "~/works/zip/DESCRIPTION"), deprecated = function(e) NULL)

❯ zip::zip_list("/tmp/x.zip")
                 filename compressed_size uncompressed_size           timestamp permissions
1 ~/works/zip/DESCRIPTION             381               595 2019-03-11 17:43:16         644

gaborcsardi commented 5 years ago

@davidgohel What's your os? I cannot reproduce this on macOS with R 3.6.0.

davidgohel commented 5 years ago

I am on mac.

> sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 os       macOS Mojave 10.14.4        
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  fr_FR.UTF-8                 
 ctype    fr_FR.UTF-8                 
 tz       Europe/Paris                
 date     2019-05-13                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

gaborcsardi commented 5 years ago

How is the archive broken? Both zip::unzip() and utils::unzip() can uncompress test_out_0.docx for me, without any problems.

davidgohel commented 5 years ago

Issue is with function zip::zipr()

I just rewrote a simpler example and retested it: this is producing a corrupted document:

library(zip)

unlink("coco", force = TRUE, recursive = TRUE)
zip::unzip("~/Downloads/gabor.docx", exdir = "coco")

oldwd <- getwd()
setwd("coco")
zip::zipr(zipfile = file.path(oldwd, "coco.docx"), files = list.files())
setwd(oldwd)

gabor.docx

This is producing a valid document:

library(zip)

unlink("coco", force = TRUE, recursive = TRUE)
utils::unzip("~/Downloads/gabor.docx", exdir = "coco")

oldwd <- getwd()
setwd("coco")
utils::zip(zipfile = file.path(oldwd, "coco.docx"), files = list.files())
setwd(oldwd)

Let me know if you need something...

gaborcsardi commented 5 years ago

Hmmm. Pages behaves exactly the same for me on all three test_out files... complains about missing fonts, I can only cancel, and then it shows an empty document.

davidgohel commented 5 years ago

OK, I can confirm that. It's OK with Pages but not with MS Office Word.

gaborcsardi commented 5 years ago

I do see some differences between the files, these must cause the problems:

❯ zipinfo coco-old.docx
Archive:  coco-old.docx
Zip file size: 9819 bytes, number of entries: 16
drwxr-xr-x  3.0 unx        0 b- stor 19-May-13 14:05 _rels/
-rw-r--r--  3.0 unx      590 t- defX 19-May-13 14:05 _rels/.rels
-rw-r--r--  3.0 unx     1312 t- defX 19-May-13 14:05 [Content_Types].xml
drwxr-xr-x  3.0 unx        0 b- stor 19-May-13 14:05 docProps/
-rw-r--r--  3.0 unx      709 t- defX 19-May-13 14:05 docProps/app.xml
-rw-r--r--  3.0 unx      749 t- defX 19-May-13 14:05 docProps/core.xml
drwxr-xr-x  3.0 unx        0 b- stor 19-May-13 14:05 word/
-rw-r--r--  3.0 unx     1419 t- defX 19-May-13 14:05 word/fontTable.xml
-rw-r--r--  3.0 unx     2562 t- defX 19-May-13 14:05 word/document.xml
-rw-r--r--  3.0 unx     2604 t- defX 19-May-13 14:05 word/settings.xml
-rw-r--r--  3.0 unx      655 t- defX 19-May-13 14:05 word/webSettings.xml
-rw-r--r--  3.0 unx    28902 t- defX 19-May-13 14:05 word/styles.xml
drwxr-xr-x  3.0 unx        0 b- stor 19-May-13 14:05 word/theme/
-rw-r--r--  3.0 unx     8394 t- defX 19-May-13 14:05 word/theme/theme1.xml
drwxr-xr-x  3.0 unx        0 b- stor 19-May-13 14:05 word/_rels/
-rw-r--r--  3.0 unx      817 t- defX 19-May-13 14:05 word/_rels/document.xml.rels
16 files, 48713 bytes uncompressed, 8099 bytes compressed:  83.4%

❯ zipinfo coco.docx
Archive:  coco.docx
Zip file size: 10001 bytes, number of entries: 16
drwx------  2.3 unx        0 b- stor 19-May-13 14:06 _rels/
-rw-------  2.3 unx      590 bl defN 80-Jan-01 00:00 _rels/.rels
-rw-------  2.3 unx     1312 bl defN 80-Jan-01 00:00 [Content_Types].xml
drwx------  2.3 unx        0 b- stor 19-May-13 14:06 docProps/
-rw-------  2.3 unx      709 bl defN 80-Jan-01 00:00 docProps/app.xml
-rw-------  2.3 unx      749 bl defN 80-Jan-01 00:00 docProps/core.xml
drwx------  2.3 unx        0 b- stor 19-May-13 14:06 word/
drwx------  2.3 unx        0 b- stor 19-May-13 14:06 word/_rels/
-rw-------  2.3 unx      817 bl defN 80-Jan-01 00:00 word/_rels/document.xml.rels
-rw-------  2.3 unx     2562 bl defN 80-Jan-01 00:00 word/document.xml
-rw-------  2.3 unx     1419 bl defN 80-Jan-01 00:00 word/fontTable.xml
-rw-------  2.3 unx     2604 bl defN 80-Jan-01 00:00 word/settings.xml
-rw-------  2.3 unx    28902 bl defN 80-Jan-01 00:00 word/styles.xml
drwx------  2.3 unx        0 b- stor 19-May-13 14:06 word/theme/
-rw-------  2.3 unx     8394 bl defN 80-Jan-01 00:00 word/theme/theme1.xml
-rw-------  2.3 unx      655 bl defN 80-Jan-01 00:00 word/webSettings.xml
16 files, 48713 bytes uncompressed, 8105 bytes compressed:  83.4%

gaborcsardi commented 5 years ago

Turns out that the difference is actually in unzip, not in zipr. zip::unzip() sets permissions and also zip version, whereas utils::unzip() does not.

Btw. this would have been obvious if we had insisted on a self-contained reprex, so this is the lesson of the day. :)

❯ utils::unzip("~/Downloads/gabor.docx", exdir = "old")
❯ zip::unzip("~/Downloads/gabor.docx", exdir = "new")
❯ fs::dir_info("old")
# A tibble: 4 x 18
  path       type     size permissions modification_time   user  group device_id
  <fs::path> <fct> <fs::b> <fs::perms> <dttm>              <chr> <chr>     <dbl>
1 old/[Cont… file    1.28K rw-r--r--   2019-05-13 15:39:13 gabo… staff  16777220
2 old/_rels  dire…      96 rwxr-xr-x   2019-05-13 15:39:13 gabo… staff  16777220
3 old/docPr… dire…     128 rwxr-xr-x   2019-05-13 15:39:13 gabo… staff  16777220
4 old/word   dire…     288 rwxr-xr-x   2019-05-13 15:39:13 gabo… staff  16777220
# … with 10 more variables: hard_links <dbl>, special_device_id <dbl>,
#   inode <dbl>, block_size <dbl>, blocks <dbl>, flags <int>, generation <dbl>,
#   access_time <dttm>, change_time <dttm>, birth_time <dttm>

❯ fs::dir_info("new")
# A tibble: 4 x 18
  path       type     size permissions modification_time   user  group device_id
  <fs::path> <fct> <fs::b> <fs::perms> <dttm>              <chr> <chr>     <dbl>
1 new/[Cont… file    1.28K rw-------   1980-01-01 00:00:00 gabo… staff  16777220
2 new/_rels  dire…      96 rwx------   2019-05-13 15:39:21 gabo… staff  16777220
3 new/docPr… dire…     128 rwx------   2019-05-13 15:39:21 gabo… staff  16777220
4 new/word   dire…     288 rwx------   2019-05-13 15:39:21 gabo… staff  16777220
# … with 10 more variables: hard_links <dbl>, special_device_id <dbl>,
#   inode <dbl>, block_size <dbl>, blocks <dbl>, flags <int>, generation <dbl>,
#   access_time <dttm>, change_time <dttm>, birth_time <dttm>

So I guess what you need is a version of zip::unzip() that does not set permissions and dates. Alternatively, you can just use utils::unzip(); it is actually faster anyway.
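
As a rough illustration of the first option, the permissions and timestamps could be reset after extraction with base R alone. This is only a sketch: `normalize_extracted()` is a hypothetical helper, not part of the zip package, the modes "644" and "755" are assumed defaults taken from the utils::unzip() listing above, and it assumes a Unix-like system where Sys.chmod() is effective. It does not establish that permissions are what Word objects to.

```r
# Hypothetical helper (not part of the zip package): reset permissions and
# modification times of everything below `dir` to conventional defaults.
normalize_extracted <- function(dir, file_mode = "644", dir_mode = "755",
                                mtime = Sys.time()) {
  # List files and sub-directories, including dot-files such as "_rels/.rels".
  paths <- list.files(dir, all.files = TRUE, full.names = TRUE,
                      recursive = TRUE, include.dirs = TRUE)
  is_dir <- dir.exists(paths)

  # Directories first, so that their contents stay reachable.
  if (any(is_dir)) {
    Sys.chmod(paths[is_dir], mode = dir_mode, use_umask = FALSE)
  }
  if (any(!is_dir)) {
    Sys.chmod(paths[!is_dir], mode = file_mode, use_umask = FALSE)
  }

  # Replace timestamps such as 1980-01-01 with a single, current one.
  if (length(paths) > 0) {
    Sys.setFileTime(paths, mtime)
  }
  invisible(paths)
}

# Usage, assuming "new" is the directory created by zip::unzip() above:
# normalize_extracted("new")
```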

davidgohel commented 5 years ago

:) OK, actually 2 lessons for me... (really sorry you spent time on it, I had a look at permissions and did not catch that)

gaborcsardi commented 5 years ago

No worries. :)

davidgohel commented 5 years ago

hello

I ran tests yesterday evening, and using utils::unzip() does not solve the issue. The zip_list() results show folder entries when zip::zipr() is used, but not with zip::zip() or utils::zip().

Is there an option to avoid adding the folder?

library(zip)
#> 
#> Attaching package: 'zip'
#> The following objects are masked from 'package:utils':
#> 
#>     unzip, zip
library(writexl)

write_xlsx(x = list(iris = iris), path = "file.xlsx")
zip_list("file.xlsx")
#>                     filename compressed_size uncompressed_size
#> 1   xl/worksheets/sheet1.xml            3803             26635
#> 2            xl/workbook.xml             327               548
#> 3           docProps/app.xml             374               782
#> 4          docProps/core.xml             294               592
#> 5        xl/theme/theme1.xml            1457              6995
#> 6              xl/styles.xml             451              1106
#> 7        [Content_Types].xml             318              1031
#> 8 xl/_rels/workbook.xml.rels             213               556
#> 9                _rels/.rels             233               587
#>             timestamp permissions
#> 1 1979-12-31 23:00:00         600
#> 2 1979-12-31 23:00:00         600
#> 3 1979-12-31 23:00:00         600
#> 4 1979-12-31 23:00:00         600
#> 5 1979-12-31 23:00:00         600
#> 6 1979-12-31 23:00:00         600
#> 7 1979-12-31 23:00:00         600
#> 8 1979-12-31 23:00:00         600
#> 9 1979-12-31 23:00:00         600

unlink("dir", recursive = TRUE, force = TRUE)
utils::unzip("file.xlsx", exdir = "dir")

# This one is not valid -----
setwd("dir")
zip::zipr(zipfile = "../fromzipr.xlsx", files = list.files(path = ".", all.files = FALSE), recurse = TRUE)
setwd("..")
getwd()
#> [1] "/private/var/folders/08/2qdvv0q95wn52xy6mxgj340r0000gn/T/RtmpqejNAI/reprex10b671589521e"
zip_list("fromzipr.xlsx")
#>                      filename compressed_size uncompressed_size
#> 1                      _rels/               0                 0
#> 2                 _rels/.rels             233               587
#> 3         [Content_Types].xml             315              1031
#> 4                   docProps/               0                 0
#> 5            docProps/app.xml             374               782
#> 6           docProps/core.xml             294               592
#> 7                         xl/               0                 0
#> 8                   xl/_rels/               0                 0
#> 9  xl/_rels/workbook.xml.rels             213               556
#> 10              xl/styles.xml             451              1106
#> 11                  xl/theme/               0                 0
#> 12        xl/theme/theme1.xml            1451              6995
#> 13            xl/workbook.xml             327               548
#> 14             xl/worksheets/               0                 0
#> 15   xl/worksheets/sheet1.xml            3540             26635
#>              timestamp permissions
#> 1  2019-05-14 07:26:12         755
#> 2  2019-05-14 07:26:12         644
#> 3  2019-05-14 07:26:12         644
#> 4  2019-05-14 07:26:12         755
#> 5  2019-05-14 07:26:12         644
#> 6  2019-05-14 07:26:12         644
#> 7  2019-05-14 07:26:12         755
#> 8  2019-05-14 07:26:12         755
#> 9  2019-05-14 07:26:12         644
#> 10 2019-05-14 07:26:12         644
#> 11 2019-05-14 07:26:12         755
#> 12 2019-05-14 07:26:12         644
#> 13 2019-05-14 07:26:12         644
#> 14 2019-05-14 07:26:12         755
#> 15 2019-05-14 07:26:12         644

# This one is ok -----
setwd("dir")
zip::zip(zipfile = "../fromzipr.xlsx", files = list.files(all.files = TRUE, recursive = TRUE))
#> Note: zip::zip() is deprecated, please use zip::zipr() instead
setwd("..")
getwd()
#> [1] "/private/var/folders/08/2qdvv0q95wn52xy6mxgj340r0000gn/T/RtmpqejNAI/reprex10b671589521e"
zip_list("fromzipr.xlsx")
#>                     filename compressed_size uncompressed_size
#> 1                _rels/.rels             233               587
#> 2        [Content_Types].xml             315              1031
#> 3           docProps/app.xml             374               782
#> 4          docProps/core.xml             294               592
#> 5 xl/_rels/workbook.xml.rels             213               556
#> 6              xl/styles.xml             451              1106
#> 7        xl/theme/theme1.xml            1451              6995
#> 8            xl/workbook.xml             327               548
#> 9   xl/worksheets/sheet1.xml            3540             26635
#>             timestamp permissions
#> 1 2019-05-14 07:26:12         644
#> 2 2019-05-14 07:26:12         644
#> 3 2019-05-14 07:26:12         644
#> 4 2019-05-14 07:26:12         644
#> 5 2019-05-14 07:26:12         644
#> 6 2019-05-14 07:26:12         644
#> 7 2019-05-14 07:26:12         644
#> 8 2019-05-14 07:26:12         644
#> 9 2019-05-14 07:26:12         644

Created on 2019-05-14 by the reprex package (v0.2.1)

Session info

``` r
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 3.6.0 (2019-04-26)
#>  os       macOS Mojave 10.14.4
#>  system   x86_64, darwin15.6.0
#>  ui       X11
#>  language (EN)
#>  collate  fr_FR.UTF-8
#>  ctype    fr_FR.UTF-8
#>  tz       Europe/Paris
#>  date     2019-05-14
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version date       lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
#>  backports     1.1.4   2019-04-10 [1] CRAN (R 3.6.0)
#>  callr         3.2.0   2019-03-15 [1] CRAN (R 3.6.0)
#>  cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
#>  devtools      2.0.2   2019-04-08 [1] CRAN (R 3.6.0)
#>  digest        0.6.18  2018-10-10 [1] CRAN (R 3.6.0)
#>  evaluate      0.13    2019-02-12 [1] CRAN (R 3.6.0)
#>  fs            1.2.7   2019-03-19 [1] CRAN (R 3.6.0)
#>  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.0)
#>  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.6.0)
#>  knitr         1.22    2019-03-08 [1] CRAN (R 3.6.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
#>  pkgbuild      1.0.3   2019-03-20 [1] CRAN (R 3.6.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
#>  processx      3.3.0   2019-03-10 [1] CRAN (R 3.6.0)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
#>  R6            2.4.0   2019-02-14 [1] CRAN (R 3.6.0)
#>  Rcpp          1.0.1   2019-03-17 [1] CRAN (R 3.6.0)
#>  remotes       2.0.4   2019-04-10 [1] CRAN (R 3.6.0)
#>  rlang         0.3.4   2019-04-07 [1] CRAN (R 3.6.0)
#>  rmarkdown     1.12    2019-03-14 [1] CRAN (R 3.6.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
#>  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.6.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
#>  testthat      2.1.1   2019-04-23 [1] CRAN (R 3.6.0)
#>  usethis       1.5.0   2019-04-07 [1] CRAN (R 3.6.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
#>  writexl     * 1.1     2018-12-02 [1] CRAN (R 3.6.0)
#>  xfun          0.6     2019-04-02 [1] CRAN (R 3.6.0)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.6.0)
#>  zip         * 2.0.1   2019-03-11 [1] CRAN (R 3.6.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
```

gaborcsardi commented 5 years ago

Oh, right! Makes sense. Yes, I can add an option to omit the folders: #34.
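
The two reprex calls differ in the file list they pass: `list.files()` without `recursive = TRUE` returns the top-level directories, which the archiver then stores as separate entries, whereas `list.files(recursive = TRUE)` returns regular files only. The sketch below reproduces that difference with base R on a throwaway directory tree. `dir_entries()` is a hypothetical helper introduced here for illustration (it is not part of the zip package), and the file names are a shortened version of the listings in this thread.

```r
# Build a small directory tree that mimics an extracted Office document.
root <- tempfile("office_")
dir.create(file.path(root, "xl", "_rels"), recursive = TRUE)
dir.create(file.path(root, "_rels"))
file.create(file.path(root, "[Content_Types].xml"),
            file.path(root, "_rels", ".rels"),
            file.path(root, "xl", "workbook.xml"),
            file.path(root, "xl", "_rels", "workbook.xml.rels"))

# Top-level listing: contains the directories "_rels" and "xl". Passing
# these to a recursing zip call is what yields entries such as "xl/".
top_level <- list.files(root)

# Recursive listing: regular files only, because include.dirs defaults to
# FALSE. all.files = TRUE keeps dot-files such as "_rels/.rels".
files_only <- list.files(root, all.files = TRUE, recursive = TRUE)

# Hypothetical helper: pick out the explicit directory records of a zip
# listing, i.e. the entry names that end in "/".
dir_entries <- function(filenames) {
  filenames[endsWith(filenames, "/")]
}

print(top_level)
print(files_only)

# With the zip package installed, the same check works on a real archive.
# "fromzipr.xlsx" is the file name used in the reprex above.
if (requireNamespace("zip", quietly = TRUE) && file.exists("fromzipr.xlsx")) {
  print(dir_entries(zip::zip_list("fromzipr.xlsx")$filename))
}
```

An empty result from `dir_entries()` means the archive holds file entries only, which matches the listing that Word accepted in the reprex.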