readium / readium-lcp-server

Repository for the Readium LCP Server
BSD 3-Clause "New" or "Revised" License
The cover image is deflated in the zip directory #205

Closed (llemeurfr closed 4 years ago)

llemeurfr commented 4 years ago

Found by using zipinfo on e.g. Moby Dick.

In package.opf: <item id="cover-image" properties="cover-image" href="images/9780316000000.jpg" media-type="image/jpeg"/>

In zipinfo:

  OPS/images/9780316000000.jpg

  version of encoding software:                   2.0
  compression method:                             deflated
  compression sub-type (deflation):               normal

danielweck commented 4 years ago

Sure, but that's because the cover image is not encrypted, right?

danielweck commented 4 years ago

Laurent, I downloaded your accessible-epub3.lcpl from the EDRLab prod frontend:

zipinfo accessible-epub3.lcpl.epub

=>

Zip file size: 4107970 bytes, number of entries: 41
...
-rw-rw-r--  6.3 unx   820269 bl defN 20-Jan-13 16:37 EPUB/covers/9781449328030_lrg.jpg
...
41 files, 4353815 bytes uncompressed, 4102536 bytes compressed:  5.8%

zipinfo -v accessible-epub3.lcpl.epub

=>

Central directory entry #31:
---------------------------

  There are an extra 16 bytes preceding this file.

  EPUB/covers/9781449328030_lrg.jpg

  offset of local header from start of archive:   3485498
                                                  (0000000000352F3Ah) bytes
  file system or operating system of origin:      Unix
  version of encoding software:                   6.3
  minimum file system compatibility required:     MS-DOS, OS/2 or NT FAT
  minimum software version required to extract:   2.0
  compression method:                             deflated
  compression sub-type (deflation):               normal
  file security status:                           not encrypted
  extended local header:                          yes
  file last modified on (DOS date/time):          2020 Jan 13 16:37:32
  32-bit CRC value (hex):                         29ed8392
  compressed size:                                607003 bytes
  uncompressed size:                              820269 bytes
  length of filename:                             33 characters
  length of extra field:                          0 bytes
  length of file comment:                         0 characters
  disk number on which file begins:               disk 1
  apparent file type:                             binary
  Unix file attributes (100664 octal):            -rw-rw-r--
  MS-DOS file attributes (00 hex):                none

  There is no file comment.

danielweck commented 4 years ago

However, there seems to be a problem with the encrypted files: they are deflated in the zip directory, whereas they should be stored (just like the first entry, mimetype).

danielweck commented 4 years ago

Yep, HTML files are compressed / deflated in the zip directory, which results in a larger size (due to padding)! For example:

Central directory entry #36:
---------------------------

  There are an extra 16 bytes preceding this file.

  EPUB/pr01s04.xhtml

  offset of local header from start of archive:   4099690
                                                  (00000000003E8E6Ah) bytes
  file system or operating system of origin:      Unix
  version of encoding software:                   6.3
  minimum file system compatibility required:     MS-DOS, OS/2 or NT FAT
  minimum software version required to extract:   2.0
  compression method:                             deflated
  compression sub-type (deflation):               normal
  file security status:                           not encrypted
  extended local header:                          yes
  file last modified on (DOS date/time):          2020 Jan 13 16:37:32
  32-bit CRC value (hex):                         9ac371eb
  compressed size:                                885 bytes
  uncompressed size:                              880 bytes
  length of filename:                             18 characters
  length of extra field:                          0 bytes
  length of file comment:                         0 characters
  disk number on which file begins:               disk 1
  apparent file type:                             binary
  Unix file attributes (100664 octal):            -rw-rw-r--
  MS-DOS file attributes (00 hex):                none

  There is no file comment.

llemeurfr commented 4 years ago

Shouldn't the cover image be stored (and not deflated) in the zip, because deflating it does not provide any benefit regarding size? And this independently of the fact that it is not encrypted.

Re. encrypted Codec files, I see them properly stored (not deflated) in the encrypted EPUBs generated by the LCP server.

For the HTML files that are larger deflated than stored, this is an edge case triggered on small files (880 bytes here). We can live with that IMO.

danielweck commented 4 years ago

I think that the culprit is in the encrypt function:

https://github.com/readium/readium-lcp-server/blob/a57a6e23294b05e9737f71df722745113aeb2da0/pack/pack.go#L193

...it should be file.StorageMethod = zip.Store (0, i.e. no compression), never zip.Deflate (8). That value is passed on to AddResource of epub.Writer:

https://github.com/readium/readium-lcp-server/blob/a57a6e23294b05e9737f71df722745113aeb2da0/epub/writer.go#L43

Also, this call to w.AddResource in writer.go incorrectly uses res.StorageMethod, I think:

https://github.com/readium/readium-lcp-server/blob/a57a6e23294b05e9737f71df722745113aeb2da0/epub/writer.go#L89

Instead, we should test whether the ZIP entry is encrypted, and use store instead of deflate:

    if encryption != nil {
        // the returned data is not needed here, only whether the entry is listed
        if _, ok := encryption.DataForFile(file.Name); ok {
            // encrypted entry: use zip.Store (0) instead of res.StorageMethod
        }
    }

danielweck commented 4 years ago

> Shouldn't the cover image be stored (and not deflated) in the zip, because deflating it does not provide any benefit regarding size? And this independently of the fact that it is not encrypted.

Snippet from the full zipinfo I provided in a previous comment:

EPUB/covers/9781449328030_lrg.jpg

  compressed size:                                607003 bytes
  uncompressed size:                              820269 bytes

danielweck commented 4 years ago

> Re. encrypted Codec files, I see them properly stored (not deflated) in the encrypted EPUBs generated by the LCP server.

Not with the current EDRLab LCP prod frontend, it seems:

zipinfo accessible-epub3.lcpl.epub

=>

Archive:  accessible-epub3.lcpl.epub
Zip file size: 4107970 bytes, number of entries: 41
-rw-rw-r--  6.3 unx       20 bl stor 20-Jan-13 16:37 mimetype
-rw-rw-r--  6.3 unx     9296 bl defN 20-Jan-13 16:37 EPUB/ch03s02.xhtml
-rw-rw-r--  6.3 unx     5376 bl defN 20-Jan-13 16:37 EPUB/ch03.xhtml
-rw-rw-r--  6.3 unx   109856 bl defN 20-Jan-13 16:37 EPUB/fonts/UbuntuMono-B.ttf
-rw-rw-r--  6.3 unx   129472 bl defN 20-Jan-13 16:37 EPUB/fonts/UbuntuMono-BI.ttf
-rw-rw-r--  6.3 unx   206928 bl defN 20-Jan-13 16:37 EPUB/fonts/FreeSansBold.otf
-rw-rw-r--  6.3 unx   116528 bl defN 20-Jan-13 16:37 EPUB/fonts/UbuntuMono-RI.ttf
-rw-rw-r--  6.3 unx  1284512 bl defN 20-Jan-13 16:37 EPUB/fonts/FreeSerif.otf
-rw-rw-r--  6.3 unx   114208 bl defN 20-Jan-13 16:37 EPUB/fonts/UbuntuMono-R.ttf
-rw-rw-r--  6.3 unx     2864 bl defN 20-Jan-13 16:37 EPUB/ch03s04.xhtml
-rw-rw-r--  6.3 unx    10656 bl defN 20-Jan-13 16:37 EPUB/ch03s05.xhtml
-rw-rw-r--  6.3 unx     4160 bl defN 20-Jan-13 16:37 EPUB/ch02s02.xhtml
-rw-rw-r--  6.3 unx      112 bl defN 20-Jan-13 16:37 EPUB/css/synth.css
-rw-rw-r--  6.3 unx     4928 bl defN 20-Jan-13 16:37 EPUB/css/epub.css
-rw-rw-r--  6.3 unx  1131440 bl defN 20-Jan-13 16:37 EPUB/images/web/epub3_0401.png
-rw-rw-r--  6.3 unx   302576 bl defN 20-Jan-13 16:37 EPUB/images/spi_global_ad.png
-rw-rw-r--  6.3 unx    13872 bl defN 20-Jan-13 16:37 EPUB/ch03s03.xhtml
-rw-rw-r--  6.3 unx     4890 bl defN 20-Jan-13 16:37 EPUB/package.opf
-rw-rw-r--  6.3 unx     4187 bl defN 20-Jan-13 16:37 EPUB/bk01-toc.xhtml
-rw-rw-r--  6.3 unx      304 bl defN 20-Jan-13 16:37 EPUB/co01.xhtml
-rw-rw-r--  6.3 unx      304 bl defN 20-Jan-13 16:37 EPUB/cover.xhtml
-rw-rw-r--  6.3 unx     1152 bl defN 20-Jan-13 16:37 EPUB/index.xhtml
-rw-rw-r--  6.3 unx    21632 bl defN 20-Jan-13 16:37 EPUB/ch02.xhtml
-rw-rw-r--  6.3 unx      736 bl defN 20-Jan-13 16:37 EPUB/pr01s05.xhtml
-rw-rw-r--  6.3 unx     2096 bl defN 20-Jan-13 16:37 EPUB/ch04.xhtml
-rw-rw-r--  6.3 unx      288 bl defN 20-Jan-13 16:37 EPUB/spi-ad.xhtml
-rw-rw-r--  6.3 unx     1872 bl defN 20-Jan-13 16:37 EPUB/pr01.xhtml
-rw-rw-r--  6.3 unx     2992 bl defN 20-Jan-13 16:37 EPUB/ch01.xhtml
-rw-rw-r--  6.3 unx      880 bl defN 20-Jan-13 16:37 EPUB/pr01s02.xhtml
-rw-rw-r--  6.3 unx      880 bl defN 20-Jan-13 16:37 EPUB/pr01s03.xhtml
-rw-rw-r--  6.3 unx   820269 bl defN 20-Jan-13 16:37 EPUB/covers/9781449328030_lrg.jpg
-rw-rw-r--  6.3 unx     3456 bl defN 20-Jan-13 16:37 EPUB/ch02s03.xhtml
-rw-rw-r--  6.3 unx      224 bl defN 20-Jan-13 16:37 EPUB/lexicon/en.pls
-rw-rw-r--  6.3 unx      208 bl defN 20-Jan-13 16:37 EPUB/lexicon/fr.pls
-rw-rw-r--  6.3 unx     2944 bl defN 20-Jan-13 16:37 EPUB/ch01s02.xhtml
-rw-rw-r--  6.3 unx      880 bl defN 20-Jan-13 16:37 EPUB/pr01s04.xhtml
-rw-rw-r--  6.3 unx     1216 bl defN 20-Jan-13 16:37 EPUB/ch03s06.xhtml
-rw-rw-r--  6.3 unx      263 bl defN 20-Jan-13 16:37 META-INF/container.xml
-rw-rw-r--  6.3 unx       62 bl defN 20-Jan-13 16:37 META-INF/calibre_bookmarks.txt
-rw-r--r--  6.3 unx     2676 bl defN 20-Jan-13 16:36 META-INF/license.lcpl
-rw-rw-r--  6.3 unx    32600 bl defN 20-Jan-13 16:37 META-INF/encryption.xml

danielweck commented 4 years ago

> For the HTML files that are larger deflated than stored, this is an edge case triggered on small files (880 bytes here). We can live with that IMO.

The fact that the current LCP server Go implementation incorrectly deflates encrypted entries in the zip directory impacts audio/video performance unnecessarily. In fact, there is a penalty for large HTML or CSS files too, when reading a ZIP entry: inflate + decrypt + inflate.

llemeurfr commented 4 years ago

Ah, if there is a 25% win in size, we can leave it deflated and close this issue.

Re. what you get in the file, I don't see that in my instance of the lcp server:

zipinfo accepub.zip
Archive:  accepub.zip
Zip file size: 4106104 bytes, number of entries: 39
-rw----     2.0 fat       20 bl stor 80-000-00 00:00 mimetype
-rw----     2.0 fat     4284 bl defN 80-000-00 00:00 EPUB/bk01-toc.xhtml
-rw----     2.0 fat     2992 bl stor 80-000-00 00:00 EPUB/ch01.xhtml
-rw----     2.0 fat     2944 bl stor 80-000-00 00:00 EPUB/ch01s02.xhtml
-rw----     2.0 fat    21712 bl stor 80-000-00 00:00 EPUB/ch02.xhtml
-rw----     2.0 fat     4176 bl stor 80-000-00 00:00 EPUB/ch02s02.xhtml
-rw----     2.0 fat     3472 bl stor 80-000-00 00:00 EPUB/ch02s03.xhtml
-rw----     2.0 fat     5392 bl stor 80-000-00 00:00 EPUB/ch03.xhtml
-rw----     2.0 fat     9328 bl stor 80-000-00 00:00 EPUB/ch03s02.xhtml
-rw----     2.0 fat    13920 bl stor 80-000-00 00:00 EPUB/ch03s03.xhtml
-rw----     2.0 fat     2880 bl stor 80-000-00 00:00 EPUB/ch03s04.xhtml
-rw----     2.0 fat    10688 bl stor 80-000-00 00:00 EPUB/ch03s05.xhtml
-rw----     2.0 fat     1216 bl stor 80-000-00 00:00 EPUB/ch03s06.xhtml
-rw----     2.0 fat     2096 bl stor 80-000-00 00:00 EPUB/ch04.xhtml
-rw----     2.0 fat      304 bl stor 80-000-00 00:00 EPUB/co01.xhtml
-rw----     2.0 fat      304 bl stor 80-000-00 00:00 EPUB/cover.xhtml
-rw----     2.0 fat   820269 bl defN 80-000-00 00:00 EPUB/covers/9781449328030_lrg.jpg
-rw----     2.0 fat     4976 bl stor 80-000-00 00:00 EPUB/css/epub.css
-rw----     2.0 fat      112 bl stor 80-000-00 00:00 EPUB/css/synth.css
-rw----     2.0 fat   206928 bl stor 80-000-00 00:00 EPUB/fonts/FreeSansBold.otf
-rw----     2.0 fat  1284512 bl stor 80-000-00 00:00 EPUB/fonts/FreeSerif.otf
-rw----     2.0 fat   109856 bl stor 80-000-00 00:00 EPUB/fonts/UbuntuMono-B.ttf
-rw----     2.0 fat   129472 bl stor 80-000-00 00:00 EPUB/fonts/UbuntuMono-BI.ttf
-rw----     2.0 fat   114208 bl stor 80-000-00 00:00 EPUB/fonts/UbuntuMono-R.ttf
-rw----     2.0 fat   116528 bl stor 80-000-00 00:00 EPUB/fonts/UbuntuMono-RI.ttf
-rw----     2.0 fat   302576 bl stor 80-000-00 00:00 EPUB/images/spi_global_ad.png
-rw----     2.0 fat  1131440 bl stor 80-000-00 00:00 EPUB/images/web/epub3_0401.png
-rw----     2.0 fat     1152 bl stor 80-000-00 00:00 EPUB/index.xhtml
-rw----     2.0 fat      240 bl stor 80-000-00 00:00 EPUB/lexicon/en.pls
-rw----     2.0 fat      224 bl stor 80-000-00 00:00 EPUB/lexicon/fr.pls
-rw----     2.0 fat     4972 bl defN 80-000-00 00:00 EPUB/package.opf
-rw----     2.0 fat     1888 bl stor 80-000-00 00:00 EPUB/pr01.xhtml
-rw----     2.0 fat      880 bl stor 80-000-00 00:00 EPUB/pr01s02.xhtml
-rw----     2.0 fat      896 bl stor 80-000-00 00:00 EPUB/pr01s03.xhtml
-rw----     2.0 fat      880 bl stor 80-000-00 00:00 EPUB/pr01s04.xhtml
-rw----     2.0 fat      736 bl stor 80-000-00 00:00 EPUB/pr01s05.xhtml
-rw----     2.0 fat      288 bl stor 80-000-00 00:00 EPUB/spi-ad.xhtml
-rw----     2.0 fat      269 bl defN 80-000-00 00:00 META-INF/container.xml
-rw----     2.0 fat    32600 bl defN 80-000-00 00:00 META-INF/encryption.xml
39 files, 4351630 bytes uncompressed, 4100956 bytes compressed:  5.8%

llemeurfr commented 4 years ago

The version of the prod frontend is not up-to-date. I have to get it updated asap.