jgm / zip-archive

Native Haskell library for working with zip archives

"Couldn't extract ePub file" #63

Closed (phiresky closed 1 year ago)

phiresky commented 1 year ago

Explain the problem.

pandoc cannot read this epub file, and it gives no information beyond the single line shown below. I can extract the same epub without issues when I treat it as a zip file, and ebook-convert from Calibre handles it as well.

❯ pandoc -i s.epub --verbose
Couldn't extract ePub file

Pandoc version? pandoc 3.1, Arch Linux

Input epub file: Software Design X-Rays Fix Technical Debt with Behavioral Code Analysis by Adam Tornhill.zip

jgm commented 1 year ago

I added some additional error reporting and now get

Couldn't extract ePub file: getWordsTilSig: signature not found before EOF

The problem lies in extracting the zip container, and this message is generated by jgm/zip-archive, which evidently doesn't think this is a valid zip. (unzip has no trouble unpacking it, however.)

jgm commented 1 year ago

From zip-archive code:

  skip (fromIntegral extraFieldLength) -- extra field
  compressedData <- if bitflag .&. 0O10 == 0
      then getLazyByteString (fromIntegral compressedSize)
      else -- If bit 3 of general purpose bit flag is set,
           -- then we need to read until we get to the
           -- data descriptor record.  We assume that the
           -- record has signature 0x08074b50; this is not required
           -- by the specification but is common.
           do raw <- getWordsTilSig 0x08074b50

I wonder if this is a case where the signature is different?
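For readers puzzled by the `0O10` literal in the quoted code: it is a Haskell octal literal equal to 8, so the condition tests bit 3 of the general purpose bit flag. Below is a minimal sketch of the same check; `hasDataDescriptor` and `usesOctalMask` are hypothetical names used for illustration, not functions from zip-archive.

```haskell
import Data.Bits (testBit, (.&.))
import Data.Word (Word16)

-- | True when bit 3 of the general purpose bit flag is set, meaning the
-- CRC-32 and sizes are stored after the compressed data in a data
-- descriptor record. (Hypothetical helper, not part of zip-archive.)
hasDataDescriptor :: Word16 -> Bool
hasDataDescriptor bitflag = testBit bitflag 3

-- | The same test written with the octal mask from the quoted code:
-- 0O10 is octal for 8, which is exactly bit 3.
usesOctalMask :: Word16 -> Bool
usesOctalMask bitflag = bitflag .&. 0O10 /= 0
```

Both functions agree on every 16-bit flag value, so the quoted condition `bitflag .&. 0O10 == 0` is simply the "no data descriptor" branch.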

jgm commented 1 year ago

How was this epub produced, do you happen to know?

I unpacked it with unzip (unzip -d sd Software\ Design etc.) and then repacked it with zip (cd sd; zip -r ../sd.epub *). pandoc was then able to handle the repacked sd.epub.

jgm commented 1 year ago

I'll transfer this to zip-archive. This code was added in https://github.com/jgm/zip-archive/pull/29/commits/4d66754458755e3279932608256d3a0d66830021

@mistmist if you're still out there, perhaps you could take a look?

jgm commented 1 year ago

Grepping shows that we don't have the signature "P K 07 08" in this file.
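The same check can be sketched in Haskell, assuming only the bytestring package. The signature 0x08074b50 is stored little-endian, so on disk it appears as the bytes 50 4b 07 08 ("P K 07 08"). The names `dataDescriptorSig`, `sigOffsets` and `hasSigInFile` are made up for this illustration.

```haskell
import qualified Data.ByteString as B

-- | 0x08074b50 in little-endian byte order: 'P', 'K', 0x07, 0x08.
dataDescriptorSig :: B.ByteString
dataDescriptorSig = B.pack [0x50, 0x4b, 0x07, 0x08]

-- | All byte offsets at which the signature occurs in the input.
sigOffsets :: B.ByteString -> [Int]
sigOffsets = go 0
  where
    go base bs =
      let (before, rest) = B.breakSubstring dataDescriptorSig bs
      in if B.null rest
           then []  -- signature not found in the remaining bytes
           else
             let off = base + B.length before
             in off : go (off + 1) (B.drop (B.length before + 1) bs)

-- | Whether a file on disk contains the signature anywhere.
hasSigInFile :: FilePath -> IO Bool
hasSigInFile path = B.isInfixOf dataDescriptorSig <$> B.readFile path
```

A hit does not prove that a data descriptor is present, since the four bytes could also occur inside compressed data by chance, but an empty result does show that no entry in the file carries the signature.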

jgm commented 1 year ago

The documentation says

4.3.9.3 Although not originally assigned a signature, the value 0x08074b50 has commonly been adopted as a signature value for the data descriptor record. Implementers SHOULD be aware that ZIP files MAY be encountered with or without this signature marking data descriptors and SHOULD account for either case when reading ZIP files to ensure compatibility.

So I guess this is just a case where we don't have a signature. In that case I assume the data descriptor is just the last 12 bytes before the start of the next local file header (or before whatever record comes next instead, e.g. one with signature 0x08064b50 or 0x02014b50).
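A rough sketch of that idea, not the actual fix in zip-archive: given the bytes from the start of the compressed data up to the next record signature, treat the last 12 bytes as CRC-32, compressed size and uncompressed size (three little-endian 32-bit words), and accept an optional signature in front of them. This assumes a non-ZIP64 entry (ZIP64 descriptors use 8-byte sizes); `DataDescriptor`, `le32` and `splitTrailingDescriptor` are hypothetical names.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

data DataDescriptor = DataDescriptor
  { ddCrc32            :: Word32
  , ddCompressedSize   :: Word32
  , ddUncompressedSize :: Word32
  } deriving (Eq, Show)

-- | Decode a little-endian 32-bit word from four bytes.
le32 :: B.ByteString -> Word32
le32 = B.foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- | Split a chunk that ends right before the next record signature into
-- (compressed data, descriptor). Returns Nothing when the chunk is too
-- short or the recorded compressed size does not match the data length.
splitTrailingDescriptor :: B.ByteString -> Maybe (B.ByteString, DataDescriptor)
splitTrailingDescriptor chunk
  | B.length chunk < 12 = Nothing
  | B.length payload /= fromIntegral (ddCompressedSize dd) = Nothing
  | otherwise = Just (payload, dd)
  where
    (body, trailer) = B.splitAt (B.length chunk - 12) chunk
    dd = DataDescriptor
           (le32 (B.take 4 trailer))
           (le32 (B.take 4 (B.drop 4 trailer)))
           (le32 (B.drop 8 trailer))
    sig = B.pack [0x50, 0x4b, 0x07, 0x08]  -- optional 0x08074b50 marker
    -- Drop the optional signature only if the sizes then line up.
    payload
      | sig `B.isSuffixOf` body
      , B.length body - 4 == fromIntegral (ddCompressedSize dd) =
          B.take (B.length body - 4) body
      | otherwise = body
```

Checking the recorded compressed size against the actual data length guards against a false match, for example when the next record's signature bytes happen to occur inside the compressed data.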

mistmist commented 1 year ago

sorry for the inconvenience, thanks @jgm for fixing this! (i was hoping to look into it this weekend)