ukwa / epub-streamer

A simple standalone service to stream the contents of zipped ePubs.
Apache License 2.0

Turn into an 'Interject' service that runs format ID & conversions #3

Open anjackson opened 1 year ago

anjackson commented 1 year ago

There are some malformed ePubs that do not have the right file signatures.

F:\epub>java -Xss512k -jar epubcheck-5.1.0/epubcheck.jar vdc_100079335646.0x000001.epub
ERROR(PKG-006): vdc_100079335646.0x000001.epub//F:/epub/vdc_100079335646.0x000001.epub(-1,-1): Mimetype file entry is missing or is not the first file in the archive.
Validating using EPUB version 3.3 rules.

Check finished with errors
Messages: 0 fatals / 1 error / 0 warnings / 0 infos

EPUBCheck completed

And this caused problems during ingest, which means the download has this content type:

Content-Type: text/plain; charset=UTF-8

Whereas for well-formed ePubs, we see:

Content-Type: application/epub+zip

The system therefore can't invoke the ePub reader, so these files end up as a plain download, because they are marked as text/plain. That is also the failover mode, as the following code shows:

https://github.com/ukwa/epub-streamer/blob/05aa44a1d4118a3d6ebddac00193770a1ade6593/streamer.py#L48-L51

Then the format-based download-blocking logic in ukwa-pywb fails to block the download (because text/plain is always allowed).

It seems the browser realises the content isn't text and offers it as a ZIP download instead.

It's not clear how best to resolve this. First issue is that we need to block downloads - fixing things up so the borked ePubs work is a secondary issue.

This service acts as a proxy for all DLS content, which mostly means PDFs and ePubs. So, when we pass through the source Content-Type, perhaps we could intervene if we detect text/plain? Or possibly assume it's supposed to be an ePub and repair the interaction by fixing the MIME type?

Looking at the problematic file, the mimetype entry is there, stored uncompressed, but it is the second entry in the archive rather than the first (as the standard requires). Files like this can therefore still be detected, but they require a 'relaxed'/custom signature.
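A 'relaxed' signature along those lines could be sketched roughly as follows. This is only an illustration, not code from streamer.py: the function name `looks_like_epub` is hypothetical, and it assumes the service can get hold of the first few KB of the file. It walks the ZIP local file headers in that prefix and accepts a stored `mimetype` entry at any position, not just the first.

```python
import struct

EPUB_MIME = b"application/epub+zip"
LOCAL_HEADER = b"PK\x03\x04"  # ZIP local file header signature


def looks_like_epub(head: bytes) -> bool:
    """Relaxed check (hypothetical helper): True if ANY local file header
    found in `head` is a stored 'mimetype' entry whose content starts with
    the ePub MIME type. The strict rule would require it to be entry #1."""
    pos = head.find(LOCAL_HEADER)
    while pos != -1:
        fixed = head[pos:pos + 30]
        if len(fixed) < 30:
            break  # header truncated by the end of the prefix
        # Fixed part of a local file header: 30 bytes, little-endian.
        (_sig, _version, _flags, method, _mtime, _mdate, _crc,
         _csize, _usize, name_len, extra_len) = struct.unpack("<4s5H3L2H", fixed)
        name_start = pos + 30
        name = head[name_start:name_start + name_len]
        data_start = name_start + name_len + extra_len
        # method 0 means 'stored', i.e. uncompressed
        if name == b"mimetype" and method == 0:
            if head[data_start:data_start + len(EPUB_MIME)] == EPUB_MIME:
                return True
        # Move on to the next candidate signature.
        pos = head.find(LOCAL_HEADER, pos + 4)
    return False
```

Searching for the signature bytes can also hit a stray `PK\x03\x04` inside compressed data, but such a hit would then have to carry the name `mimetype` and the right content, so false positives should be rare. A prefix match on the content also tolerates trailing bytes after the MIME type, such as the CR LF seen in the 'good' file.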

anjackson commented 1 year ago

Some notes, from a 'good' and a 'bad' ePub.

A good one looks like:

[root@sh ~]# hexdump -C temp.zip | head
00000000  50 4b 03 04 0a 00 00 00  00 00 9b 9e 97 3c a1 8f  |PK...........<..|
00000010  72 cc 16 00 00 00 16 00  00 00 08 00 00 00 6d 69  |r.............mi|
00000020  6d 65 74 79 70 65 61 70  70 6c 69 63 61 74 69 6f  |metypeapplicatio|
00000030  6e 2f 65 70 75 62 2b 7a  69 70 0d 0a 50 4b 03 04  |n/epub+zip..PK..|
00000040  0a 00 00 00 00 00 9b 9e  97 3c 00 00 00 00 00 00  |.........<......|
00000050  00 00 00 00 00 00 09 00  00 00 4d 45 54 41 2d 49  |..........META-I|
00000060  4e 46 2f 50 4b 03 04 14  00 00 00 08 00 9b 9e 97  |NF/PK...........|
00000070  3c 8f c9 f9 75 ae 00 00  00 01 01 00 00 16 00 00  |<...u...........|
00000080  00 4d 45 54 41 2d 49 4e  46 2f 63 6f 6e 74 61 69  |.META-INF/contai|
00000090  6e 65 72 2e 78 6d 6c 5d  8e c1 0a c2 30 10 44 ef  |ner.xml]....0.D.|

And the 'bad' one:

[root@sh ~]# hexdump -C temp-bad.zip | head -20
00000000  50 4b 03 04 14 00 00 00  08 00 15 3c 54 4c 2e 49  |PK.........<TL.I|
00000010  84 c8 9b 00 00 00 e8 00  00 00 16 00 00 00 4d 45  |..............ME|
00000020  54 41 2d 49 4e 46 2f 63  6f 6e 74 61 69 6e 65 72  |TA-INF/container|
00000030  2e 78 6d 6c 55 8e 41 0a  c2 30 10 45 f7 85 de a1  |.xmlU.A..0.E....|
00000040  64 2b 6d 74 1b 9a 7a 04  05 4f 30 a6 53 0d 26 99  |d+mt..z..O0.S.&.|
00000050  21 49 45 6f 6f 2a 52 ec  72 f8 8f f7 a6 3f be bc  |!IEoo*R.r....?..|
00000060  6b 9e 18 93 a5 a0 c5 a1  db 8b e3 50 57 bd a1 90  |k..........PW...|
00000070  c1 06 8c db ad ae 0a 1e  92 16 73 0c 8a 20 d9 a4  |..........s.. ..|
00000080  02 78 4c 2a 1b 45 8c 61  24 33 7b 0c 59 7d 31 b5  |.xL*.E.a$3{.Y}1.|
00000090  5a c4 e2 8c 44 79 b2 0e  d3 ff d1 4c b3 73 2d 43  |Z...Dy.....L.s-C|
000000a0  be 6b 71 3a 5f 24 83 79  c0 0d 3b e2 a9 d4 3c 8e  |.kq:_$.y..;...<.|
000000b0  16 da fc 66 d4 02 98 9d  35 90 cb 33 92 f0 ca a9  |...f....5..3....|
000000c0  fd b1 bb 12 13 8d 5c ac  72 d3 90 6b 7f f8 00 50  |......\.r..k...P|
000000d0  4b 03 04 14 00 00 00 00  00 ef 30 aa 3c 6f 61 ab  |K.........0.<oa.|
000000e0  2c 14 00 00 00 14 00 00  00 08 00 00 00 6d 69 6d  |,............mim|
000000f0  65 74 79 70 65 61 70 70  6c 69 63 61 74 69 6f 6e  |etypeapplication|
00000100  2f 65 70 75 62 2b 7a 69  70 50 4b 03 04 14 00 00  |/epub+zipPK.....|
00000110  00 08 00 ba 86 f3 48 20  fd 98 9a 87 3d 00 00 3c  |......H ....=..<|
00000120  75 00 00 18 00 00 00 4f  50 53 2f 66 6f 6e 74 73  |u......OPS/fonts|
00000130  2f 47 6f 74 68 61 6d 42  6f 6c 64 2e 6f 74 66 cd  |/GothamBold.otf.|

Interestingly, Siegfried doesn't mind, and uses the container signature from PRONOM:

[root@sh ~]# ./sf temp.zip temp-bad.zip
---
siegfried   : 1.10.1
scandate    : 2023-08-08T16:57:38+01:00
signature   : default.sig
created     : 2023-05-12T09:10:13Z
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
---
filename : 'temp.zip'
filesize : 587228
modified : 2023-08-08T16:56:09+01:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/483'
    format  : 'ePub format'
    version :
    mime    : 'application/epub+zip'
    class   : 'Text (Structured)'
    basis   : 'container name mimetype with byte match at 0, 20'
    warning : 'extension mismatch'
---
filename : 'temp-bad.zip'
filesize : 245119196
modified : 2023-08-08T16:40:31+01:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/483'
    format  : 'ePub format'
    version :
    mime    : 'application/epub+zip'
    class   : 'Text (Structured)'
    basis   : 'container name mimetype with byte match at 0, 20'
    warning : 'extension mismatch'

So, if we could run Siegfried/Fido on the remote ZIP, we could ID the file pretty effectively.

anjackson commented 1 year ago

Found httpio, which provides a file-like API to any HTTP resource that supports range requests, so this could be used to support fast full format identification via a modified version of Fido.

It doesn't look like httpio is being updated much, but there is a BBC fork that could be used instead if necessary.
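To make the idea concrete, here is a minimal standard-library sketch of the same technique (a seekable file-like object backed by HTTP range requests). It is not httpio's implementation: the class name `HttpRangeFile` is made up, the total size is assumed to be known already (for example from a Content-Length header), and the `fetch` hook exists only so the reader can be exercised without a network.

```python
import io
import urllib.request


class HttpRangeFile(io.RawIOBase):
    """Hypothetical read-only, seekable view of a remote resource that
    fetches only the byte ranges that are actually read."""

    def __init__(self, url, size, fetch=None):
        self.url = url
        self.size = size  # assumed known, e.g. from Content-Length
        self.pos = 0
        self._fetch = fetch or self._http_fetch

    def _http_fetch(self, start, end):
        # Range header uses inclusive byte offsets.
        req = urllib.request.Request(
            self.url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:
                raise OSError("server did not honour the Range request")
            return resp.read()

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0,
                io.SEEK_CUR: self.pos,
                io.SEEK_END: self.size}[whence]
        self.pos = max(0, base + offset)
        return self.pos

    def readinto(self, buffer):
        # RawIOBase.read() is built on top of readinto().
        if self.pos >= self.size or len(buffer) == 0:
            return 0
        end = min(self.pos + len(buffer), self.size) - 1
        data = self._fetch(self.pos, end)
        buffer[:len(data)] = data
        self.pos += len(data)
        return len(data)
```

Because `zipfile.ZipFile` only needs `seek`, `tell` and `read`, an object like this can be passed to it directly, so listing a large remote ePub should only pull the central directory and the entries that are opened rather than the whole file.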

Assuming Fido can correctly identify these ePubs...

[root@sh ~]# fido temp-bad.zip
FIDO v1.6.1 (formats-v109.xml, container-signature-20200121.xml, format_extensions.xml)
OK,769,fmt/483,"ePub format","ePub format",245119196,"temp-bad.zip","application/epub+zip","container"

...which it can, then this project could be modified into a more generic Interject service that checks formats, fixes up Content-Type headers, opens up ZIPs, etc. If we take that route, this should probably also absorb the convert-format-to-HTML and other format manipulations that are currently part of ukwa-pywb.
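As a rough sketch of the header fix-up step only: the snippet below swaps in a container-level check using Python's `zipfile` (similar in spirit to the PRONOM container signature, but not Fido or Siegfried themselves). The function names are hypothetical, and `fp` is assumed to be any seekable file-like object over the source content.

```python
import zipfile

EPUB_MIME = "application/epub+zip"


def is_epub_container(fp) -> bool:
    """True if `fp` is a ZIP holding a 'mimetype' entry with the ePub MIME
    type, regardless of where that entry sits in the archive."""
    try:
        with zipfile.ZipFile(fp) as zf:
            if "mimetype" not in zf.namelist():
                return False
            return zf.read("mimetype").strip() == EPUB_MIME.encode("ascii")
    except zipfile.BadZipFile:
        return False


def corrected_content_type(upstream_type: str, fp) -> str:
    """Only intervene when upstream says text/plain; pass anything else
    through untouched (assumed policy, for illustration)."""
    media_type = upstream_type.split(";", 1)[0].strip().lower()
    if media_type == "text/plain" and is_epub_container(fp):
        return EPUB_MIME
    return upstream_type
```

With the header corrected to application/epub+zip, the existing format-based blocking would have a chance to recognise these files; whether that blocking stays in ukwa-pywb or moves here is a separate decision.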

https://github.com/ukwa/ukwa-pywb/blob/eec2b802213783395890311106aea09ca1630191/config.yaml#L25-L44

anjackson commented 1 year ago

One important note on this - the current system avoids direct downloads via URL hacking, because you can only go via ukwa-pywb which proxies requests, and the 'raw' service is only visible on the server side. Pushing the format blocking upstream might allow that to be circumvented.

That said, we primarily rely on the use of secure PCs or the NPLD Player to provide content security, so it's not a critical issue.

anjackson commented 1 year ago

Okay, need to split this into short-term (#4) and longer-term.