Open anjackson opened 1 year ago
Some notes, from a 'good' and a 'bad' ePub.
A good one looks like:
[root@sh ~]# hexdump -C temp.zip | head
00000000 50 4b 03 04 0a 00 00 00 00 00 9b 9e 97 3c a1 8f |PK...........<..|
00000010 72 cc 16 00 00 00 16 00 00 00 08 00 00 00 6d 69 |r.............mi|
00000020 6d 65 74 79 70 65 61 70 70 6c 69 63 61 74 69 6f |metypeapplicatio|
00000030 6e 2f 65 70 75 62 2b 7a 69 70 0d 0a 50 4b 03 04 |n/epub+zip..PK..|
00000040 0a 00 00 00 00 00 9b 9e 97 3c 00 00 00 00 00 00 |.........<......|
00000050 00 00 00 00 00 00 09 00 00 00 4d 45 54 41 2d 49 |..........META-I|
00000060 4e 46 2f 50 4b 03 04 14 00 00 00 08 00 9b 9e 97 |NF/PK...........|
00000070 3c 8f c9 f9 75 ae 00 00 00 01 01 00 00 16 00 00 |<...u...........|
00000080 00 4d 45 54 41 2d 49 4e 46 2f 63 6f 6e 74 61 69 |.META-INF/contai|
00000090 6e 65 72 2e 78 6d 6c 5d 8e c1 0a c2 30 10 44 ef |ner.xml]....0.D.|
And the 'bad' one:
[root@sh ~]# hexdump -C temp-bad.zip | head -20
00000000 50 4b 03 04 14 00 00 00 08 00 15 3c 54 4c 2e 49 |PK.........<TL.I|
00000010 84 c8 9b 00 00 00 e8 00 00 00 16 00 00 00 4d 45 |..............ME|
00000020 54 41 2d 49 4e 46 2f 63 6f 6e 74 61 69 6e 65 72 |TA-INF/container|
00000030 2e 78 6d 6c 55 8e 41 0a c2 30 10 45 f7 85 de a1 |.xmlU.A..0.E....|
00000040 64 2b 6d 74 1b 9a 7a 04 05 4f 30 a6 53 0d 26 99 |d+mt..z..O0.S.&.|
00000050 21 49 45 6f 6f 2a 52 ec 72 f8 8f f7 a6 3f be bc |!IEoo*R.r....?..|
00000060 6b 9e 18 93 a5 a0 c5 a1 db 8b e3 50 57 bd a1 90 |k..........PW...|
00000070 c1 06 8c db ad ae 0a 1e 92 16 73 0c 8a 20 d9 a4 |..........s.. ..|
00000080 02 78 4c 2a 1b 45 8c 61 24 33 7b 0c 59 7d 31 b5 |.xL*.E.a$3{.Y}1.|
00000090 5a c4 e2 8c 44 79 b2 0e d3 ff d1 4c b3 73 2d 43 |Z...Dy.....L.s-C|
000000a0 be 6b 71 3a 5f 24 83 79 c0 0d 3b e2 a9 d4 3c 8e |.kq:_$.y..;...<.|
000000b0 16 da fc 66 d4 02 98 9d 35 90 cb 33 92 f0 ca a9 |...f....5..3....|
000000c0 fd b1 bb 12 13 8d 5c ac 72 d3 90 6b 7f f8 00 50 |......\.r..k...P|
000000d0 4b 03 04 14 00 00 00 00 00 ef 30 aa 3c 6f 61 ab |K.........0.<oa.|
000000e0 2c 14 00 00 00 14 00 00 00 08 00 00 00 6d 69 6d |,............mim|
000000f0 65 74 79 70 65 61 70 70 6c 69 63 61 74 69 6f 6e |etypeapplication|
00000100 2f 65 70 75 62 2b 7a 69 70 50 4b 03 04 14 00 00 |/epub+zipPK.....|
00000110 00 08 00 ba 86 f3 48 20 fd 98 9a 87 3d 00 00 3c |......H ....=..<|
00000120 75 00 00 18 00 00 00 4f 50 53 2f 66 6f 6e 74 73 |u......OPS/fonts|
00000130 2f 47 6f 74 68 61 6d 42 6f 6c 64 2e 6f 74 66 cd |/GothamBold.otf.|
Interestingly, Siegfried doesn't mind, and uses the container signature from PRONOM:
[root@sh ~]# ./sf temp.zip temp-bad.zip
---
siegfried : 1.10.1
scandate : 2023-08-08T16:57:38+01:00
signature : default.sig
created : 2023-05-12T09:10:13Z
identifiers :
- name : 'pronom'
details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
---
filename : 'temp.zip'
filesize : 587228
modified : 2023-08-08T16:56:09+01:00
errors :
matches :
- ns : 'pronom'
id : 'fmt/483'
format : 'ePub format'
version :
mime : 'application/epub+zip'
class : 'Text (Structured)'
basis : 'container name mimetype with byte match at 0, 20'
warning : 'extension mismatch'
---
filename : 'temp-bad.zip'
filesize : 245119196
modified : 2023-08-08T16:40:31+01:00
errors :
matches :
- ns : 'pronom'
id : 'fmt/483'
format : 'ePub format'
version :
mime : 'application/epub+zip'
class : 'Text (Structured)'
basis : 'container name mimetype with byte match at 0, 20'
warning : 'extension mismatch'
So, if we could run Siegfried/Fido on the remote ZIP, we could ID the file pretty effectively.
Found httpio, which provides a file-like API to any HTTP resource that supports range requests, so this could be used to support fast, full format identification via a modified version of Fido.
It doesn't look like httpio is being updated much, but there is a BBC fork that could be used instead if necessary.
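A minimal sketch of the range-request idea, using only the Python standard library rather than httpio itself. The names `RangeReader`, `http_fetch`, `http_size` and `open_remote_zip` are hypothetical, and the sketch assumes the server honours `Range` headers and reports a `Content-Length` on `HEAD`. It wraps a "fetch these bytes" callable in a seekable file object, which is enough for `zipfile` to read the central directory and individual entries without pulling the whole archive.

```python
import io
import urllib.request
import zipfile


class RangeReader(io.RawIOBase):
    """Seekable, read-only file object backed by a fetch(start, end) callable."""

    def __init__(self, fetch, size):
        super().__init__()
        self._fetch = fetch  # returns the bytes for the inclusive range start..end
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        target = base + offset
        if target < 0:
            raise OSError("negative seek position")
        self._pos = target
        return self._pos

    def readinto(self, buffer):
        if self._pos >= self._size or len(buffer) == 0:
            return 0  # end of file
        end = min(self._pos + len(buffer), self._size) - 1
        data = self._fetch(self._pos, end)
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)


def http_size(url):
    """Total size of the remote resource, taken from a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return int(response.headers["Content-Length"])


def http_fetch(url):
    """Build a fetch(start, end) callable that issues HTTP range requests."""
    def fetch(start, end):
        request = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(request) as response:
            if response.status != 206:
                raise OSError("server ignored the Range header")
            return response.read()
    return fetch


def open_remote_zip(url):
    """Open a remote ZIP lazily; only the ranges zipfile asks for are fetched."""
    raw = RangeReader(http_fetch(url), http_size(url))
    return zipfile.ZipFile(io.BufferedReader(raw, buffer_size=64 * 1024))
```

Keeping the fetch step as a plain callable means the same reader could sit on top of httpio, the BBC fork, or a test double, and a Fido-style matcher would only ever see an ordinary file object.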
Assuming Fido can correctly identify these ePubs...
[root@sh ~]# fido temp-bad.zip
FIDO v1.6.1 (formats-v109.xml, container-signature-20200121.xml, format_extensions.xml)
OK,769,fmt/483,"ePub format","ePub format",245119196,"temp-bad.zip","application/epub+zip","container"
...which it can, then this project could be modified into a more generic Interject service that checks formats and fixes up Content-Type headers, opens up ZIPs, etc. If we take that route, this should probably also absorb the convert-format-to-HTML and other format manipulations that are currently part of ukwa-pywb:
https://github.com/ukwa/ukwa-pywb/blob/eec2b802213783395890311106aea09ca1630191/config.yaml#L25-L44
One important note on this: the current system prevents direct downloads via URL hacking, because clients can only go via ukwa-pywb, which proxies requests, and the 'raw' service is only visible on the server side. Pushing the format blocking upstream might allow that to be circumvented.
That said, we primarily rely on the use of secure PCs or the NPLD Player to provide content security, so it's not a critical issue.
Okay, need to split this into short-term (#4) and longer-term.
There are some malformed ePubs that do not have the right file signatures.
And this caused problems during ingest, which means the download has this content type:
Whereas for well-formed ePubs, we see:
The system therefore can't invoke the ePub reader, so they end up as just a download, as they are marked as text/plain (which is also the failover mode, see https://github.com/ukwa/epub-streamer/blob/05aa44a1d4118a3d6ebddac00193770a1ade6593/streamer.py#L48-L51).
Then the format-based download-blocking logic in ukwa-pywb fails to block the download (because text/plain is always allowed). It seems the browser realises it's not text and converts it into a ZIP download.
It's not clear how best to resolve this. First issue is that we need to block downloads - fixing things up so the borked ePubs work is a secondary issue.
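One possible shape for the blocking side, as a sketch only: if the source claims `text/plain` but the body starts with ZIP magic bytes, relabel it before it reaches the type-based blocking. The function name `effective_content_type` is hypothetical, and it is an assumption that relabelling as `application/zip` would cause the existing ukwa-pywb rules to block the download.

```python
# Local file header, and the end-of-central-directory record of an empty archive.
ZIP_MAGIC = (b"PK\x03\x04", b"PK\x05\x06")


def effective_content_type(upstream_type, head):
    """Return the Content-Type to pass downstream.

    upstream_type: the Content-Type header from the source service.
    head: the first few bytes of the response body.
    """
    base = upstream_type.split(";", 1)[0].strip().lower()
    if base == "text/plain" and head.startswith(ZIP_MAGIC):
        # Body is a ZIP container mislabelled as text: relabel it so that
        # type-based blocking no longer sees always-allowed text.
        return "application/zip"
    return upstream_type
```

Genuine plain text passes through unchanged, so this only affects responses where the declared type and the leading bytes disagree.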
This service acts as a proxy for all DLS content, meaning PDFs and ePubs mostly. So, when we pass on the source Content-Type, perhaps we could intervene if we detect text/plain? Or possibly assume it's supposed to be an ePub and repair the interaction by fixing the MIME type?
Looking at the problematic file, the mimetype entry is there, as an uncompressed stream, but it is the second entry rather than the first (as the standard requires). Therefore, files like this can still be detected, but require a 'relaxed'/custom signature.
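A rough sketch of what such a relaxed check could look like, working only on the leading bytes of the file. It walks the ZIP local file headers in order and reports the position at which an uncompressed `mimetype` entry containing `application/epub+zip` appears. The function name `find_mimetype_entry` and the `max_entries` limit are hypothetical. Surrounding whitespace is stripped before comparing, because the 'good' file in the hex dump carries a trailing CRLF in its 22-byte mimetype payload. The walk gives up if it meets an entry that uses a data descriptor (flag bit 3), since the local header sizes cannot then be relied on.

```python
import struct

# ZIP local file header: signature, version, flags, method, time, date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
EPUB_MIME = b"application/epub+zip"


def find_mimetype_entry(head, max_entries=8):
    """Return the 0-based position of a valid ePub mimetype entry, or None."""
    offset = 0
    for index in range(max_entries):
        if len(head) < offset + LOCAL_HEADER.size:
            return None  # ran out of bytes before finding it
        (sig, _version, flags, method, _time, _date, _crc,
         csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(head, offset)
        if sig != b"PK\x03\x04":
            return None  # not a local file header (e.g. central directory reached)
        name_start = offset + LOCAL_HEADER.size
        data_start = name_start + name_len + extra_len
        if head[name_start:name_start + name_len] == b"mimetype":
            payload = head[data_start:data_start + csize]
            stored = method == 0  # 0 means no compression
            return index if stored and payload.strip() == EPUB_MIME else None
        if flags & 0x08:
            return None  # sizes live in a trailing data descriptor; cannot skip
        offset = data_start + csize
    return None
```

A result of `0` corresponds to a well-formed ePub, while `1` matches the 'bad' layout shown in the hex dump, where `META-INF/container.xml` comes first. A proxy could treat any non-`None` result as grounds for serving the file as `application/epub+zip`.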