openpreserve / fido

Format Identification for Digital Objects (FIDO) is a Python command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.
http://openpreservation.org/technology/products/fido/

epub recognized as xls #32

Closed atomotic closed 11 years ago

atomotic commented 11 years ago

I tried several epub files; all of them show the same behaviour:

$ ./fido.py ~/Downloads/Zizek\ -\ Vivere\ alla\ fine\ dei\ tempi.epub

FIDO v1.1.2 (formats-v66.xml, container-signature-20121218.xml, format_extensions.xml)
OK,295,x-fmt/263,"ZIP Format","ZIP format",742241,"/Users/void/Downloads/Zizek - Vivere alla fine dei tempi [Ladri di biblioteche].epub","application/zip","container"
OK,295,fmt/61,"Microsoft Excel 97 Workbook (xls)","BIFF 8 & 8X Workbook (generic)",742241,"/Users/raffaele/Downloads/Zizek - Vivere alla fine dei tempi.epub","application/vnd.ms-excel","container"
FIDO: Processed      1 files in 386.89 msec,  3 files/sec

techmaurice commented 11 years ago

This probably has to do with the fact that there is no signature available yet for epub.

Looking at the container signature for fmt/61, the false match probably occurs because that signature has the same bytes at certain positions as your epub files.

Could you please send or attach a few epub files so I can take a look at them and possibly create a signature for them?

anjackson commented 11 years ago

There's an ePub signature here:

  <mime-type type="application/epub+zip">
    <acronym>EPUB</acronym>
    <_comment>Electronic Publication</_comment>
    <magic priority="50">
      <match value="PK\003\004" type="string" offset="0">
        <match value="mimetypeapplication/epub+zip" type="string" offset="30"/>
      </match>
    </magic>
    <glob pattern="*.epub"/>
  </mime-type>

techmaurice commented 11 years ago

Thanks, will add this to the extension xml file.

anjackson commented 11 years ago

I guess you may have to set it up so that this takes precedence over the ZIP signature.

Note that the above signature is consistent with the proposed 'file magic' given in this section of the ePub spec: http://www.idpf.org/epub/30/spec/epub30-ocf.html#app-media-type
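A minimal Python sketch of that check, assuming only the two byte tests in the signature above (the `PK\003\004` local file header at offset 0 and the string `mimetypeapplication/epub+zip` at offset 30). The names `looks_like_epub` and `make_zip` are hypothetical helpers for illustration; they are not part of FIDO or Tika.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b"PK\x03\x04"               # expected at offset 0
EPUB_MAGIC = b"mimetypeapplication/epub+zip"   # expected at offset 30


def looks_like_epub(head: bytes) -> bool:
    """Return True if the leading bytes satisfy both tests of the signature."""
    return (
        head.startswith(ZIP_LOCAL_HEADER)
        and head[30:30 + len(EPUB_MAGIC)] == EPUB_MAGIC
    )


def make_zip(first_name: str, first_data: bytes) -> bytes:
    """Build a small in-memory ZIP whose first entry is stored uncompressed."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(first_name, first_data, compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


# An EPUB-style container: 'mimetype' is the first entry, stored as-is.
epub_like = make_zip("mimetype", b"application/epub+zip")
# An ordinary ZIP with some other first entry.
plain_zip = make_zip("readme.txt", b"hello")

print(looks_like_epub(epub_like[:58]))
print(looks_like_epub(plain_zip[:58]))
```

The offset works because the fixed part of a ZIP local file header is 30 bytes long, so the first entry's name starts at offset 30. When that entry is stored uncompressed with no extra field, its content follows the name immediately.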

techmaurice commented 11 years ago

When added to extensions.xml, it takes precedence over PRONOM signatures.

Thanks for the link to the spec.

adamfarquhar commented 11 years ago

I guess that the file magic in the epub spec is just too weak to be that useful for identification in a broader context. The test for epub should be strengthened, similar to the tests for OOXML, ODF, JAR or any of the many other formats that are also based on ZIP.

Cheers,

Adam.


anjackson commented 11 years ago

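@adamfarquhar It's not that the ePub sig is not sensitive enough - there is no ePub signature in PRONOM: http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1270&strPageToDisplay=signatures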

adamfarquhar commented 11 years ago

Andy, yes; I see that the Tika signature is precise enough. I had scanned the XML too quickly. Perhaps the easiest fix then would be to get it added to PRONOM. Can you goose that along? It seems useful and not very controversial to add.

Cheers,

Adam.


anjackson commented 11 years ago

I'll suggest it to David Clipsham. (done)

vladox commented 11 years ago

I think the problem is that fido uses DROID 4; with DROID 6.1, ePub is correctly recognized as "fmt/483".

anjackson commented 11 years ago

Fido does not use DROID 4 - it doesn't use DROID at all. It uses the PRONOM database, which has this entry for ePub. That PRONOM entry only contains a file extension, which is how it identified your ePub file. PRONOM contains no internal 'magic number' signature for ePub, and so cannot identify ePub bytestreams without such contextual hints.

Dclipsham commented 11 years ago

Hi All,

I added a PRONOM container signature as of 18/12/12, but container signatures will not work with DROID 4 (DROID 6 is the minimum). I'll add a binary variant in the next release for backward compatibility, which we aim to produce w/c 22 July in conjunction with the next DROID release (probably 6.1.3).

vladox commented 11 years ago

I have actually found this link: http://www.nationalarchives.gov.uk/PRONOM/fmt/483

The "container" method is used to recognize it, so it seems that fido has to be extended to read the container signature.

From the Source description in that page:

"This format can be identified via a container signature in DROID version 6 or later. The PRONOM database cannot currently represent container signatures."

anjackson commented 11 years ago

Ah, my apologies, I missed the fact that there was a container signature. Fido only partially implements container signature support at present, which is why it doesn't work at the moment.

Kris-LIBIS commented 11 years ago

Hi,

We need this badly. The latest DROID does not do the trick either, so I worked around this by creating an extension:

  <format>
    <puid>fmt/483</puid>
    <name>ePub format</name>
    <version>1.0</version>
    <alias>EPUB</alias>
    <mime>application/epub+zip</mime>
    <extension>epub</extension>
    <has_priority_over>x-fmt/263</has_priority_over>
    <has_priority_over>fmt/61</has_priority_over>
    <signature>
      <name>EPUB file</name>
      <pattern>
        <position>BOF</position>
        <regex>(?s)\APK\x03\x04</regex>
      </pattern>
      <pattern>
        <position>BOF</position>
        <regex>(?s)\A.{30}mimetypeapplication/epub\+zip</regex>
      </pattern>
    </signature>
    <details/>
  </format>

Maybe this could be added to fido_extensions.xml until the container signatures work properly in fido?
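For anyone who wants to check the entry before adding it, here is a rough Python sketch that reads the two regexes and the `has_priority_over` values from an abbreviated copy of the XML above and applies them to a byte buffer. This is not FIDO's own loader: the helper names are hypothetical, and treating `has_priority_over` as "drop the listed PUIDs when this format also matches" is an assumption about its meaning.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the extension entry; only the fields used here are kept.
FORMAT_XML = r"""
<format>
  <puid>fmt/483</puid>
  <has_priority_over>x-fmt/263</has_priority_over>
  <has_priority_over>fmt/61</has_priority_over>
  <signature>
    <pattern>
      <position>BOF</position>
      <regex>(?s)\APK\x03\x04</regex>
    </pattern>
    <pattern>
      <position>BOF</position>
      <regex>(?s)\A.{30}mimetypeapplication/epub\+zip</regex>
    </pattern>
  </signature>
</format>
""".strip()

fmt = ET.fromstring(FORMAT_XML)
puid = fmt.findtext("puid")
overrides = {e.text for e in fmt.findall("has_priority_over")}
# Compile as bytes patterns so they can be applied to raw file content.
patterns = [
    re.compile(e.text.encode("ascii"))
    for e in fmt.findall("signature/pattern/regex")
]


def matches_signature(head: bytes) -> bool:
    """Every pattern of the signature has to match (hypothetical helper)."""
    return all(p.match(head) for p in patterns)


def apply_priority(matched_puids):
    """Drop PUIDs this format overrides (assumed meaning of has_priority_over)."""
    if puid in matched_puids:
        return [p for p in matched_puids if p not in overrides]
    return list(matched_puids)


# Synthetic buffer: ZIP magic, 26 filler bytes, then the EPUB mimetype string.
head = b"PK\x03\x04" + b"\x00" * 26 + b"mimetypeapplication/epub+zip"
print(matches_signature(head))
print(apply_priority(["x-fmt/263", "fmt/61", "fmt/483"]))
```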

techmaurice commented 11 years ago

Hi All, thanks for the comments and suggestions.

@Kris-LIBIS: I will publish an update of fido_extensions.xml ASAP. For the time being, you could add this ePub sig to fido_extensions.xml yourself.

And I will investigate why the container signature does not work properly.

techmaurice commented 11 years ago

The ePub signature has been added to fido_extensions.xml, and the update has been pushed with the 1.1.6 release.

It seems that the container signature itself is all right, but the precedence in the container signature file is set wrong. Adding the format information to the extension file works around this.

Please note that FIDO will still report it as a match from the container signature file. I will investigate what is wrong with the container signature file and send this information to PRONOM.

Not closing this issue yet...

techmaurice commented 11 years ago

The bug submitted by @atomotic has been fixed: FIDO now correctly matches ePub files as container-type using the PRONOM container file. The fixed version is tagged and committed as version 1.1.8.

The multiple-match bug was caused by the read_container() function matching only the first regex of a signature, where it should have required all of them to match (this applies when a signature consists of more than one regex).
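To illustrate the difference, here is a small Python sketch. It is not the actual read_container() code; the PUIDs, patterns, and function names are made up for the example. Two signatures share their first regex, so a first-regex-only test reports both, while requiring every regex leaves a single match.

```python
import re

# Hypothetical signatures: each maps a made-up PUID to a list of regexes.
SIGNATURES = {
    "fmt/A": [re.compile(rb"(?s)\AHEAD"), re.compile(rb"(?s)\A.{4}TYPE-A")],
    "fmt/B": [re.compile(rb"(?s)\AHEAD"), re.compile(rb"(?s)\A.{4}TYPE-B")],
}


def match_first_only(data: bytes):
    """Buggy behaviour: only the first regex of each signature is tested."""
    return [puid for puid, regexes in SIGNATURES.items()
            if regexes[0].match(data)]


def match_all(data: bytes):
    """Fixed behaviour: every regex of a signature has to match."""
    return [puid for puid, regexes in SIGNATURES.items()
            if all(r.match(data) for r in regexes)]


data = b"HEADTYPE-B payload"
print(match_first_only(data))  # ['fmt/A', 'fmt/B']
print(match_all(data))         # ['fmt/B']
```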

This fix affects matches for all signatures in the PRONOM container signature file, so please check your results if you rely on FIDO in a production environment.

The addition of the ePub signature to the extension file has been commented out for the time being, as this fix seems to tackle the issue.

Please report back if this fixes the issue for you.

Note that the read_container() function is not yet fully compatible with the container signature file and does not handle container signatures the way DROID does. It still lacks matching on byte positions and cannot yet parse OLE2 files properly.

Dclipsham commented 11 years ago

Backward-compatible versions of the signatures for ePub and Apple's iBooks were included in signature release v69, which became available on 19th July. This should assist users tied to older versions of DROID.

David

Kris-LIBIS commented 11 years ago

Hi Maurice,

Fido now correctly recognises the epubs. This did the trick.

Thanks.

Unfortunately a mime type is not included, but that's another problem.

techmaurice commented 11 years ago

Hi Kris,

Thanks for reporting back.

The mime type is not included because the entry is missing from PUID fmt/483. @Dclipsham, might you want to pick this up?

Dclipsham commented 11 years ago

Will do. Next release will be mid-late September, but I'll ensure this is included.

David

techmaurice commented 11 years ago

Thanks!

techmaurice commented 11 years ago

I stated earlier that the precedence for ePub was set wrong, but it turned out that was not the case.

Bug is confirmed to be fixed, closing this issue.