openpreserve / fido

Format Identification for Digital Objects (FIDO) is a Python command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated workflows.
http://openpreservation.org/technology/products/fido/

Adobe Illustrator 14 file identified as PDF 1.5, not AI #41

Closed mistydemeo closed 5 years ago

mistydemeo commented 10 years ago

This Adobe Illustrator sample is being misidentified in fido 1.3.1 using the PRONOM v70 signatures: https://github.com/artefactual/archivematica-sampledata/raw/master/SampleTransfers/Images/BBhelmet.ai

The file is an Illustrator 14 (CS4) file (fmt/563), but is being identified as PDF 1.5 (fmt/19). This isn't actually wrong per se (since AI files are a superset of PDF), but isn't fully accurate. DROID 6.1.2, using the same v70 signature files, correctly identifies the file as fmt/563.

techmaurice commented 10 years ago

This is caused by FIDO's default buffer size, which is 128 KB.

Your example file seems to have the PS subset header at an offset of ~478 KB, so FIDO never sees this header and skips to the EOF part of the signature.

If you increase the buffer size to, say, 512 KB, FIDO will correctly recognise the file.

Example: fido.py -bufsize 512000

You also might want to increase the default buffersize by changing the default settings in the code.
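
The effect described above can be illustrated with a small sketch. This is not FIDO's code: it assumes a scanner that only inspects the first and last `bufsize` bytes of a file, and the helper names (`marker_in_buffers`, `build_sample`) are hypothetical. The sample is synthetic, with the `%AI5_FileFormat` comment placed at roughly the offset reported for BBhelmet.ai.

```python
# Sketch only: this is not FIDO's code. It assumes a scanner that inspects
# just the first and last `bufsize` bytes of a file.

KB = 1024
AI_MARKER = b"%AI5_FileFormat"  # AI comment discussed in this thread


def marker_in_buffers(data: bytes, marker: bytes, bufsize: int) -> bool:
    """Return True if `marker` lies wholly inside the head or tail buffer."""
    if bufsize <= 0:
        return False
    head = data[:bufsize]    # bytes read from the beginning of the file
    tail = data[-bufsize:]   # bytes read from the end of the file
    return marker in head or marker in tail


def build_sample(marker_offset: int, trailing: int) -> bytes:
    """Build a synthetic PDF-like byte string with the marker at `marker_offset`."""
    header = b"%PDF-1.5\n"
    padding = b"\0" * (marker_offset - len(header))
    return header + padding + AI_MARKER + b"\0" * trailing + b"%%EOF\n"


# Marker at ~478 KB, followed by 300 KB of filler before the EOF comment.
sample = build_sample(marker_offset=478 * KB, trailing=300 * KB)

print(marker_in_buffers(sample, AI_MARKER, 128 * KB))  # default-sized buffer
print(marker_in_buffers(sample, AI_MARKER, 512000))    # value from the example above
```

With the synthetic sample, the 128 KB buffer misses the marker and the 512000-byte buffer finds it, which mirrors the behaviour reported for the real file.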

adamfarquhar commented 10 years ago

Interesting. This would be the first example that I’ve seen of a file that needs more than the default 128 KB to identify. I wonder if there is a better signature for AI 14? I’ve never looked at the format, but it would be surprising if one actually needed to look at 500 KB before knowing a file really is an AI 14 one.

Cheers,

Adam.


Adam Farquhar
Head of Digital Scholarship, Collections Division
The British Library, London NW1 2DB

techmaurice commented 10 years ago

Indeed interesting.

Unfortunately Adobe has not published specifications for this format (or maybe I just did not find them...)

After further examination it looks like the section between the PDF header and the AI subset header consists of

Based on this we might assume that the binary distance between the PDF header and the AI subset header is highly variable, and depends heavily on the presence, number and size of the items mentioned above.
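
To see how variable that distance is across a collection, one could measure it per file. The following is a minimal sketch, assuming the AI subset header can be located by the `%AI5_FileFormat` comment; the function name `min_bufsize_for` is hypothetical and not part of FIDO.

```python
from typing import Optional

AI_MARKER = b"%AI5_FileFormat"  # assumed marker for the AI subset header


def min_bufsize_for(data: bytes, marker: bytes = AI_MARKER) -> Optional[int]:
    """Return the smallest head-buffer size in bytes that contains the whole
    marker, or None if the marker does not occur in `data`."""
    offset = data.find(marker)
    if offset < 0:
        return None
    return offset + len(marker)


# Synthetic example: marker placed 1000 bytes into a PDF-like byte string.
demo = b"%PDF-1.5\n" + b"x" * 991 + AI_MARKER + b"\n%%EOF\n"
print(min_bufsize_for(demo))
```

Running this over real files (reading each with `open(path, "rb").read()`) would give a rough idea of what buffer size a given collection needs.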

techmaurice commented 10 years ago

Reopened for discussion

adamfarquhar commented 10 years ago

```
application/vnd.adobe.illustrator
Looking For Adventure
Yogesh Sharma
2012-02-06T17:31:28+05:30
2012-02-06T17:31:28+05:30
2012-01-12T16:09:39+05:30
Adobe Illustrator CS6 (Macintosh)
```

techmaurice commented 10 years ago

Heh, adventure indeed :+1:

techmaurice commented 10 years ago

Updated the section about read buffers in the FIDO Usage Guide.

anjackson commented 10 years ago

Would picking the format out of the XMP payload be more reliable than looking for the "%AI5_FileFormat" comment?

techmaurice commented 10 years ago

Possibly, Andy. I am currently playing around with this format in CS6, and looking at the XMP payload seems more reliable.

If the XMP payload proves to be more reliable, the advanced signature should be submitted to PRONOM. For the time being it will, of course, be added to the extension file...
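
For anyone wanting to compare the two approaches on their own files, here is a rough sketch. It is not a PRONOM signature and not FIDO code: it simply looks for the MIME string seen in the XMP sample posted earlier in this thread and for the `%AI5_FileFormat` comment. The function name `ai_evidence` is hypothetical, and a plain byte search is an assumption that ignores compressed or encoded streams.

```python
XMP_MIME = b"application/vnd.adobe.illustrator"  # value seen in the XMP sample
AI_COMMENT = b"%AI5_FileFormat"                  # comment used by the current check


def ai_evidence(data: bytes) -> dict:
    """Report which Illustrator hints occur anywhere in `data`."""
    return {
        "pdf_header": data.startswith(b"%PDF-"),
        "xmp_mime": XMP_MIME in data,
        "ai_comment": AI_COMMENT in data,
    }


# Synthetic examples: a plain PDF and a PDF-like file carrying both hints.
plain_pdf = b"%PDF-1.5\n...\n%%EOF\n"
ai_like = b"%PDF-1.5\n" + XMP_MIME + b"\n" + AI_COMMENT + b"\n%%EOF\n"

print(ai_evidence(plain_pdf))
print(ai_evidence(ai_like))
```

Recording both hints, together with their offsets if needed, across a test set would show which one turns up more consistently and closer to the start of the file.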

carlwilson commented 5 years ago

Closed due to lack of recent activity.