richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

for text matches, return more positive PRONOM IDs where these are available #114

Open richardlehane opened 6 years ago

richardlehane commented 6 years ago

Where there are more precise format IDs for particular text encodings, e.g. x-fmt/282 for ANSI, return these instead of the generic x-fmt/111 plain text match.

(suggested by Greg Lepore)

richardlehane commented 6 years ago

I'm looking at this now and it will be a little bit untidy.

My text detection routine (which is based on the file tool's algo) returns these text types:

ASCII      // ASCII text
UTF7       // UTF-7 Unicode
UTF8BOM    // UTF-8 Unicode (with BOM)
UTF8       // UTF-8 Unicode
UTF16LE    // Little-endian UTF-16 Unicode
UTF16BE    // Big-endian UTF-16 Unicode
LATIN1     // ISO-8859
EXTENDED   // Non-ISO extended-ASCII
EBCDIC     // EBCDIC
EBCDICINT  // International EBCDIC

These don't map cleanly to PRONOM IDs.

PRONOM has these x-fmt IDs for various text types:

"x-fmt/14 (Macintosh Text File)"
"x-fmt/16 (Unicode Text File)"
"x-fmt/21 (7-bit ANSI Text)"
"x-fmt/22 (7-bit ASCII Text)"
"x-fmt/282 (8-bit ANSI Text)"
"x-fmt/283 (8-bit ASCII Text)"

These are all outline records and their meaning is a bit ambiguous. E.g. the ASCII that my text routine returns is ASCII proper (i.e. in the 0-127 byte range), so we could link it to x-fmt/22. But what is 8-bit ASCII? Does it map to Extended (i.e. extended Mac and IBM PC ASCII)? What is ANSI? Wikipedia suggests ANSI has no well defined meaning. Does it map to Latin1?

Complicating things further, the PRONOM database has unique IDs for UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE. But the IDs aren't traditional fmt or x-fmt IDs, they are "chr" IDs: chr/1, chr/2 etc. Suggest I could ignore these and just map to x-fmt/16 for the Unicode family?

So it won't be a clean mapping, but I suggest we could do this:

ASCII => x-fmt/22
UTF7 => x-fmt/16
UTF8BOM => x-fmt/16
UTF8 => x-fmt/16
UTF16LE => x-fmt/16
UTF16BE => x-fmt/16
LATIN1 => x-fmt/282 // 8-bit ANSI
EXTENDED => x-fmt/283 // 8-bit ASCII
EBCDIC => x-fmt/111
EBCDICINT => x-fmt/111
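
The mapping above could be sketched in Go roughly as follows. This is only an illustration of the proposal, not siegfried's actual code: the string keys mirror the text type names listed in this thread, and `proposedPUIDs`, `fallbackPUID` and `PUIDFor` are hypothetical names.

```go
package main

import "fmt"

// fallbackPUID is the generic plain text PUID that text matches map to today.
const fallbackPUID = "x-fmt/111"

// proposedPUIDs mirrors the mapping suggested in this comment.
// Keys are the text type names from the detection routine (assumed to be
// available as strings here, which is an assumption made for illustration).
var proposedPUIDs = map[string]string{
	"ASCII":     "x-fmt/22",
	"UTF7":      "x-fmt/16",
	"UTF8BOM":   "x-fmt/16",
	"UTF8":      "x-fmt/16",
	"UTF16LE":   "x-fmt/16",
	"UTF16BE":   "x-fmt/16",
	"LATIN1":    "x-fmt/282", // 8-bit ANSI
	"EXTENDED":  "x-fmt/283", // 8-bit ASCII
	"EBCDIC":    "x-fmt/111",
	"EBCDICINT": "x-fmt/111",
}

// PUIDFor returns the proposed PUID for a text type, falling back to
// x-fmt/111 for any type that is not in the map.
func PUIDFor(textType string) string {
	if puid, ok := proposedPUIDs[textType]; ok {
		return puid
	}
	return fallbackPUID
}

func main() {
	for _, t := range []string{"ASCII", "LATIN1", "EBCDIC"} {
		fmt.Println(t, "=>", PUIDFor(t))
	}
}
```

Keeping x-fmt/111 as the fallback means any text type left out of the map behaves as it does today.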

Thoughts anyone?

ross-spencer commented 6 years ago

x-fmt/111 is a plain-text file and I believe this is still a catch-all in Rosetta. I would suggest that this is ASCII - but understand there are specific labels for that in PRONOM. That’s really just a first observation. I think to make x-fmt/111 EBCDIC would be like overriding a de-facto standard with another.

Does SF have any other signatures that exist and map to PRONOM where DROID doesn’t also have an ability to identify the same information?

marhop commented 6 years ago

Worth mentioning that there's fmt/159 "EBCDIC-US". However, with all the different EBCDIC codepages, one PUID alone won't be enough.

~I wonder if PRONOM has (or should have) another concept for encodings, like x-fmt/111 "plain text" + enc/123 "ASCII" or something similar ...~ Ignore that, I should have read your first post more thoroughly regarding the chr prefix.

richardlehane commented 6 years ago

Thanks both for the comments.

So, with EBCDIC, could do:

ASCII => x-fmt/22
UTF7 => x-fmt/16
UTF8BOM => x-fmt/16
UTF8 => x-fmt/16
UTF16LE => x-fmt/16
UTF16BE => x-fmt/16
LATIN1 => x-fmt/282 // 8-bit ANSI
EXTENDED => x-fmt/283 // 8-bit ASCII
EBCDIC => fmt/159
EBCDICINT => x-fmt/111

But, yes, I agree with Ross that it's a bit weird that EBCDICINT would now be the only encoding that returns x-fmt/111 plain text, which is the PUID that most users have come to expect text to default to.

I should note that sf's current behaviour is to map all these text encodings to x-fmt/111 (and I copied this approach from Archivematica's fido plug-in which runs the file tool on unknowns and marks them as x-fmt/111 if file says text... so it isn't just Rosetta that has adopted x-fmt/111 as the text fall-back ID).

Options:

1) I could just leave as is. Currently the "basis" field has the encoding information when there is a text match. I could make that basis field a bit richer by including these specific PRONOM IDs. Then the info is there if a power user wants to parse it out of that field.

2) I could leave as is but make configurable through roy.

Such an option could be a simple boolean flag if we're agreed on the list above:

roy build -textenc

Or a roy option could allow users to give their own text-encoding/PUID map like so:

roy build -textenc=ASCII,x-fmt/22,UTF7,x-fmt/16,UTF8BOM,x-fmt/16,UTF8,x-fmt/16,UTF16LE,x-fmt/16,UTF16BE,x-fmt/16,LATIN1,x-fmt/282,EXTENDED,x-fmt/283,EBCDIC,fmt/159,EBCDICINT,x-fmt/111
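
A flat, alternating encoding,PUID list like the one above could be parsed along these lines. This is a minimal sketch of the proposed option only: `-textenc` is a suggestion in this thread rather than an existing roy flag, and `parseTextEnc` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTextEnc turns a value such as "ASCII,x-fmt/22,UTF8,x-fmt/16" into a
// map from text encoding name to PUID. The alternating format follows the
// example in this thread; the validation rules are assumptions.
func parseTextEnc(s string) (map[string]string, error) {
	fields := strings.Split(s, ",")
	if len(fields)%2 != 0 {
		return nil, fmt.Errorf("textenc: expected encoding,PUID pairs, got %d fields", len(fields))
	}
	m := make(map[string]string, len(fields)/2)
	for i := 0; i < len(fields); i += 2 {
		enc := strings.TrimSpace(fields[i])
		puid := strings.TrimSpace(fields[i+1])
		if enc == "" || puid == "" {
			return nil, fmt.Errorf("textenc: empty value in pair %d", i/2+1)
		}
		m[enc] = puid
	}
	return m, nil
}

func main() {
	m, err := parseTextEnc("ASCII,x-fmt/22,LATIN1,x-fmt/282,EBCDIC,fmt/159")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m["ASCII"], m["LATIN1"], m["EBCDIC"])
}
```

Rejecting an odd number of fields catches the most likely typo (a missing PUID) before a signature file gets built with a partial map.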

marhop commented 6 years ago

Wouldn't it be wise to sort out the format vs. encoding thing in PRONOM first? I wonder what was intended with the chr/ entries. Maybe @Dclipsham can help?