richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Rules for identifying text encoding / text #67

Closed ross-spencer closed 8 years ago

ross-spencer commented 8 years ago

Hi Richard,

What are the rules for SF identifying the encoding of a plain-text file? - I presume it is based on just the extension .txt which will return x-fmt/111 which will then lead to identification of the encoding...

I have a number of .doc files (emails I believe saved with this extension from a mail client) but they're essentially plain text.

Like DROID, the multiple-id for .DOC is returned.

Is there a way to run a file without a positive ID through the encoding handler to test the likelihood of it being plain-text, or do you think this is creeping into the wrong style of identification?

N.B. I did consider this for the droid-list, but it makes little sense in DROID until they implement an encoding handler.

Cheers,

Ross

richardlehane commented 8 years ago

Hi Ross

The Identify() method encodes this rule: https://github.com/richardlehane/siegfried/blob/master/siegfried.go#L257

SF runs the following matching routines in order:

  1. extension matcher
  2. MIME matcher
  3. container matcher
  4. byte matcher
  5. text matcher

(Matcher is an interface so additional matching routines can be added over time - e.g. I'm currently working on an XML matcher).

At two points, after the container matcher and byte matcher run, SF asks each of the Identifiers whether they are satisfied. (Identifier is another interface and represents a set of identifications e.g. the PRONOM set, or the Tika set, etc. Multiple identifiers can subscribe to a single matching process though normally users just have the one "pronom" identifier ... the default.) Once all identifiers are satisfied, scanning stops.

The Satisfied() method for the PRONOM identifier is defined here: https://github.com/richardlehane/siegfried/blob/master/pkg/pronom/identifier.go#L291

Basically this says: if I have low confidence (only an extension or MIME type match) and the byte matcher is the next matcher to run, I'm not yet satisfied. If, however, I have low confidence but the next matcher is the text matcher (so we've already done byte matching), then I am satisfied. This effectively gives extension matches precedence over the text matcher. The reason for this is that we don't want x-fmt/111 overriding more precise identifications based on extension, such as CSS or JS. The effect is that we'll only do text matching if the result would otherwise be UNKNOWN (no extension or MIME match). The one exception is that the text matcher will run if we've extension matched ".txt", so that we can provide encoding information in the basis field of the match.
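To make the rule concrete, here is a minimal sketch of the decision as described in this comment. It is written in Python for brevity and is not the actual Go code behind the link above; the function name, its parameters, and the treatment of a positive ID as always satisfied are assumptions made for illustration.

```python
def satisfied(next_matcher: str, positive_id: bool, low_confidence: bool,
              matched_txt_extension: bool = False) -> bool:
    """Hypothetical model of the PRONOM identifier's satisfied check.

    next_matcher is "byte" (asked after the container matcher) or
    "text" (asked after the byte matcher).
    """
    if positive_id:
        # Assumption: a byte or container match always satisfies the identifier.
        return True
    if next_matcher == "byte":
        # Byte matching has not run yet, so an extension or MIME match
        # (or no match at all) is not enough to stop.
        return False
    # The next matcher is the text matcher. A low-confidence match stops
    # the scan unless the extension was ".txt"; no match at all falls
    # through to text matching.
    return low_confidence and not matched_txt_extension
```

Under this model a plain-text file renamed to ".doc" gives `satisfied("text", False, True)`, which is true, so the text matcher never runs and only the extension matches are reported.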

Am happy to look at changing this rule so that the text matcher always runs when low confidence (extension or MIME only match).

richardlehane commented 8 years ago

A possible refinement could be to halt before text matching if:

  1. we have a byte or container match (positive ID)
  2. we have an extension or MIME match for which no byte or container signature has been defined (like CSS or JS), keeping the ".txt" exception.

This would mean that we would now proceed to text matching if we have an extension match but the corresponding byte/container signatures for that fmt haven't matched.

Not 100% sure about this though because some users might still prefer UNKNOWN in this case so that they are prompted to look more closely. Also, in your example I think there are some ".doc" puids without byte or container signatures (i.e. where an extension-only match is possible). E.g. the Macintosh doc formats. So this fix wouldn't meet your use case anyway. It might be worth prompting TNA to remove these extension-only doc signatures as they probably cause more harm than good!
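The proposed refinement can be sketched the same way. This is an illustrative Python model only, not siegfried code: the `Candidate` type, its fields, and the placeholder format labels are hypothetical, and no real PUIDs other than x-fmt/111 are implied.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    """A format matched by extension or MIME type only (hypothetical type)."""
    fmt_id: str
    has_signature: bool          # a byte or container signature is defined
    is_plain_text: bool = False  # the ".txt" / x-fmt/111 exception


def halt_before_text(positive_id: bool, candidates: Iterable[Candidate]) -> bool:
    """Return True if scanning should stop before the text matcher."""
    if positive_id:
        # Rule 1: a byte or container match is a positive ID.
        return True
    # Rule 2: halt if any extension/MIME candidate has no byte or container
    # signature defined, except for the plain-text format itself.
    return any(not c.has_signature and not c.is_plain_text for c in candidates)


# Placeholder labels, not real PUIDs.
doc_candidates = [
    Candidate("doc-with-signature", has_signature=True),
    Candidate("doc-extension-only", has_signature=False),
]
```

With `doc_candidates` the function still returns true because of the extension-only entry, which is why this refinement would not help with the ".doc" files in question.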

richardlehane commented 8 years ago

pps there is a way to trick sf into text matching everything.

Just add a second identifier that can only identify text:

roy add -limit x-fmt/111

This will give you a second result for every single file that will tell you whether or not the file is also legal text. And it should only marginally slow things down.
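Why the trick works follows from the rule that scanning stops only once all identifiers are satisfied. The toy Python model below illustrates that; the matcher order is inferred from this thread, and the two identifier functions are hypothetical stand-ins rather than anything from the siegfried code base.

```python
# Matcher order as inferred from this thread (an assumption).
MATCHERS = ["extension", "mime", "container", "byte", "text"]


def matchers_run(identifiers):
    """Return the names of the matchers that run before scanning stops.

    Each identifier is a callable that takes the name of the next matcher
    and returns True if it is satisfied.
    """
    ran = []
    for i, name in enumerate(MATCHERS):
        ran.append(name)
        # Identifiers are consulted after the container and byte matchers.
        if name in ("container", "byte"):
            next_matcher = MATCHERS[i + 1]
            if all(ident(next_matcher) for ident in identifiers):
                break
    return ran


def pronom_with_extension_match(next_matcher):
    # Low-confidence match: not satisfied before byte matching,
    # satisfied once only the text matcher remains.
    return next_matcher == "text"


def text_only(next_matcher):
    # An identifier limited to x-fmt/111 can only be answered by the
    # text matcher, so it is never satisfied before then.
    return False
```

With only `pronom_with_extension_match`, the run stops after the byte matcher; adding `text_only` keeps the scan going through the text matcher for every file.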

ross-spencer commented 8 years ago

Thanks for the information! I'll put it forward to the team here too, just to get an idea of what might be good behavior.

I want to be careful not to be greedy (with DROID/SF etc.); there's a limit to what they should be doing.

My original need may sit outside of both of these tools - put simply, the requirement is - 'identifies text/character based formats and their encoding'.

It's a fallback - if I know the content of a file is all character information and I can understand the encoding - I may have less to worry about going into 'a' system. But also, in this case, I can arbitrarily rename any text file .doc and it slips through the net with the IDs that PRONOM records - it might indeed be an approach to try and help PRONOM figure out an alternative here too.

I'll have a look at the technique with the second identifier - good idea! - it could be useful for triaging.

Ultimately, it seems there may simply be a need to incorporate another tool in the workflow.

richardlehane commented 8 years ago

Cool - I'm a bit hesitant to rush into changing this part of the code as it will impact the results users get (always want to be a bit careful about that!).

If you wanted a standalone tool just to do fast text detection you could create a text-only signature file and run sf with that:

roy build -limit x-fmt/111 -name textOnly text.sig
sf -sig text.sig ...

(and in the roy add example I gave above - probably best to include a name for the identifier too.. so roy add -limit x-fmt/111 -name textOnly)

richardlehane commented 8 years ago

added this to the Tips and Tricks page of the wiki: https://github.com/richardlehane/siegfried/wiki/Tips-and-Tricks#identifying-plain-text-encodings

ross-spencer commented 8 years ago

Thanks Richard. I might move this to droid-list in a few days to hold a bigger discussion.

In short - despite the clues pointing to a file being '.doc', we know that if it is 100% ASCII it isn't going to be anything OLE2 (or older Word), and so for the team here, I want to discuss:

And then for the droid-list I suspect there is something around the arrangement/categorization of text based formats. I will have a look at the PRONOM data model too.
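The point about a 100% ASCII file never being OLE2 can be checked mechanically: an OLE2 compound file starts with the eight bytes D0 CF 11 E0 A1 B1 1A E1, several of which fall outside the ASCII range. A small Python sketch of such a triage check follows; the function names are made up for illustration, and this is not how SF or DROID implement anything.

```python
# The eight-byte header that opens every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


def starts_with_ole2_header(data: bytes) -> bool:
    """True if the data begins with the OLE2 header bytes."""
    return data.startswith(OLE2_MAGIC)


def doc_is_really_text(data: bytes) -> bool:
    """Triage helper: a '.doc' that is pure ASCII cannot be an OLE2 file."""
    return is_ascii(data) and not starts_with_ole2_header(data)
```

In practice `data` would be the file's contents, or a leading portion of them if reading the whole file is too slow; a sample can only suggest, not prove, that the rest of the file is ASCII.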

richardlehane commented 8 years ago

tidying up issues