[advice] Siegfried fails to identify a txt file when the filename extension is "wrong"

amayita commented 2 months ago

Hello there!

I am using Siegfried within Archivematica to identify files, and came across an issue that is maybe "minor", but easy to fix?

When a plain ASCII text file has a .doc extension to its file name, like ASCII_text.doc, Siegfried fails, as it assumes it is a Word doc and does not even attempt to identify what's actually in there.

The file command does identify the ASCII_text.doc file as ASCII text 😄

Is there any way we could improve this behavior upstream? Or is this too small to spend time on? I really think an ASCII text file should not fail to be identified, no matter the filename.

More info:

$ file bitstream_76d19645-b414-4c03-a268-27a6fb73f157.doc
bitstream_76d19645-b414-4c03-a268-27a6fb73f157.doc: ASCII text, with very long lines (690), with CRLF line terminators

But siegfried output for the same file:

$ sf bitstream_76d19645-b414-4c03-a268-27a6fb73f157.doc
---
siegfried   : 1.11.0
scandate    : 2024-07-29T09:58:17Z
signature   : archivematica.sig
created     : 2023-12-17T15:55:42+01:00
identifiers :
  - name    : 'archivematica'
    details : 'wikidata-definitions-3.0.0; extensions: archivematica-fmt2.xml, archivematica-fmt3.xml, archivematica-fmt4.xml, archivematica-fmt5.xml'
---
filename : 'bitstream_76d19645-b414-4c03-a268-27a6fb73f157.doc'
filesize : 12736
modified : 2024-04-24T15:36:38Z
errors   :
matches  :
  - ns      : 'archivematica'
    id      : 'UNKNOWN'
    format  :
    version :
    mime    :
    class   :
    basis   :
    warning : 'no match; possibilities based on extension are x-fmt/42, x-fmt/43, x-fmt/44, x-fmt/131, x-fmt/274, x-fmt/275, x-fmt/276, x-fmt/329, fmt/39, fmt/40, fmt/37, fmt/38, x-fmt/393, x-fmt/394, fmt/473, fmt/609, fmt/754, fmt/892, fmt/1282, fmt/1283, fmt/1688'

Thanks for any input on this!

richardlehane commented 2 months ago

Hey @amayita, thanks for this detailed report. Apologies if my response is long; there is a bit to unpack!

What is siegfried doing here?

Siegfried does do text identification (ASCII and other text encodings), but it gives priority to PRONOM signatures (including external signatures like file extensions). Generally speaking, this means that you will see the results of the text identification:

- if the file has a ".txt" extension (siegfried will confirm that the contents really are text), or
- if there is no PRONOM external signature at all (i.e. the file doesn't have a file extension, e.g. a README file, or the file extension isn't known by PRONOM).

However, if the file does have a file extension that's in the PRONOM database, siegfried gives priority to that signal, even if it means returning UNKNOWN (because PRONOM's internal signatures have not matched or because there are multiple PRONOM possibilities based on file extension alone). The rationale for this is that it is better for users to get an UNKNOWN result with a detailed warning message, rather than get a positive result which might be inaccurate.

Consider how many formats are text at their fundamental layer but in a structured format that may have a genuine PRONOM signature. E.g. XML formats with PRONOM signatures matching against namespaces or tags. If you had one of those XML formats that failed matching (because it was corrupt or differed somehow from the PRONOM signature), would you want siegfried to give an UNKNOWN result with information about the potential match or a positive result that the file is plain text (which may mean that the end user never checks to see if there are any issues with the file)?
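
For readers less familiar with PRONOM, the priority order described above can be sketched as a toy Python model. This is not siegfried's actual code (siegfried is written in Go and its matching is far more involved); the function name, the tiny extension table and the boolean inputs are all hypothetical, chosen only to illustrate the decision order. The PUIDs are the ones mentioned in this thread.

```python
from typing import Optional

# Hypothetical, heavily trimmed stand-in for PRONOM's extension table.
# The PUIDs are taken from this thread; the real list is much longer.
PRONOM_EXTENSIONS = {
    "doc": ["fmt/39", "fmt/40"],
    "txt": ["x-fmt/111"],
}
TEXT_PUID = "x-fmt/111"


def identify(extension: Optional[str],
             signature_match: Optional[str],
             looks_like_text: bool) -> str:
    """Toy model of the priority order described above (an illustration only)."""
    # A PRONOM internal (byte) signature match is taken as the result.
    if signature_match is not None:
        return signature_match

    candidates = PRONOM_EXTENSIONS.get((extension or "").lower())

    # No extension, or one PRONOM does not know: the text result is shown.
    if candidates is None:
        return TEXT_PUID if looks_like_text else "UNKNOWN"

    # A ".txt" extension: the text check confirms the extension's claim.
    if TEXT_PUID in candidates:
        return TEXT_PUID if looks_like_text else "UNKNOWN"

    # A known non-text extension with no signature match: stay conservative
    # and report UNKNOWN plus a warning, even if the bytes look like text.
    return "UNKNOWN (possibilities based on extension: " + ", ".join(candidates) + ")"


print(identify("doc", None, True))   # the ASCII_text.doc case from this issue
print(identify("txt", None, True))
print(identify(None, None, True))    # e.g. a README file
```

The last branch is the one at issue here: the extension is treated as a signal strong enough to suppress the text result.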

You can change the defaults

Siegfried's defaults are generally conservative (it prefers to give UNKNOWN rather than possibilities that might be incorrect) but you can tweak this. Building a signature file with the command roy build -multi comprehensive changes siegfried's behaviour and will give you any strong results, including x-fmt/111 for a text file with a ".doc" extension. But this may have consequences you don't want, such as multiple identifications for PDF files (e.g. a PDF/A file may be identified as both PDF 1.7 and PDF/A). More information about customising your signatures with roy is here.

Can this behaviour be improved upstream?

You ask if there is any possibility of an upstream fix here. This would be to change the PRONOM database. I think it was fairly common for users to give ".doc" extensions to their plain text files and it might be reasonable to add the "doc" extension to PRONOM's text ID, x-fmt/111. You could ask the PRONOM team at TNA to do this but this change would affect downstream DROID users so might not be accepted.

If the TNA doesn't want this, you could make a custom signature of your own to do it:

E.g. if you make a file like this, and put it in a "custom" folder within your siegfried home folder, then build with roy build -extend x-fmt111.xml, you will get a text match for your file. As an Archivematica user, you are already using a custom signature file based on some of Artefactual's custom signatures. Rather than maintain your own separate custom signature file, you could also ask the Archivematica team if they would like this change for all their users.

What improvements can be made to siegfried here?

One improvement might be to add the results of siegfried's text matching to the warning message for an unknown file. E.g. your result might still be unknown with that long list of possible formats based on extensions, but I might be able to add information about the text encoding to that warning message.

Another possible improvement could be a new build flag or build mode for roy to make UNKNOWNs default to the result of the text match. This would probably just be a build option however, rather than a new default setting. But this might be a default setting preferred by the Archivematica team as I think this is how they used to do identification anyway (i.e. do a first pass ID with siegfried or fido, then check any unknown results against the file tool and default to a text match if detected).
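
Until something like that exists, the two-pass approach mentioned above (siegfried first, then the file tool for anything unknown) can be approximated outside siegfried. The following is a minimal sketch, not Archivematica's actual implementation. It assumes that sf and file are both on the PATH, that sf -json reports each file under "files" with a "matches" list whose entries carry an "id", and that a MIME type of text/plain from file is good enough to fall back to x-fmt/111. The function names are hypothetical.

```python
import json
import subprocess

TEXT_PUID = "x-fmt/111"  # PRONOM's plain text ID, as mentioned in this thread


def resolve(sf_matches, file_mime):
    """Keep siegfried's ID unless it is UNKNOWN and `file` saw plain text."""
    known = [m["id"] for m in sf_matches if m.get("id", "UNKNOWN") != "UNKNOWN"]
    if known:
        return known[0]
    if file_mime == "text/plain":
        return TEXT_PUID
    return "UNKNOWN"


def identify_with_fallback(path):
    """First pass with siegfried, second pass with `file` (both assumed installed)."""
    sf = subprocess.run(["sf", "-json", path],
                        capture_output=True, text=True, check=True)
    # Assumed report layout: {"files": [{"matches": [{"id": ...}, ...]}]}
    matches = json.loads(sf.stdout)["files"][0]["matches"]

    mime = subprocess.run(["file", "--brief", "--mime-type", path],
                          capture_output=True, text=True, check=True).stdout.strip()
    return resolve(matches, mime)


# The decision step alone, using values like the ones in this issue:
print(resolve([{"id": "UNKNOWN"}], "text/plain"))
```

Keeping the decision in resolve, separate from the subprocess calls, makes the fallback rule easy to test or tighten. Note that the fallback discards the warning about possible formats, which is exactly the trade-off discussed earlier in this comment.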

Anyway, apologies again for the length of this message, I'd be very interested in feedback here.

cheers Richard

richardlehane commented 2 months ago

@amayita sorry I should have checked your bio before posting rather than after!!! ... wherever I said "ask the archivematica team to do X", please just replace with, "do X" :)

amayita commented 2 months ago

Hello back, Richard!

You lost me at PRONOM, humble sysadmin here 😄 LOL

Still, thank you so much for your quick and detailed answer. I think I get the picture and am very thankful that you got into different approaches to solve this "issue".

I think you put it eloquently: this is a conversation about defaults rather than an issue in itself?

Regarding the "b0rked" XML file, I'd rather have XML or TXT than unknown, but I know nothing about your realm, and whenever I think I've learned a lot I am quickly humbled ;)

In my use case, the issues with the file should have been detected way before it reaches archivematica, for digital preservation, but I also see the benefit of preserving something that might not work "perfectly". I guess this needs a case by case evaluation.

I'm also interested in seeing feedback from others (that actually know what they are talking about here, unlike me)! 😄

Thanks again!
