Open amayita opened 2 months ago
Hey @amayita Thanks for this detailed report. Apologies if my response is long, there is a bit to unpack!
Siegfried does do text identification (ASCII and other text encodings) but it gives priority to PRONOM signatures (including external signatures like file extensions). Generally speaking, this means that you will see the results of the text identification...
However, if the file does have a file extension that's in the PRONOM database, siegfried gives priority to that signal, even if it means returning UNKNOWN (because PRONOM's internal signatures have not matched or because there are multiple PRONOM possibilities based on file extension alone). The rationale for this is that it is better for users to get an UNKNOWN result with a detailed warning message, rather than get a positive result which might be inaccurate.
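To make that priority order concrete, here is a toy Python model of it. This is only a sketch of the behaviour described above, not siegfried's actual code: the `identify` function, the `Result` type, the wording of the warning and the example Word IDs are all made up for illustration, and the real matching logic has more cases than this.

```python
from dataclasses import dataclass
from typing import List, Optional

TEXT_ID = "x-fmt/111"  # PRONOM's plain text ID, as discussed in this thread


@dataclass
class Result:
    id: str
    warning: str = ""


def identify(signature_hit: Optional[str],
             extension_hits: List[str],
             looks_like_text: bool) -> Result:
    """Simplified model of the default priority order (hypothetical helper)."""
    # 1. A matching PRONOM internal signature wins outright.
    if signature_hit is not None:
        return Result(signature_hit)
    # 2. The extension is known to PRONOM but no signature confirmed it:
    #    stay conservative and report UNKNOWN, listing the candidates.
    #    (The warning text here is invented, not siegfried's real wording.)
    if extension_hits:
        return Result(
            "UNKNOWN",
            warning="no match; possibilities based on extension are "
                    + ", ".join(extension_hits),
        )
    # 3. Only when PRONOM has nothing to say does the text result surface.
    if looks_like_text:
        return Result(TEXT_ID)
    return Result("UNKNOWN", warning="no match")


# A plain text file named ASCII_text.doc: the extension points at Word
# formats (illustrative IDs), so the text result is never reached.
print(identify(None, ["fmt/39", "fmt/40"], looks_like_text=True))
# The same content with an extension PRONOM does not know about.
print(identify(None, [], looks_like_text=True))
```

In this model the ".doc" case stops at step 2, which is why the text match is never reported for it.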
Consider how many formats are text at their fundamental layer but in a structured format that may have a genuine PRONOM signature. E.g. XML formats with PRONOM signatures matching against namespaces or tags. If you had one of those XML formats that failed matching (because it was corrupt or differed somehow from the PRONOM signature), would you want siegfried to give an UNKNOWN result with information about the potential match or a positive result that the file is plain text (which may mean that the end user never checks to see if there are any issues with the file)?
Siegfried's defaults are generally conservative (it prefers to give UNKNOWN rather than give possibilities that might be incorrect) but you can tweak this. Building a signature file with the command roy build -multi comprehensive changes siegfried's behaviour and will give you any strong results, including x-fmt/111 for a text file with a ".doc" extension. But this may have consequences you don't want, e.g. multiple identifications for PDF files (e.g. a PDF/A file may be identified as both PDF 1.7 and PDF/A). More information about customising your signatures with roy is here.
You ask if there is any possibility of an upstream fix here. This would be to change the PRONOM database. I think it was fairly common for users to give ".doc" extensions to their plain text files and it might be reasonable to add the "doc" extension to PRONOM's text ID, x-fmt/111. You could ask the PRONOM team at TNA to do this but this change would affect downstream DROID users so might not be accepted.
If the TNA doesn't want this, you could make a custom signature of your own to do it:
E.g. if you make a file like this, and put it in a "custom" folder within your siegfried home folder, then build with roy build -extend x-fmt111.xml, you will get a text match for your file. As an Archivematica user, you are already using a custom signature file based on some of Artefactual's custom signatures. Rather than maintain your own separate custom signature file, you could also ask the Archivematica team if they would like this change for all their users.
One improvement might be to add the results of siegfried's text matching to the warning message for an unknown file. E.g. your result might still be unknown with that long list of possible formats based on extensions, but I might be able to add information about the text encoding to that warning message.
Another possible improvement could be a new build flag or build mode for roy to make UNKNOWNs default to the result of the text match. This would probably just be a build option however, rather than a new default setting. But this might be a default setting preferred by the Archivematica team as I think this is how they used to do identification anyway (i.e. do a first pass ID with siegfried or fido, then check any unknown results against the file tool and default to a text match if detected).
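That two-pass approach can be approximated outside siegfried today. The Python below is a rough sketch under some assumptions: the report dictionary mirrors the general shape of siegfried's JSON output (a "files" list whose entries carry "filename" and "matches" with an "id"), the MIME types are assumed to have been collected separately with something like file --brief --mime-type, and `apply_text_fallback` is a hypothetical helper name, not part of siegfried or Archivematica.

```python
from typing import Dict

TEXT_ID = "x-fmt/111"  # PRONOM's plain text ID, as discussed in this thread


def apply_text_fallback(sf_report: dict,
                        file_mime_types: Dict[str, str]) -> Dict[str, str]:
    """Return filename -> final ID, defaulting UNKNOWNs to plain text.

    sf_report is assumed to look like siegfried's JSON output;
    file_mime_types maps each filename to the MIME type reported by
    the file tool (both inputs are assumptions of this sketch).
    """
    final = {}
    for entry in sf_report.get("files", []):
        name = entry["filename"]
        ids = [m.get("id", "UNKNOWN") for m in entry.get("matches", [])]
        known = [i for i in ids if i != "UNKNOWN"]
        if known:
            # First pass succeeded: keep siegfried's answer.
            final[name] = known[0]
        elif file_mime_types.get(name) == "text/plain":
            # Second pass: the file tool saw plain text, so default to it.
            final[name] = TEXT_ID
        else:
            final[name] = "UNKNOWN"
    return final


report = {
    "files": [
        {"filename": "ASCII_text.doc",
         "matches": [{"ns": "pronom", "id": "UNKNOWN"}]},
    ]
}
mimes = {"ASCII_text.doc": "text/plain"}
print(apply_text_fallback(report, mimes))
```

Comparing against the exact MIME type, rather than looking for the word "text" in the file tool's description, avoids treating structured formats whose names contain that word as plain text.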
Anyway, apologies again for the length of this message, I'd be very interested in feedback here.
cheers Richard
@amayita sorry I should have checked your bio before posting rather than after!!! ... wherever I said "ask the archivematica team to do X", please just replace with, "do X" :)
Hello back, Richard!
You lost me at PRONOM, humble sysadmin here :smile: LOL
Still, thank you so much for your quick and detailed answer. I think I get the picture and am very thankful that you got into different approaches to solve this "issue".
I think you worded eloquently that this is a conversation about defaults instead of an issue itself?
Regarding the "b0rked" XML file, I'd rather have XML or TXT than unknown, but I know nothing about your realm, and whenever I think I've learned a lot I am quickly humbled ;)
In my use case, the issues with the file should have been detected way before it reaches archivematica, for digital preservation, but I also see the benefit of preserving something that might not work "perfectly". I guess this needs a case by case evaluation.
I'm also interested in seeing feedback from others (that actually know what they are talking about here, unlike me)! :smile:
Thanks again!
Hello there!
I am using Siegfried within archivematica to identify files, and came across one issue that is maybe "minor", but easy to fix?
When a plain ASCII text file has a .doc extension to its file name, like ASCII_text.doc, Siegfried fails, as it assumes it is a Word doc and does not even attempt to identify what's actually in there. The file command does identify the ASCII_text.doc file as ASCII text 😄
Is there any way we could improve this behavior upstream? Is this too small to waste time on this? I really think an ascii txt file should not fail to be identified, no matter the filename.
More info:
But siegfried output for the same file:
Thanks for any input on this!