richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

text file with .dat extension will be returned as unknown #158

Closed: ledmirage closed this 2 years ago

ledmirage commented 3 years ago

Recently I noticed that sf identifies a text file with a .txt extension as text. If I change the extension to something else, such as .aaa, it is still detected as text.

However, if the text file has a .dat extension, it is reported as unknown.

What is the logic behind this? Is there a way to make sure a text file with a .dat extension is still detected as text?

sample output for .dat:

sf -home "c:\bin\sf" sample.dat

siegfried   : 1.8.0
scandate    : 2021-02-18T15:46:47+08:00
signature   : default.sig
created     : 2020-08-20T16:32:29+08:00
identifiers :
  - name    : 'pronom_custom'
    details : 'DROID_SignatureFile_V96.xml; container-signature-20200121.xml; extensions: dmp-v1.0-signature-file.xml, wsq-v1.0-signature-file.xml'

filename : 'sample.dat'
filesize : 11
modified : 2021-02-18T15:46:24+08:00
errors   :
matches  :
  - ns      : 'pronom_custom'
    id      : 'UNKNOWN'
    format  :
    version :
    mime    :
    basis   :
    warning : 'no match; possibilities based on extension are fmt/612, fmt/819, fmt/1228'

The same file with a different extension (.aaa) is okay:

sf -home "c:\bin\sf" sample.aaa

siegfried   : 1.8.0
scandate    : 2021-02-18T15:46:49+08:00
signature   : default.sig
created     : 2020-08-20T16:32:29+08:00
identifiers :
  - name    : 'pronom_custom'
    details : 'DROID_SignatureFile_V96.xml; container-signature-20200121.xml; extensions: dmp-v1.0-signature-file.xml, wsq-v1.0-signature-file.xml'

filename : 'sample.aaa'
filesize : 11
modified : 2021-02-18T15:46:39+08:00
errors   :
matches  :
  - ns      : 'pronom_custom'
    id      : 'x-fmt/111'
    format  : 'Plain Text File'
    version :
    mime    : 'text/plain'
    basis   : 'text match ASCII'
    warning : 'match on text only; extension mismatch'

Samples attached: both are simple text files containing the string "hello world".

sample.zip

richardlehane commented 3 years ago

Hi @ledmirage: you're getting UNKNOWN here because the .dat extension matches formats in the PRONOM registry. In this case sf assumes the file could be a malformed instance of one of those types, so it exits with UNKNOWN and a warning. As you've found, you'll only get an x-fmt/111 (plain text) match if the file has a ".txt" extension or if its extension matches nothing in the PRONOM database. That second case covers many other kinds of text file you see in the wild, such as "README" files.
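To make that rule concrete, here is a rough Python sketch of the decision as described in this comment. It is a simplified, hypothetical model and not siegfried's actual implementation: the function name and the tiny extension table are made up for illustration, with the PUIDs copied from the outputs quoted earlier in this thread.

```python
# Hypothetical, simplified model of the rule described above.
# This is NOT siegfried's code; names and data are illustrative only.

# Extension -> PUIDs. Only the entries visible in this thread are listed;
# the real PRONOM registry contains many more.
PRONOM_EXTENSIONS = {
    "txt": ["x-fmt/111"],
    "dat": ["fmt/612", "fmt/819", "fmt/1228"],
}

TEXT_PUID = "x-fmt/111"


def identify_text_only_file(filename: str) -> str:
    """Return the id reported for a file whose content matched only as plain text."""
    _, dot, ext = filename.rpartition(".")
    ext = ext.lower() if dot else ""  # no dot means no extension (e.g. README)
    candidates = PRONOM_EXTENSIONS.get(ext, [])

    if TEXT_PUID in candidates:
        return TEXT_PUID  # extension agrees with the text match
    if not candidates:
        return TEXT_PUID  # extension unknown to the registry: text match stands
    return "UNKNOWN"      # extension belongs to other formats: possibly malformed


for name in ("sample.txt", "sample.aaa", "README", "sample.dat"):
    print(name, "->", identify_text_only_file(name))
```

Under these assumptions only sample.dat comes back as UNKNOWN, which mirrors the two outputs in the report.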

You can use the roy tool to modify your signature file to give more control over how results are reported. Instructions for this are here: https://github.com/richardlehane/siegfried/wiki/Building-a-signature-file-with-ROY

There are a few approaches you could take, depending on the result you want...

1) Do you want "x-fmt/111" reported for the .dat files you are matching? Or are they a more specific file type that isn't in PRONOM at all? If the latter, you could try to get the file type registered with PRONOM and, in the meantime, add a custom format to your signature file. The command for this is: roy build -extend custom-fmt1.xml (write the custom signatures in DROID format, e.g. using this utility; the custom signature file should be placed in a "custom" directory within your siegfried home directory)

2) If you never expect any of the three .dat signatures in PRONOM to match any of the files in your repository, you could just exclude those signatures to get the result you want. Command for this is: roy build -exclude @.dat

3) Or you could use the "-multi" flag to try a more exhaustive mode of matching. This flag alters the rules that sf applies when deciding which formats to report. I'm not sure if this will help, but you could try: roy build -multi exhaustive
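Before choosing approach 2, it may help to see exactly which signatures sf considered on the basis of the extension. The sketch below pulls the PUIDs out of the warning string shown in the original report. The helper name is hypothetical, and it assumes the warning keeps the wording seen in that output.

```python
import re

# Matches PRONOM identifiers such as fmt/612 or x-fmt/111.
PUID_PATTERN = re.compile(r"\b(?:x-)?fmt/\d+\b")


def extension_candidates(warning: str) -> list:
    """Return the PUIDs mentioned in an sf warning string, in order of appearance."""
    return PUID_PATTERN.findall(warning)


# Warning text copied from the sample.dat output in this thread.
warning = "no match; possibilities based on extension are fmt/612, fmt/819, fmt/1228"
print(extension_candidates(warning))  # ['fmt/612', 'fmt/819', 'fmt/1228']
```

If none of the listed formats is expected in your collection, excluding them as in approach 2 is a reasonable choice.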