richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0
Preparing changes to add format types (classification) from DROID sig file #226

Open ross-spencer opened 1 year ago

ross-spencer commented 1 year ago

A handful of changes left over from https://github.com/richardlehane/siegfried/pull/209 that weren't merged, plus the completion of tests to back up some of those assertions.

Code changes assume the use of an element for file format types in a DROID signature file, but can be easily changed to an attribute. We'll rebase/edit the commit if that happens. Tests should still work. The third commit should be dropped before merging.

Obviously following review there may be more changes to wrap into this.

Assumed XML format:

        <FileFormat ID="655" MIMEType="video/x-msvideo"
            Name="Audio/Video Interleaved Format" PUID="fmt/5">
            <InternalSignatureID>51</InternalSignatureID>
            <Extension>avi</Extension>
            <!-- THIS FILE WILL EVENTUALLY BE REPLACED -->
            <FormatTypes>Audio, Video</FormatTypes>
        </FileFormat>
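
For reference, a minimal sketch of how the element form above could be read with Go's standard `encoding/xml` package. The `fileFormat` struct and `parseFileFormat` helper are hypothetical names used only for illustration; they are not taken from siegfried's source.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// fileFormat mirrors only the parts of a DROID <FileFormat> entry used here.
// Hypothetical type for illustration, not siegfried's own.
type fileFormat struct {
	ID          int      `xml:"ID,attr"`
	Name        string   `xml:"Name,attr"`
	PUID        string   `xml:"PUID,attr"`
	MIMEType    string   `xml:"MIMEType,attr"`
	Extensions  []string `xml:"Extension"`
	FormatTypes string   `xml:"FormatTypes"` // child element, as assumed in this PR
}

// sample is the entry shown above, minus the XML comment.
const sample = `<FileFormat ID="655" MIMEType="video/x-msvideo"
    Name="Audio/Video Interleaved Format" PUID="fmt/5">
    <InternalSignatureID>51</InternalSignatureID>
    <Extension>avi</Extension>
    <FormatTypes>Audio, Video</FormatTypes>
</FileFormat>`

// parseFileFormat decodes a single <FileFormat> entry.
func parseFileFormat(data string) (fileFormat, error) {
	var ff fileFormat
	err := xml.Unmarshal([]byte(data), &ff)
	return ff, err
}

func main() {
	ff, err := parseFileFormat(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %q\n", ff.PUID, ff.FormatTypes) // prints fmt/5: "Audio, Video"
}
```

If the format types move to an attribute, only the struct tag would need to change (for example to `xml:"FormatTypes,attr"`).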

Connected to https://github.com/richardlehane/siegfried/discussions/207

richardlehane commented 1 year ago

thanks @ross-spencer, nice to be prepped for a change to the droid file. Did you get a sense from the TNA crew that this might be on the horizon?

ross-spencer commented 1 year ago

> Did you get a sense from the TNA crew that this might be on the horizon?

No problem. And yes, I've been putting the case forward at the PRONOM bi-weeklies and Francesca has been following up on their end. The last update is that they'll likely reach out to you soon (a month? a few months maybe?) to ensure there are no other incompatibilities in the code here for any proposed changes to the signature file. There should be some other interesting things coming from that change too, but I won't ruin the surprise! (NB: this shouldn't result in any further code changes.)

ross-spencer commented 8 months ago

Update from the PRONOM team is that this is likely to appear on 27th/28th/29th November. It will have the same content as v.115 but with the addition of the formattypes tag, so I will look at correcting these issues this weekend or next weekend. I have been experimenting with staticcheck in other projects, but it seems reasonable to park that conversation for another time.

ross-spencer commented 8 months ago

Hi @richardlehane, I think the necessary changes are in here now (including reverting the docs): https://github.com/richardlehane/siegfried/pull/226/commits/a8c5b493820fb1847845743fa0fb53aa688c1be0

Changes are based on the latest sample DROID file: https://github.com/digital-preservation/PRONOM_Research/blob/0d0869ba854baf44e25e9f8543f4f0c4dc98273c/Test%20Releases/DROID_SignatureFile_Classification_V2.xml -- and no further changes have been discussed in the PRONOM bi-weekly.

It looks like format type/classification will appear as an attribute, so:

<FileFormat FormatType="Image (Raster), Dataset" ID="1807" Name="Nearly Raw Raster Data"
    PUID="fmt/1002" Version="1">
    <InternalSignatureID>1357</InternalSignatureID>
    <Extension>nrrd</Extension>
</FileFormat>

or:

<FileFormat FormatType="Audio, Video" ID="655" MIMEType="video/x-msvideo"
    Name="Audio/Video Interleaved Format" PUID="fmt/5">
    <InternalSignatureID>51</InternalSignatureID>
    <Extension>avi</Extension>
    <HasPriorityOverFileFormatID>2741</HasPriorityOverFileFormatID>
</FileFormat>
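
A minimal sketch of reading the attribute form with Go's standard `encoding/xml` package, included to illustrate why older signature files should keep working: when the `FormatType` attribute is absent, the decoded field is simply left empty. The `fileFormat` struct, `parseFileFormat` helper, and the `withoutType` sample are hypothetical and not taken from siegfried's source.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// fileFormat is a hypothetical struct for illustration only.
type fileFormat struct {
	ID         int    `xml:"ID,attr"`
	Name       string `xml:"Name,attr"`
	PUID       string `xml:"PUID,attr"`
	FormatType string `xml:"FormatType,attr"` // stays empty when the attribute is absent
}

// withType is the nrrd entry from the sample signature file above.
const withType = `<FileFormat FormatType="Image (Raster), Dataset" ID="1807" Name="Nearly Raw Raster Data"
    PUID="fmt/1002" Version="1">
    <InternalSignatureID>1357</InternalSignatureID>
    <Extension>nrrd</Extension>
</FileFormat>`

// withoutType is an assumed stand-in for the same entry in an older
// signature file that has no FormatType attribute.
const withoutType = `<FileFormat ID="1807" Name="Nearly Raw Raster Data"
    PUID="fmt/1002" Version="1">
    <InternalSignatureID>1357</InternalSignatureID>
    <Extension>nrrd</Extension>
</FileFormat>`

// parseFileFormat decodes a single <FileFormat> entry.
func parseFileFormat(data string) (fileFormat, error) {
	var ff fileFormat
	err := xml.Unmarshal([]byte(data), &ff)
	return ff, err
}

func main() {
	for _, doc := range []string{withType, withoutType} {
		ff, err := parseFileFormat(doc)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s class=%q\n", ff.PUID, ff.FormatType)
	}
}
```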

Building with noreports:

---
siegfried   : 1.11.0
scandate    : 2023-11-05T19:41:35+01:00
signature   : default.sig
created     : 2023-11-05T19:41:19+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V114.xml; container-signature-20230822.xml; built without reports'
---
filename : 'testdata/skeleton-suite/fmt/fmt-1002-signature-id-1357.nrrd'
filesize : 9
modified : 2023-09-16T22:53:24+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1002'
    format  : 'Nearly Raw Raster Data'
    version : '1'
    mime    : 
    class   : 'Image (Raster), Dataset'
    basis   : 'extension match nrrd; byte match at 0, 9'
    warning : 

and with reports:

---
siegfried   : 1.11.0
scandate    : 2023-11-05T19:42:18+01:00
signature   : default.sig
created     : 2023-11-05T19:42:07+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V114.xml; container-signature-20230822.xml'
---
filename : 'testdata/skeleton-suite/fmt/fmt-1002-signature-id-1357.nrrd'
filesize : 9
modified : 2023-09-16T22:53:24+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/1002'
    format  : 'Nearly Raw Raster Data'
    version : '1'
    mime    : 
    class   : 'Image (Raster), Dataset'
    basis   : 'extension match nrrd; byte match at 0, 9'
    warning :

Looks equivalent. New tests pass as anticipated. Let me know if it needs further revision. I can rebase into a single commit once it looks okay.
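
Since the `class` field in the output above carries a comma-separated string, a consumer of the results may want to split it into individual classifications. A minimal sketch, assuming a hypothetical `splitClasses` helper that is not part of siegfried:

```go
package main

import (
	"fmt"
	"strings"
)

// splitClasses breaks a comma-separated class string into trimmed,
// non-empty entries. Hypothetical helper for illustration only.
func splitClasses(class string) []string {
	var out []string
	for _, part := range strings.Split(class, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	for _, c := range splitClasses("Image (Raster), Dataset") {
		fmt.Println(c)
	}
}
```

An empty `class` value (as produced for formats with no classification) yields an empty slice here.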

I'm not sure how to handle replacing the test signature file when the time comes to merge. I think it will eventually just get replaced with the new one?

richardlehane commented 8 months ago

thanks @ross-spencer this all looks great to me!

ross-spencer commented 1 month ago

Note from TNA: they haven't managed to export the changes on their side just yet. From the sound of it, there's an additional process happening during signature export/publication that's removing the expected data. It's something they're looking at.