richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Add PRONOM types to PRONOM identifier #209

Closed ross-spencer closed 1 year ago

ross-spencer commented 1 year ago

Exploration adding PRONOM types classification to the Siegfried PRONOM identifier.

Connected to: https://github.com/richardlehane/siegfried/discussions/207

This results in a new output from Siegfried which looks something like the following:

filename : 'testdata/skeleton-suite/x-fmt/x-fmt-95-signature-id-858.pwi'
filesize : 5
modified : 2020-07-05T19:53:49+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/95'
    format  : 'Inkwriter/Notetaker Document'
    version : 
    mime    : 
    type    : 'Word Processor'
    basis   : 'extension match pwi; byte match at 0, 5'
    warning : 

Note the addition of type : 'Word Processor'

NB. This will only show a value if the PRONOM identifier is configured with PRONOM reports, i.e. the PRONOM XML export from PRONOM itself. The DROID signature file still needs this information to be added; we believe this is on the way. I can attend the next PRONOM meeting at the beginning of the year to ask more.

Tests have been included as part of this feature. Additionally, source files have been changed so that they pass linting. These changes are in the third commit associated with the PR and may warrant special attention for accuracy, especially around the correctness of the documentation.

ross-spencer commented 1 year ago

NB. I was accidentally testing in "production" against this branch today and it looks like type may have crept into the DROID report. I need to go back to the DROID CSV creation and make sure that doesn't happen, and then likely include a DROID-specific test to make sure headers are output correctly.

NB. Also, chatted to the PRONOM crew today in the PRONOM weekly. David C checked DROID signature files for compatibility, and they look good. One question is whether the SOAP service that delivers the XML to DROID has a different take on this, but overall it sounds positive this can potentially be added. It is a good time to ask with other PRONOM changes in the works over the course of the year.

ross-spencer commented 1 year ago

A small DROID header test has been added here: https://github.com/richardlehane/siegfried/pull/209/commits/957c2e7aec97f30a3cb9746aaf1d563161158039

To clarify, the "TYPE" field is specific to the DROID CSV, and so using type in the standard YAML/JSON/CSV outputs of SF may not be a good idea, as it could be somewhat confusing.
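As a rough illustration of the kind of header check described above, here is a minimal sketch in Go. It is not the test from the linked commit: `checkHeader` is a hypothetical helper, and the column list is an abbreviated, assumed subset of a DROID-style header used only for demonstration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// checkHeader is a hypothetical helper: it reads the first record of a CSV
// document and reports an error if it differs from the expected header.
func checkHeader(output string, want []string) error {
	r := csv.NewReader(strings.NewReader(output))
	got, err := r.Read()
	if err != nil {
		return err
	}
	if len(got) != len(want) {
		return fmt.Errorf("header has %d columns, want %d", len(got), len(want))
	}
	for i := range want {
		if got[i] != want[i] {
			return fmt.Errorf("column %d is %q, want %q", i, got[i], want[i])
		}
	}
	return nil
}

func main() {
	// Assumed, abbreviated header; a real DROID CSV has more columns.
	want := []string{"ID", "PARENT_ID", "URI", "TYPE", "PUID"}

	// A header that matches the expectation.
	fmt.Println(checkHeader("ID,PARENT_ID,URI,TYPE,PUID\n", want))

	// A header where an extra (hypothetical) column has crept in.
	fmt.Println(checkHeader("ID,PARENT_ID,URI,TYPE,CLASS,PUID\n", want))
}
```

A check like this fails loudly if a new field leaks into an output format that downstream tools parse by column position.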

@robin-francois @richardlehane is Classification the preferred term for the format type/classification field? Does it make sense to others reading this?

LoC uses "content categories" to describe this: https://www.loc.gov/preservation/digital/formats/content/content_categories.shtml

PRONOM as we know uses "Classification".

Wikidata doesn't seem to have an equivalent predicate; it tends towards "instance of" and "use/used for" to describe something similar. There may be an equivalent predicate that I have missed.

robin-francois commented 1 year ago

Thanks @ross-spencer, that's a splendid job.

Wikipedia seems to use "type of format" in some pages. I have no favoured term. Either category or class seems appropriate instead of the misleading type.

ross-spencer commented 1 year ago

Thanks @robin-francois - both good options, perhaps preceded by "format" e.g. "format class" in the one example? Thinking about it I'll add these notes to the discussion page and then post it on digipres.club/Twitter and see if folks have an opinion too. I'm not sure why the discussion works differently from the PR comments - but I think maybe it does?

NB. it's also really easy to change these things, so once there's a name we've settled on, I can make those changes and the PR should be good for review.

prwheatley commented 1 year ago

Would be great to have this - and useful to have a decent standard set of content type categories (however imperfect they always are) for use in other areas. COPTR has its own that aligns/overlaps pretty well with LOC, albeit with some different titles and a few that aren't on the LOC list. https://coptr.digipres.org/index.php/Content_Types

richardlehane commented 1 year ago

thanks for all the work on this @ross-spencer ... it's looking good!

There's a chance that some of the current integrations with sf are relying on the number/order of fields (e.g. if they are using csv output they may expect certain elements in certain columns). I think adding this new "classification" field should become the new default but I wonder whether adding a flag to roy to build a signature file without the field (e.g. "-noclass" flag) might make sense as a fallback in case any integrations get broken? This could also be how signatures built with droid xml files only get handled.

To do this you'd probably need to add a new bool field in the PRONOM identifier to indicate whether or not classifications are used & then make the output functions check that field to give the two possible forms of output. What do you think?

richardlehane commented 1 year ago

merged into develop branch with this commit: https://github.com/richardlehane/siegfried/commit/98516b176a65481bc4da9a45be61b16b7555dc69

ross-spencer commented 1 year ago

@richardlehane I think that squish probably lost the attribution to me, unfortunately. I'm not 100% sure. The GitHub UI is a bit flaky on this. You probably wanted something like a Co-authored-by trailer: https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors#required-co-author-information. I noticed the fixup work. It's not that I didn't have time to do this; I just didn't have all the information about what needed changing, and was holding off until I heard back a little bit more from PRONOM, i.e. seeking closer alignment with PRONOM/DROID changes. Still, good to have this in.

richardlehane commented 1 year ago

hey @ross-spencer, sorry, I screwed up when doing that merge; I just should not have squashed it. I've fixed the authorship now on the develop branch, but unfortunately it looks like merging back into main will be painful. I've been merging in as much as possible as I'd like to cut the new release soon (maybe this weekend?), but if you need a bit more time for this one, let me know.

ross-spencer commented 1 year ago

Thanks @richardlehane. Squishing was a good instinct. With a bit of a heads up, I can help with anything like that in future. I was anticipating rebase/merge myself, but it's a bit of a (necessary) dance. Some of my commits were out of sequence (a price to pay for not having multiple PRs which I have found more often than not a greater headache than git-fu). I don't sense any particular concerns from the PRONOM team about releasing these changes into the world with Siegfried, that's more on me, so I don't think there's a need to hold back. The DROID changes are being investigated with no concrete promises, but it will be exciting to have that change to the DROID sig file in time.

richardlehane commented 1 year ago

ok so, in rectifying the authorship history, I completely messed up the develop branch... it needs to be sacrificed now to the gods of git. In my defence the stack overflow answer I was following did have 700+ points.

So... I've made a new "release" branch (https://github.com/richardlehane/siegfried/tree/release) and cherry picked the intervening commits from develop into it. This does change history a bit (dates are gone and it makes me committer for everything) but authors are fixed, history is linear, and it can be merged back into main without git screaming at me. Pls work off release for now if you've got any other things in train. I'll hose develop and after the 1.10 release will start a new develop branch.

Next time I'll take you up on that git consultancy offer, or can we just move to fossil?