kitodo / kitodo-production

Kitodo.Production is a workflow management tool for mass digitization and is part of the Kitodo Digital Library Suite.
http://www.kitodo.org/software/kitodoproduction/
GNU General Public License v3.0

Generalisation of the Function "evaluate docket" #5573

Open PeterJunger opened 1 year ago

PeterJunger commented 1 year ago

Description

The Swiss federal archive proposes to take over the function evaluate docket in order to enable the pre-distortion of the archival records.

Related Issues

Enable the pre-distortion using bar code pages (order, envelope, document, documents, dossier, sub-dossier)

Expected Benefits of this Development

Archives need pre-distortion in order to map the structure of the archival records, as the structure has to be broken up for the digitisation process and is therefore no longer visible afterwards.

Estimated Costs and Complexity

The complexity of the development is medium.


oliver-stoehr commented 1 year ago

Structuring is currently done in the metadata editor. This feature request proposes an additional way of structuring, which is done with physical divider sheets by the user and automatically translated into structure elements by Kitodo.

How does the automatic structuring work?

The user places physical divider sheets between the sheets of the workpiece to be scanned. These divider sheets represent the individual logical structures of the workpiece. Each divider sheet contains human- and machine-readable info to identify the type of logical structure that is represented by this sheet.

The workpiece is scanned together with the divider sheets. An automatic task in the workflow evaluates the scans. The machine-readable part on the divider sheets helps Kitodo to identify them and decide how to handle them. Kitodo can create a new logical structure from the information found on the divider sheet and assign all following pages to this new logical structure. The scans of the divider sheets are automatically removed after the structuring.
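The evaluation step described above could be sketched roughly as follows. This is a minimal illustration, not Kitodo code: the `Scan` and `LogicalStructure` classes, the `structure_by_dividers` function, and the "unassigned" fallback for pages that precede the first divider sheet are all assumptions made for this example. It also assumes the machine-readable part of each divider sheet has already been decoded into a structure type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scan:
    filename: str
    # Structure type decoded from a divider sheet; None for a normal content page.
    divider_type: Optional[str] = None


@dataclass
class LogicalStructure:
    type: str
    pages: list = field(default_factory=list)


def structure_by_dividers(scans):
    """Group content pages under the most recent divider sheet.

    Returns (structures, divider_scans_to_remove).
    """
    structures = []
    to_remove = []
    current = None
    for scan in scans:
        if scan.divider_type is not None:
            # A divider sheet opens a new logical structure and is itself discarded.
            current = LogicalStructure(scan.divider_type)
            structures.append(current)
            to_remove.append(scan.filename)
        else:
            if current is None:
                # Assumption: pages before the first divider go into a fallback structure.
                current = LogicalStructure("unassigned")
                structures.append(current)
            current.pages.append(scan.filename)
    return structures, to_remove
```

Called with the scans in scanning order, the function returns the logical structures with their assigned pages plus the list of divider scans that the workflow task would delete afterwards.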

Divider sheets

The divider sheets contain generic information about the logical structure they represent. (This information is displayed in human- and machine-readable form, e.g. as plain text and a QR code.) For example, there might be one divider sheet for chapters, another for the table of contents and a third one for the book cover. These divider sheets are not process- or workpiece-specific. They can be used in different processes, and multiple divider sheets of the same type can be used in the same process (representing chapters, for example). However, the divider sheets are ruleset-specific, because they represent logical structures of a specific ruleset.
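To illustrate what "generic but ruleset-specific" could mean for the machine-readable part, here is a small sketch of a possible payload. The JSON layout, the `kitodo-divider` marker and both function names are hypothetical; the proposal does not define a format. Rendering the payload as a QR code and reading it back from a scan are left out.

```python
import json


def encode_divider_payload(ruleset, structure_type):
    """Build the text that would be printed as a QR code on a divider sheet."""
    return json.dumps(
        {"kind": "kitodo-divider", "ruleset": ruleset, "type": structure_type},
        sort_keys=True,
    )


def decode_divider_payload(text, expected_ruleset):
    """Return the structure type, or None if the text is not a divider payload.

    Raises ValueError if the sheet belongs to a different ruleset.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("kind") != "kitodo-divider":
        return None
    if data.get("ruleset") != expected_ruleset:
        raise ValueError("divider sheet belongs to another ruleset")
    structure_type = data.get("type")
    return structure_type if isinstance(structure_type, str) else None
```

Because the payload carries only the ruleset and the structure type, the same printed sheet can be reused across processes, while a sheet from another ruleset is rejected instead of silently creating a wrong structure.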

A new tab should be added to the "edit ruleset" page where the divider sheets for the ruleset can be configured. The configured divider sheets could be printed from a new button located in the templates list.

Example

A physical workpiece might have the following structure after placing the divider sheets:

The automatic structuring would create the following structure:

(Automatic pagination is not part of this feature. The page numbers are shown in this example for better readability.)

I estimate the costs for this development as high.

aetherfaerber commented 1 year ago

Structuring is currently done in the metadata editor. This feature request proposes an additional way of structuring, which is done with physical divider sheets by the user and automatically translated into structure elements by Kitodo.

How does the automatic structuring work?

The user places physical divider sheets between the sheets of the workpiece to be scanned. These divider sheets represent the individual logical structures of the workpiece. Each divider sheet contains human- and machine-readable info to identify the type of logical structure that is represented by this sheet.

The workpiece is scanned together with the divider sheets. An automatic task in the workflow evaluates the scans. The machine-readable part on the divider sheets helps Kitodo to identify them and decide how to handle them. Kitodo can create a new logical structure from the information found on the divider sheet and assign all following pages to this new logical structure. The scans of the divider sheets are automatically removed after the structuring.

I welcome this proposal to add an (in part) already developed solution to the core features. In particular, I second the need for an “additional way of structuring” or, more bluntly, for a way of not having to use the graphical metadata editor. (Not that there is anything wrong with it, but opening and using it is simply too cumbersome and time-consuming if it needs to be done for each item.)

As suggested, a generalization from the already existing implementation is needed and I think this should not only apply to the codebase but to the conceptualization in general.

If I understand correctly, the proposal as laid out above consists mainly of two new components:

  1. a new tab in the “edit ruleset” page where divider sheets can be configured and printed.
  2. a new process that can be added to workflows, which
     2a. detects the divider sheets, reads the barcodes etc. on them, and deletes the divider sheets;
     2b. adds the structural information for the images between the divider sheets to the Kitodo metadata.

From my point of view, step 2b (automatic assignment of structural information) is very important for a wide range of use cases. On the one hand, its inclusion alone may justify support for this entire proposal; on the other hand, it should not be tied to a single use case like divider sheets.

What other use cases?

  1. Already existing collections

This is what preoccupies me the most. Of course it is possible to insert image files of divider sheets into already existing collections. But if you have several million pre-existing images, you will have to automate that, and if you want to get anything useful out of it, different workflows will need to be applied depending, e.g., on the document type. In that case, you could make good use of a workflow engine such as Kitodo. But a workflow consisting of steps 2a and 2b described above plus an additional step 0, “add, depending on information xy, the divider sheets that will be removed again two steps later”, is, I hope we can all agree, not desirable. It would be far better if Kitodo could do the structuring based on the given data itself. Thus:

  2. Automatic structuring by metadata entries/filenames/other attributes

There are certain document types that will be structured exactly the same way every time and should always be scanned/saved in the order of this structure. I can trust the scanning personnel to do this without having them put in divider sheets (I trust them to do so at least as much as I trust them to insert the sheets correctly), and adding the sheets just costs time and is another repetitive and exhausting task. So let's skip it and do the structuring based on the doctype given in the imported metadata. In existing collections (see above), structural information is also often represented in filenames, which could therefore easily be converted into actual metadata containing structural information. Additionally, there are other file attributes (image size, for example) that could easily be used to detect cover pages, empty pages, inserted tabular sheets or maps, if there is a need for that.

  3. Automatic structuring by OCR/HTR.

This is where things get really interesting. Of course it is understandable that this won't be part of a first implementation, but the potential is huge. It should at least be kept in mind for the future if there is no clear decision to keep it out of scope for good.
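For the filename-based variant of use case 2, a minimal sketch might look like this. The naming convention `<structure>_<sequence>.tif`, the regular expression and the "unassigned" fallback are assumptions for illustration only; real collections will use their own conventions.

```python
import re
from itertools import groupby

# Hypothetical convention: "<structure>_<sequence>.tif", e.g. "chapter_0004.tif".
FILENAME_PATTERN = re.compile(r"^(?P<structure>[a-z]+)_(?P<seq>\d+)\.tif$")


def structure_of(filename):
    """Return the structure type encoded in a filename, or "unassigned"."""
    match = FILENAME_PATTERN.match(filename)
    return match.group("structure") if match else "unassigned"


def structure_from_filenames(filenames):
    """Group consecutive files that share the same structure type.

    The filenames must already be in page order; a repeated type that is
    interrupted by another type starts a new structure.
    """
    return [
        (structure, list(group))
        for structure, group in groupby(filenames, key=structure_of)
    ]
```

Grouping only consecutive files (rather than all files with the same prefix) keeps several chapters apart as long as something else, such as a title page, sits between them; if chapters follow each other directly, the convention would need a chapter number as well.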

What should be changed about the proposal then?

  1. As a bare minimum, make sure that the core process of automatic assignment of structural information described in 2b is well documented, so that anyone can write custom scripts that insert or modify the structural data as part of a workflow without someone having to open the metadata editor. Please excuse the fuss if that is already documented somewhere; I haven't been able to find that information.
  2. If possible, put a drop-down on the new tab in the “edit ruleset” page where one can choose between “automatic structuring by divider sheets” and “automatic structuring by custom query”. The custom query page asks for the query (e.g. a shell command) and a list of possible result ranges with the corresponding structural information.
  3. If possible, add the options “automatic structuring by metadata” and “automatic structuring by filenames” to the drop-down, each with a nicely prepared configuration page.
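
The "custom query" idea from point 2 could be sketched as follows. Everything here is hypothetical: the `RangeRule` class, the half-open range semantics and the function names are not part of Kitodo. The query is modelled as a plain callable returning a number per file; in practice it might wrap a shell command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeRule:
    low: float            # inclusive lower bound of the query result
    high: float           # exclusive upper bound of the query result
    structure_type: str   # structure assigned when the result falls in range


def classify(value, rules):
    """Return the structure type of the first matching rule, or None."""
    for rule in rules:
        if rule.low <= value < rule.high:
            return rule.structure_type
    return None


def structure_by_query(filenames, query, rules):
    """Run the query for each file and map its result to a structure type."""
    return {name: classify(query(name), rules) for name in filenames}
```

For instance, with a query that returns the image width in pixels, one rule could map unusually wide images to a "map" structure and another could map normal widths to "page"; files whose result matches no rule stay unclassified (`None`).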
aetherfaerber commented 1 year ago

The archives (Landesarchiv Hessen, Kreisarchiv Esslingen and Landesarchiv Schleswig-Holstein) wish to take over the function evaluate docket in order to enable the pre-distortion of the archival records.

As my comment may already suggest, this has not been decided on and, at least in our case, may not be the exact feature we need.

aetherfaerber commented 1 year ago

3. Automatic structuring by OCR/HTR.

This is where things get really interesting. Of course it is understandable that this won't be part of a first implementation, but the potential is huge. It should at least be kept in mind for the future if there is no clear decision to keep it out of scope for good.

A proposal in this direction is actually being prepared.