kitodo / kitodo-production

Kitodo.Production is a workflow management tool for mass digitization and is part of the Kitodo Digital Library Suite.
http://www.kitodo.org/software/kitodoproduction/
GNU General Public License v3.0
63 stars 63 forks source link

metadata "processTitle" creates an an additional <mets:dmdSec> for some document types #6024

Closed andre-hohmann closed 4 months ago

andre-hohmann commented 5 months ago

Describe the bug If a new process is created, the metadata element processTitle is written in an own <mets:dmdSec> in the internal METS file. This element is connected with the first <mets:div>-element in <mets:structMap TYPE="PHYSICAL"> - instead of <mets:structMap TYPE="LOGICAL">. This leads to some empty metadata elements in the exported METS-file. Probably, this depends on the XSLT configuration.

This behavior can be observed for processes of the type manuscript, monographs, volumes, ... For processes of the type newspaper year, newspaper issue, ephemera year and ephemera issue, the metadata processTitle is written in the existing <mets:dmdSec>, which is connected with <mets:structMap TYPE="LOGICAL">.

Probably, this is caused by:

See also:

To Reproduce Steps to reproduce the behavior:

  1. Create a process (monograph, volume) in Kitodo.Production 3.6.0 or a newer version.
  2. Open the internal METS file
  3. See the two <mets:dmdSec>-elements

Expected behavior The metadata processTitle should always be written in the existing <mets:dmdSec>, which is connected with <mets:structMap TYPE="LOGICAL">. According to the behavior of the processes of the type newspaper, ephemera (tile, year, issue).

Probably, this must be analyzed more deeply. It seems there are more consequences:

Release 3.6.2 - probably since 3.6.0

Desktop (please complete the following information):

Additional context

Internal METS file: Manuscript ``` xml 761642 HistdeIeu_1877958727 Historia de Iephta unigenitam ex voto filiam holocaustum Deo offerente carmine hero-ico conscripta - Mscr.Dresd.A.268 Handschriften Lateinische neuzeitliche Handschriften PDM1.0 Manuscript 34 Blätter Mscr.Dresd.A.268 fehlender_Schrifttyp lat 1877958727 1572 HistdeIeu Saxonica Heidelberg Historia de Iephta unigenitam ex voto filiam holocaustum Deo offerente carmine hero-ico conscripta - Mscr.Dresd.A.268 1572 Briesmann, Pascasius Briesmann http://d-nb.info/gnd/119630052 Pascasius aut 53910020X gnd VerfasserIn 119630052 Handschriftenportal https://resolver.staatsbibliothek-berlin.de/HSP0003C72600000000 ```
Internal METS file: Volume ``` xml 761641 GebuStPi1_1882036999_0006 0006 060 PDM1.0 1881717593 GebuStPi1 360 ungezählte Seiten Geburtsregister StA Pirna-Stadt 1882 - SAP P-I-XX-6 SAP P-I-XX-6 1882 fehlender_Schrifttyp Volume 1882036999 ger Saxonica LDP: Bestände des Archivverbundes Pirna Pirna 1882 [6] Geburtsregister StA Pirna-Stadt 1882 - SAP P-I-XX-6 ```
Internal METS file: Issue (Ephemera) ``` xml 761632 PDM1.0 Fraktur Performance Ephemera 01-Musiktheater PeriodicalIssue KniHo_796585679-1860010901_01-m Saxonica ger Programmhefte der Sächsischen Staatstheater Musik ```
andre-hohmann commented 5 months ago

In my opinion, this is a severe problem, because inconsistent internal metadata is created, which

  1. probably must be corrected to achieve consistent internal metadata.
  2. could lead to unnecessary configuration efforts of the export XSLT to solve at least the most obvious problems.

@solth : According to @henning-gerhardt, i should inform you explicitly, because for me this is a blocker for the productive use of the affected versions in SLUB Dresden. Since I was informed about this behavior by Leipzig University Library, it seems to be relevant for other institutions as well.

andre-hohmann commented 5 months ago

The additional element processTitle for all document types could possibly lead to considerable adjustments to the export XSLT in SLUB Dresden - regardless of the dmdSec in which the element is inserted.

The reason for is the derivation of the values from the elements processTitle and CatalogIDDigital for the creation of the content of several MODS fields.

If both elements are present in all METS files, the existing rules must be adapted, as a new combination exists now: processTitle AND CatalogIDDigital. Until now, only the combination processTitle OR CatalogIDDigital exists.

I wonder if it is possible to make the creation of processTitle for monographs, ... configurable to avoid adjustments in the XSLT.

For SLUB Dresden, the metadata element processTitle` is not necessary for most document types. As all of our current processes for monographs, manuscripts, ... do not contain the element, i cannot imagine a scenario to apply it.

BartChris commented 5 months ago

In my opinion, the problem lies in explicitly calling addAllowedMetadataRecursive for the physical structure here.

https://github.com/kitodo/kitodo-production/blob/5be1123d509c4ca81515f5b7f1cb84e00e36ca4b/Kitodo/src/main/java/org/kitodo/production/forms/createprocess/CreateProcessForm.java#L650-L655

When i comment out this line

addAllowedMetadataRecursive(workpiece.getPhysicalStructure(), keysForProcessTitle, processTitle);

the described generation of an additional DMDSec does not occur. If we wish to maintain this functionality, we should definitely avoid attaching metadata to the wrapper element for the actual pages, which is what happens here. Additionally, I am not even sure why we need to call that method for the physical structure here at all. Is there really a need to add the process title as metadata to a page, for example? I have already noted the problematic effects of accidentally adding metadata to pages here: https://github.com/kitodo/kitodo-production/issues/4921. The method in question is also called only during the creation of the process when (usually) no pages exist.

So i would only call addAllowedMetadataRecursive(workpiece.getLogicalStructure(), keysForProcessTitle, processTitle); and not addAllowedMetadataRecursive(workpiece.getPhysicalStructure(), keysForProcessTitle, processTitle);

solth commented 5 months ago

@andre-hohmann I added the blocking label to signal its severity. @BartChris I think there are definitely cases where adding metadata for individual pages is required, for example information about people depicted only in some images in a process. @matthias-ronge can you see whether the proposed fix would pose any unwanted side effects? Is there a reason for calling the code line in question for the physical structure?

andre-hohmann commented 5 months ago

@solth : thanks a lot! Maybe, for other institutions this issue might not be blocking, so both the issue and the label can be discussed. I would be very interested in the opinion of other users of Kitodo.Production.

I wanted to point out that the current implementation can lead to increased efforts for the export XSLT. I have not examined this in detail, but the main problem is that processTitle is always added. The deletion of the dmdSec via XSLT is difficult, because the DMDID in <mets:structMap TYPE="PHYSICAL"> needs to be deleted, too. Even if you apply other scripts, these efforts are in my opinion not really necessary. The export itself is complicated enough. If the processTitle is exported to avoid the deletion of dmdSec, you probably will receive warnings from the DDB analysis report, because the dmdSec does not contain any title information.

So, this might be complicated and for SLUB Dresden, i would first analyse the possible consequences and i want to avoid as many "repairs" as possible. For the other issue, it is clear that a wrong process title is assigned to superordinate processes:

I agree that it can/should be possible to apply metadata for a specific page, but these metadata is probably assigned on purpose and does not lead to "unwanted" dmdSec. Thus, this does not cause any problems or at least, i cannot imagine any.

andre-hohmann commented 5 months ago

In addition: The concept seems inconsistent to me because processTitle for newspaper issues, newspaper year, ... is not allocated to <mets:structMap TYPE="PHYSICAL">, but to <mets:structMap TYPE="LOGICAL">.

andre-hohmann commented 5 months ago

Here is short update with some new insights:

1 After consultation with @subhhwendt , i have learned, that this behavior only occurs, if processTitle is configured in the ruleset. As far as i know, this is only necessary, if processes for newspapers are created. Probably, only institutions are affected, which create processes for newspaper and non-newspaper-materials.

2 Additionally irritating is, that the processTitle in the new dmdSec is not shown in the metadata editor and cannot be deleted or corrected there. If it could be deleted, it would be an option to solve the problem.

andre-hohmann commented 5 months ago

Here is another suggestion to reduce the effect of the functional extension of key="processTitle" - in addition, that no additional dmdSec is created.

key="processTitle" should only be created, if it is explicitly assigned in the rule set to the document type or structure element. Then, for each institution it can be decided by demand, if the metadata is created or not.

Example:

<restriction division="NewspaperYear" unspecified="forbidden">
    <permit division="NewspaperMonth"/>
    <permit key="LABEL" maxOccurs="1"/>
    <permit key="ORDERLABEL" maxOccurs="1"/>
    <permit key="processTitle" maxOccurs="1"/>
</restriction>
matthias-ronge commented 5 months ago

If I understand correctly, this is just a configuration problem? If a metadata key has been marked as use="processTitle" and this metadata key is allowed in the ruleset for a division, then the process title is written there while the process is created. So far so logical.

So there are two options:

In this specific case, perhaps only one rule is missing in the ruleset, which does not allow metadata keys on the physSequence division:

<restriction division="physSequence" unspecified="forbidden">

    <permit division="page"/> <!-- paged media -->
    <permit division="track"/> <!-- a/v media -->
    <permit division="other"/> <!-- other media (like 3D objects) -->

    <!-- permit division="partitialOrder"/ --> <!-- possibly other divisions
                                                    if you work with two
                                                    structure trees / partial
                                                    orders -->
    <!-- NO METADATA keys at all! -->
</restriction>
BartChris commented 5 months ago

This would probably require the implementation of https://github.com/kitodo/kitodo-production/issues/4097

We definitely need documentation for the configuration of page and physSequence (once implemented) in the ruleset. Their absence can lead to unintended consequences if not properly configured. The issue lies not only in the lack of documentation but also in the fact that many institutions may not require the assignment of metadata elements to the physical structure. This aspect of adding metadata to the physical structure is also is not addressed in the METS-Anwendungsprofil. So the standard use case might never need this. While having the option is beneficial, improper configuration may result in unwanted DMDSecs. Therefore, it's crucial to ensure correct configuration by giving out this info in the documentation to avoid such issues.

andre-hohmann commented 5 months ago

I absolutely agree with @BartChris regarding the documentation.

We need "processTitle" for newspaper/ephemera processes (year and issues) and i only allow it for those division, but "processTitle" is still assigned to all processes.

Maybe the problem lies within the missing physSequence division in our ruleset. Thanks for the hint! I will check this as soon as possible.

BartChris commented 5 months ago

@andre-hohmann If i understand this issue correctly it is impossible right now to define rules for the physSequence. https://github.com/kitodo/kitodo-production/issues/4097

I have an additional question: Where in the UI would the user be able to edit the metadata for the element physSequence even if we make it configurable via the ruleset? And what kind of metadata would be typically added to the physSequence? I am still struggling seeing a real use case for that.

andre-hohmann commented 5 months ago

@BartChris :

If i understand this issue correctly it is impossible right now to define rules for the physSequence. https://github.com/kitodo/kitodo-production/issues/4097

Thanks! Then, i misunderstood @matthias-ronge hint:

Unfortunately, i will not be able to test changes of the configuration files. Or could it be, that in #4097 the aim is to create a use-attribute to define physSequence as root of the structure tree?

And what kind of metadata would be typically added to the physSequence? I am still struggling seeing a real use case for that.

I have neither an idea nor a use case for that, too.

I have an additional question: Where in the UI would the user be able to edit the metadata for the element physSequence even if we make it configurable via the ruleset?

Again, i have no idea and i think this should be worked out by the ones who wants to add metadata to the physSequence. Of course, this should be done in a way, that does not influence the current behavior in a negative way for the others users.

andre-hohmann commented 5 months ago

To avoid misunderstandings:

I would be glad, if processTitle would be in general applied to <mets:structMap TYPE="LOGICAL">. If other ones need it in physSequence, i am fine with it, as long as i do not have to apply it.

For me, it seems, as if the concept of implementation is not complete and/or the documentation is missing.

Probably, this should be discussed in a virtual conference, to find a solution.

And we have definitely adjust ruleset_default.xml, because with the current version, the described behavior occurs.

BartChris commented 5 months ago

In this specific case, perhaps only one rule is missing in the ruleset, which does not allow metadata keys on the physSequence division:

<restriction division="physSequence" unspecified="forbidden">

    <permit division="page"/> <!-- paged media -->
    <permit division="track"/> <!-- a/v media -->
    <permit division="other"/> <!-- other media (like 3D objects) -->

    <!-- permit division="partitialOrder"/ --> <!-- possibly other divisions
                                                    if you work with two
                                                    structure trees / partial
                                                    orders -->
    <!-- NO METADATA keys at all! -->
</restriction>

I just checked. This does unfortunately not solve the issue (in Kitodo 3.6.2).

andre-hohmann commented 5 months ago

Thanks!

Then we should definitely apply your suggestion.

If the physSequence should be applied, all necessary requirements should be available.

matthias-ronge commented 5 months ago

I just tested this: The root division of the physical structure tree has no type internally. That makes sense to a certain extent, but then the restriction must read accordingly:

<restriction division="" unspecified="forbidden">
    <permit division="page"/>
    <permit division="track"/>
    <permit division="other"/>
</restriction>

It's definitely not intuitive, but it works for me.

Still, this is a bug; but I think it would be better to implement #4097 than to change the code here now.

matthias-ronge commented 5 months ago

I have an additional question: Where in the UI would the user be able to edit the metadata for the element physSequence even if we make it configurable via the ruleset? And what kind of metadata would be typically added to the physSequence? I am still struggling seeing a real use case for that.

Is this possible if you use two structure trees? As a non-user, of course I don't know what kind of metadata is recorded there. But maybe that makes sense. For example, the physical media type, like hardcover or paperback. Or a former sales price. Or for a video cassette, the video format NTSC, PAL or SECAM, the tape type VHS, VHS-C, S-VHS, maybe the tape length in minutes, etc. Just things that you want to capture for the entire physical medium.

BartChris commented 5 months ago

I just tested this: The root division of the physical structure tree has no type internally. That makes sense to a certain extent, but then the restriction must read accordingly:

<restriction division="" unspecified="forbidden">
    <permit division="page"/>
    <permit division="track"/>
    <permit division="other"/>
</restriction>

It's definitely not intuitive, but it works for me.

Still, this is a bug; but I think it would be better to implement #4097 than to change the code here now.

I tested your suggestion: Adding this restriction for divisions without a type prevents the addition of an additional DMDSec, so it could be used as a workaround. If it does not have any unwanted side effects of course.

Is this possible if you use two structure trees

Are you talking about the extended Structure tree? That makes sense, but it seems like it is not possible to edit it there. I have the processTitle in the DMDSec for the physSequence if i do not apply the ruleset restriction:

<mets:dmdSec ID="uuid-1f89f1fc-48d1-34d9-b02c-c9ba849a748b">
        <mets:mdWrap>
            <mets:xmlData>
                <kitodo:kitodo>
                    <kitodo:metadata name="processTitle">retro_123456_000001</kitodo:metadata>
                </kitodo:kitodo>
            </mets:xmlData>
        </mets:mdWrap>
    </mets:dmdSec>

The metadatum however does not appear in the editor.

image

I can also not add any metadatum to the "Division without type" in the editor, but this is a different issue i suppose...

BartChris commented 5 months ago

We can assert the following:

In my opinion, the current situation is not entirely satisfactory.

I propose that we extend the documentation and the example ruleset, advising institutions to implement the following restrictions to mitigate potential side effects. These restrictions should continue to be applied even after a fix for #4097:

<restriction division="" unspecified="forbidden">
    <permit division="page"/>
    <permit division="track"/>
    <permit division="other"/>
</restriction>

My advice would also be to enable restrictions on pages:

<restriction division="page" unspecified="forbidden">
     <permit key="ORDERLABEL" maxOccurs="1"/>
</restriction>

While Kitodo's flexibility is a significant advantage, I suggest considering whether it might be better not to allow the assignment of metadata to the physical structure (physSequence and page) by default. Metadata assignment to these elements should be explicitly enabled in the ruleset to avoid institutions inadvertently creating unwanted DMDSecs. A wrongly configured ruleset should not lead to side effects which might be discovered too late in the process.

andre-hohmann commented 5 months ago

Thanks a lot for the analysis and the summary!

We can assert the following:

  • ...

In my opinion, the current situation is not entirely satisfactory.

I absolutely agree with you and in my opinion, this analysis should be conducted and these question should posed, before the functionality is implemented. I'm surprised that there don't seem to be any specific use cases or requirements for this change, which causes quite a lot of effort to clarify.
As soon as possible, i will test the changes in the SLUB Dresden and hope that it will work for us as well.

I would also like to draw attention to the following issue that will probably not be resolved by adjusting the ruleset:

andre-hohmann commented 4 months ago

The workaround described in the following comment works for me as well and so far no negative side effects have been identified.

I will therefore remove the label blocking.

I will also close the issue because @BartChris has created a discussion with the current state of knowledge and in which the next steps are to be determined.

Depending on these results, a new issue should be created in which the necessary actions are described.