kitodo / kitodo-production

Kitodo.Production is a workflow management tool for mass digitization and is part of the Kitodo Digital Library Suite.
http://www.kitodo.org/software/kitodoproduction/
GNU General Public License v3.0

Complete the conversion from properties to metadata #4317

Open matthias-ronge opened 3 years ago

matthias-ronge commented 3 years ago

In Production 2, so-called properties were displayed and editable in each process:

Screenshot 2021-03-30 140530

They could also be edited in each task:

Screenshot 2021-03-30 140235

They could also be edited for a whole batch (theoretically, as this is buggy in version 2):

Screenshot 2021-03-30 141657

These properties could be configured using a separate configuration file goobi_processProperties.xml. The properties were essentially metadata, but process-specific properties could also be noted. In Production 2, the properties could be searched, while search for metadata (in file meta.xml) was not possible. Properties gave access to three categories of metadata, namely from="vorlage", from="werk" and from="prozess", which correspond to sourceMD, dmdSec and techMD in METS format.

In Production 3, all of this was meant to be consolidated as part of a standardization effort. The aim was to keep all metadata in the file meta.xml, which is described in a ruleset file. The ruleset contains so-called <acquisitionStage>s, which define when which metadata is to be recorded. The idea was that an acquisition stage can be performed in every workflow step. Searching for metadata in meta.xml is possible in Production 3 using the search engine function. However, the acquisition stages feature is incomplete because the development time ran out before development was finished. Currently, there is no replacement for the old properties functionality.

The generation of dockets and TIFF header files still depends on properties and has to be changed.

To-do

Estimated Cost and Complexity

This is a mid-range project of about 10 PT (person-days).

solth commented 3 years ago

The current implementation of Kitodo 3 also allows for displaying arbitrary property values in columns in the process and task list, so they are definitely still in use, even outside the Swiss project.

henning-gerhardt commented 3 years ago

Displaying these properties and using them are not the same thing. We (SLUB) display them because Kitodo.Production does so, but we do not use them (for example in script calls or elsewhere), since all of this property data is available in the metadata files too.

andre-hohmann commented 3 years ago

From the users' perspective, I want to point out some aspects that make the use of the properties difficult:

  1. In Kitodo.Production 2.x, the properties are split into "Physical templates" and "Workpieces". In Kitodo.Production 3.x, all properties seem to be stored in "Workpieces".
  2. In Kitodo.Production 2.x, the labels and the values of the properties are defined in the kitodo_projects.xml file. In some cases, they differ from the labels and values of the metadata. This is no longer possible in Kitodo.Production 3.x, which leads to some inconsistencies in the values and labels.
  3. New hierarchy-processes have no properties (#4271).

I do not rely on the current state as a basis for statistical analysis. Maybe there are other scenarios in which the properties can be used. I can imagine that it is difficult to explain the properties to new users, and I hope we can ignore them.

However, I strongly recommend a clear statement regarding the properties. Since the beginning of the project, it has been said both that they are not needed and that they are needed. This is very confusing, and I cannot explain it.

matthias-ronge commented 3 years ago

First, in general, this task is to complete a change process that was envisaged in the implementation of Production 3 and has not yet been completed. It is about the "standardization of the XML configuration" item, which was part of the development project. The fact that some things are still there or not yet there is caused by the fact that the development is not yet complete.

I would very much like to remove the old properties completely. It is a concession to you, @solth, that there is still one database table, because you used it in the Swiss project and said it cannot be removed because they depend on it. Nobody else should use it in Production 3, because it is something that by definition no longer exists and that should be replaced.

Commenting on @andre-hohmann's points:

  1. This is also possible in Production 3. There, the ruleset uses domain= with the values "source" (old: from="vorlage"), "description" (old: from="werk") and "technical" (old: from="prozess"). In addition, "rights" and "digitalProvenance" are possible, which did not exist in version 2.
  2. In Production 3, the labels and values are defined in the ruleset. Unifying all these things was one of the goals of the development of Production 3. If there are differences, they need to be cleaned up. This requires an intellectual decision about what you really want. If two things are different, they need to be stored in different fields with different labels (even if they are mapped to the same element during export). If two things are the same, choose one label that explains what that thing is. If you have a really good case for why the same thing needs a different label in the metadata editor than in the task, please give an example.
  3. … and they don’t need any as the properties are subject to removal.
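
For reference, the correspondences named in this thread can be written down as a small lookup table. This is only an illustrative sketch: the names `LEGACY_FROM_TO_DOMAIN`, `NEW_DOMAINS` and `domain_for_legacy_from` are hypothetical and not part of Kitodo.Production; only the attribute values and METS section names are taken from the comments above.

```python
# Old "from" values (version 2 properties) mapped to the ruleset "domain"
# value in Production 3 and to the METS section each corresponds to,
# as stated in this thread.
LEGACY_FROM_TO_DOMAIN = {
    "vorlage": ("source", "sourceMD"),
    "werk": ("description", "dmdSec"),
    "prozess": ("technical", "techMD"),
}

# Domains that exist only in Production 3 (no version 2 counterpart).
NEW_DOMAINS = ("rights", "digitalProvenance")


def domain_for_legacy_from(legacy_from):
    """Return the Production 3 domain for a version 2 `from` value."""
    try:
        domain, _mets_section = LEGACY_FROM_TO_DOMAIN[legacy_from]
    except KeyError:
        raise ValueError(f"unknown legacy 'from' value: {legacy_from!r}")
    return domain


print(domain_for_legacy_from("werk"))
```

A table like this could serve as a starting point when migrating old property definitions to ruleset entries by hand.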

I don’t understand your next sentence. What do you want to conduct statistical analysis on? On metadata? Can you give examples?

As far as I understand it, properties were necessary in version 2 because they are in the database and version 2 has no search engine, so in order to search for metadata, you had to copy it into the database. In version 3, there is a search engine for finding metadata, so metadata no longer needs to be copied into the database. Properties can be changed in version 2, but the change is not applied to the metadata, and subsequent changes to the metadata are not applied to the properties. This leads to inconsistencies and was never really clean to use. The only case in which properties are used in version 3, to my knowledge, is the Swiss project. It must be said that this is a separate fork, so source code functions that do not exist in Production 3 may access them there. Nevertheless, when I removed properties, I was told that we must not remove them completely because of the Swiss project. I respect that. @solth, please explain: is this still the current status, and is it still necessary?

andre-hohmann commented 3 years ago

> I don’t understand your next sentence. What do you want to conduct statistical analysis on? On metadata? Can you give examples?

We use the properties to retrieve for example:

For me, it is absolutely fine to remove the properties. We will not use them anymore, as it seems that the search engine is implemented in a usable way. If the properties are still necessary, it should be described how they should be applied and when they should not be applied.

I just want to point out that the current state is confusing, especially for new users.

stefanCCS commented 3 years ago

Let me also add a comment, please. First, I absolutely understand that the current situation is something that has been started but not completely finished. For example, looking at the database and seeing three different tables that link properties to processes (process_x_property, workpiece_x_property and template_x_property, the last of which looks like it is no longer used?!) is indeed confusing. Additionally, looking at the content of the properties and seeing that they mainly duplicate metadata also available in meta.xml is not very consistent either.

On the other hand, I can see perhaps a few properties that are not (directly) tied to meta.xml (e.g. the "Template" used). I also like the idea of having some "identifiers" or "main parameters" directly in the database (e.g. ID, title and DocType). Of course, these can be modelled directly in the process table (which has already been done in part).

To summarize my opinion: yes, I agree to clean up (remove) the properties. But please take care of parameters that are not (directly) related to meta.xml, or that are so important that it makes sense to model them as duplicates, like ID, title and DocType (maybe more?).

matthias-ronge commented 3 years ago

Yes, we take care of it. The ID of a process is in the database field process.id, the template is field process.template_id, and the title is in field process.title. DocType is metadata (precisely, the TYPE of the outermost <div> in the <structMap TYPE="LOGICAL"> ) and is therefore only in the METS file, which is currently intentional. It can be accessed from the code easily with processService.getBaseType(process). From external scripts, you may use regular expressions to get it, something like: <mets:structMap TYPE="LOGICAL">\s*<mets:div[^>]*?TYPE="([^">]*)"
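
As a rough illustration of the external-script approach, the regular expression above could be applied like this in Python. This is a minimal sketch, not part of Kitodo.Production: the function name `extract_doc_type` and the sample METS fragment (including its IDs and type names) are made up for the example, and the pattern only matches when the structMap start tag is written exactly as shown.

```python
import re

# Pattern quoted from the comment above: captures the TYPE attribute of the
# first <mets:div> directly following the logical structMap start tag.
DOCTYPE_PATTERN = re.compile(
    r'<mets:structMap TYPE="LOGICAL">\s*<mets:div[^>]*?TYPE="([^">]*)"'
)


def extract_doc_type(mets_xml):
    """Return the DocType found in the METS text, or None if nothing matches."""
    match = DOCTYPE_PATTERN.search(mets_xml)
    return match.group(1) if match else None


# Hypothetical, heavily shortened meta.xml content for illustration only.
SAMPLE = '''<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="Monograph">
      <mets:div ID="LOG_0001" TYPE="Chapter"/>
    </mets:div>
  </mets:structMap>
</mets:mets>'''

print(extract_doc_type(SAMPLE))
```

Because the match is purely textual, it breaks if the structMap element carries additional attributes or a different namespace prefix; a script that needs to be robust against such variations would be better off with an XML parser.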

matthias-ronge commented 4 months ago

> The current implementation of Kitodo 3 also allows for displaying arbitrary property values in columns in the process and task list, so they are definitely still in use, even outside the Swiss project.

In the process list and task list, it should be possible to display metadata instead of properties (and ideally sort by that, but that is outside the scope of this issue).

solth commented 4 months ago
