kitodo / kitodo-production

Kitodo.Production is a workflow management tool for mass digitization and is part of the Kitodo Digital Library Suite.
http://www.kitodo.org/software/kitodoproduction/
GNU General Public License v3.0

Restore the functionality to create processes using ActiveMQ #4837

Open BartChris opened 2 years ago

BartChris commented 2 years ago

Hello,

it would be good if the functionality to create processes using ActiveMQ could be restored by reactivating and updating the old CreateNewProcessProcessor:

https://github.com/kitodo/kitodo-production/blob/dc9f6fbf8636e884fc9a283a445cc04fab4d2d02/Kitodo/src/main/java/org/kitodo/production/interfaces/activemq/CreateNewProcessProcessor.java

aetherfaerber commented 1 year ago

As I already explained in https://github.com/kitodo/kitodo-production/issues/5572#issuecomment-1456354824 we need this functionality. I see active development regarding ActiveMQ over at https://github.com/kitodo/kitodo-production/pull/5657 but no prominent mention of the Process Creation Service. Will the service be restored as a byproduct of this development?

If not, is there any documentation or, even better, an assessment of why the service could not be restored in Kitodo 3? What are the hurdles, and how much work would be needed to restore it?

matthias-ronge commented 10 months ago

After discussing the topic in the BarCamp at this year's Kitodo Practical Meeting, the following requirements emerged in detail:

Example:

{ "project": 88,
  "template": 123,
  "import": [
    { "importconfiguration": 7,
      "value": "1234567X" },
    { "importconfiguration": 9,
      "value": "Hs_sig_08.225.1234" }
  ],
  "title": "LoremIpsum_1234567X",
  "parent": 22,
  "metadata": {
    "singleDigCollection": ["Kollektion 1", "Kollektion 2"],
    "person": [
      { "role": ["aut", "edt"],
        "firstName": "Michael",
        "lastName": "Schmidt"
      },
      { "role": ["egr"],
        "firstName": ["Johannes"],
        "lastName": ["Römer"]
      }
    ]
  }
}

Here, { } stands for a javax.jms.MapMessage, [ ] for a java.util.List. (Putting lists and maps as objects in a map is an extension of the JMS standard provided by Apache ActiveMQ.)

A separate map message must be sent via ActiveMQ for each process to be created. Successful process creation, or an error, is reported on a results topic. General rule of thumb: if something doesn't fit together, no process is created, but a helpful error message is sent on the results topic.
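To make the "no process on any mismatch" rule concrete, here is a minimal Python sketch of a structural pre-check for such a message. It is not Kitodo code: the function name `validate_request` is hypothetical, and it assumes that IDs are integers (as in the example) and that, as stated elsewhere in this thread, `project` and `template` are mandatory while each `import` entry needs an `importconfiguration` and a `value`.

```python
def validate_request(message):
    """Return a list of problems; an empty list means the structure looks usable.

    Hypothetical helper for illustration only, not part of Kitodo.Production.
    """
    errors = []
    # Assumption: project and template are given as integer IDs, as in the example.
    for key in ("project", "template"):
        if not isinstance(message.get(key), int):
            errors.append(f"'{key}' is mandatory and must be an integer ID")
    # Zero, one or more catalog imports are allowed; each needs both fields.
    for index, entry in enumerate(message.get("import", [])):
        if not isinstance(entry, dict):
            errors.append(f"import[{index}] must be a map")
            continue
        for key in ("importconfiguration", "value"):
            if key not in entry:
                errors.append(f"import[{index}] is missing '{key}'")
    return errors
```

A sender could run such a check before publishing, so that obviously malformed requests never reach the queue. The authoritative check would still happen on the Kitodo side.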

BartChris commented 10 months ago

I am not exactly sure how complex the ActiveMQ JMS message objects can be when they are constructed in languages other than Java, but for anyone who wants to give it a try, here is the general procedure for using Python with the package stomp.py (https://github.com/jasonrbriggs/stomp.py):

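import json

import stomp

# Placeholders: replace with your broker credentials and the Kitodo task ID.
username = "[username]"
password = "[password]"
kitodo_task_id = "[Kitodo_Task_id]"


def to_map_json(obj):
    """Serialize a flat dict into the JSON form used by the jms-map-json transformation."""
    return json.dumps(
        {
            "map": {
                "entry": [
                    # Keys and values are sent as strings.
                    {"string": [str(key), str(value)]} for key, value in obj.items()
                ]
            }
        }
    )


# Connect to the STOMP port of the ActiveMQ broker.
conn = stomp.Connection([("localhost", 61613)], auto_content_length=False)
conn.connect(username, password, wait=True)

conn.send(
    # hardcoded queue
    destination="/queue/KitodoProduction.FinalizeStep.Queue",
    body=to_map_json({"id": kitodo_task_id}),
    # Ask the broker to convert the JSON body into a JMS MapMessage.
    transformation="jms-map-json",
)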
henning-gerhardt commented 10 months ago

It is also possible to create such ActiveMQ JMS messages in PHP. I use the stomp-php package for this, and the code is similar to the Python solution. So I think it is possible to use different languages to create the necessary ActiveMQ JMS message.

aetherfaerber commented 10 months ago

Thanks a lot for writing this very clear summary. It fits the requirements quite well. Nonetheless, I have a few remarks, questions and additions from my notes:

After discussing the topic in the BarCamp at this year's Kitodo Practical Meeting, the following requirements emerged in detail:

  • Mandatory information is the project ID and the process template.
  • Catalog imports can be specified (none, one or more). An importconfiguration and a search value must be specified. The search is carried out in the default search field. If no hit is found, or more than one, the search aborts with an error message. In the case of multiple imports, a repeated import is carried out according to the procedure specified in the rule set.

IIRC we did not explicitly discuss the case of no catalog import or more than one being specified. Both seem welcome additions but, again according to my notes, have not been discussed as requirements. Nonetheless, a few questions:

  • A process title can optionally be specified. If it is specified explicitly, exactly this process title is used, otherwise the system creates the process title according to the configured rule. The process title must still be unused for the client who owns the project.

IIRC (I have no notes on this) we discussed that technically Kitodo would also prevent you from creating a process of the same title that is already in use by another client. Having the possibility to create processes of identical name for different clients would be nice and provide better multitenancy but is out of scope here. So if I'm not mistaken it should be “The process title must still be unused.” Since we decided that processes with errors in imports should not be created at all I would conclude that processes with a supplied process name that is already in use should also not be created at all, even if a process title that is not already in use could be derived through the configured rule.

  • A parent process can optionally be specified. The process ID or the process title can be specified. (If the value is all digits (^\d+$), it is considered the process ID, else it is considered the process title.) The process must be found in the client’s processes. If no parent process is specified, but a metadata entry with a use="higherLevelIdentifier" is included in the data from the catalog, the parent process is searched for using the metadata entry with use="recordIdentifier". It must already exist for the client. No parent process is implicitly created. The child process is added at the last position in the parent process.

According to the notes I made during the session, we also explicitly decided that the first check should be on the process ID. I.e., if both a process ID and a process name are provided, the process ID should be looked up first. As I understood our decision from the workshop, if a provided identifier cannot be found in the system, the process creation should be aborted. That means that even if a higher-level identifier could be found, no process would be created if a process name was provided but could not be found. I don't entirely understand the reasoning behind this, so it would be nice if someone could comment on it. Maybe I just misunderstood when the process should fail.

We also discussed that in the case that a superordinate process was found but is part of another project it is possible that the process could not be created as subordinate process since access rights to the other project are missing. In this case a process should be created without connection to the supposed superordinate process. (In accordance with the current behaviour of the mass import feature.)

I find that contradictory to the decision described above and would personally prefer that, if a provided process ID cannot be found but a provided process name can be matched, the process be created as a subordinate process of the process with the provided name.

@BartChris @apiller since you were also part of the group maybe you could comment on this?

  • Additionally metadata can be passed. Passing multiple metadata or passing grouped metadata is also possible.


aetherfaerber commented 10 months ago

Here are the flipchart notes I wrote during the session

matthias-ronge commented 10 months ago

if a provided process ID cannot be found but a provided process name can be matched, the process be created as a subordinate process of the process with the provided name

The parent field can be either a process ID or a process title, but not both. If it is a sequence of numbers, it is a process ID, otherwise it is a process title. If the parent process is found using this characteristic, the new process is created as a child. If the parent process is not found, no process is created and an error message appears.
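The ID-versus-title rule could be sketched in Python as follows. The function name `classify_parent` is hypothetical and only illustrates the rule. `re.fullmatch` is used here in place of `^\d+$` so that a trailing newline cannot slip through.

```python
import re


def classify_parent(parent):
    """Classify the 'parent' value as a process ID or a process title.

    Hypothetical helper: all digits means process ID, anything else a title.
    """
    text = str(parent)
    if re.fullmatch(r"\d+", text):
        return ("id", int(text))
    return ("title", text)
```

For the example message, `classify_parent(22)` yields `("id", 22)`, whereas a value such as `"LoremIpsum_1234567X"` is treated as a title.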

aetherfaerber commented 10 months ago

  • An additional import will merge metadata from several sources into one process, according to the behaviour defined in the <editing> section of the ruleset in use (cf. Make behaviour of repeated import configurable #5613). This is also available when creating a process manually ("additional import" switch at the bottom of the import dialog). For example, you can fetch a stub record from the main catalog and add metadata from a rich database.

Thank you for this explanation. We haven't discovered this feature yet and don't have a use case for it at the moment.

  • AFAIK, the database does not automatically reject duplicate titles (regardless of the client). However, within the process creation service, a search is conducted to prevent the same process title from being created twice for the same client.

Yes, I think that the mechanics behind that were also discussed during the session. Anyway, you cannot create a process (via mass import) with a title that is already in use. I'm not really concerned right now if this is possible for different clients.
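A minimal Python sketch of the title handling discussed here, under stated assumptions: `choose_title`, `generate_title` and `title_in_use` are hypothetical names, and whether the uniqueness search is per client or global is deliberately left to whatever `title_in_use` does. Rejecting an explicitly supplied but already used title, instead of falling back to the rule-based title, follows the suggestion made earlier in this thread and is not a confirmed decision.

```python
def choose_title(message, generate_title, title_in_use):
    """Pick the process title or raise ValueError if it is already taken.

    Hypothetical helper; generate_title and title_in_use are callables
    supplied by the caller and stand in for Kitodo's own logic.
    """
    explicit = message.get("title")
    # An explicitly given title is used exactly as is; otherwise apply the rule.
    title = explicit if explicit else generate_title()
    if title_in_use(title):
        # No fallback: the request fails and no process is created.
        raise ValueError(f"process title already in use: {title}")
    return title
```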

  • I remember that we addressed the "no catalog import" and "multiple imports" cases, and they should behave as I outlined. In this case, a process title must be given, or it must be possible to build one based on rules from the given metadata.

Good, I must have missed that.

  • If there is no functional metadata of use="docType", we haven't discussed that. When creating a process manually, the alphabetically first Doc Type is preselected and used, if not changed. However, IMHO it would make sense to be an error case.

I agree that this should be an error case if docType is a required field.

  • As far as I remember, there should be no check at this point whether the parent process is in a different project. The question of adding permissions to better control this behaviour was discussed in this BarCamp group, but I see it as out of scope for this issue. Since the system behaviour was changed to allow cross-project dependencies between processes, this should also be allowed for remote activity.

Yes, as with possible authentication systems we briefly discussed this but found it to be out of scope.

if a provided process ID cannot be found but a provided process name can be matched, the process be created as a subordinate process of the process with the provided name

The parent field can be either a process ID or a process title, but not both. If it is a sequence of numbers, it is a process ID, otherwise it is a process title. If the parent process is found using this characteristic, the new process is created as a child. If the parent process is not found, no process is created and an error message appears.

Sorry for being unclear about that. Let's assume then that a process ID is transmitted via the ‘parent’ field and there is also a metadata entry in the field that serves as higherLevelIdentifier. We decided that the ‘parent’ field should be checked first. But what happens if the ID cannot be found? Does the process creation service stop, or will the higherLevelIdentifier be used?

matthias-ronge commented 10 months ago

Let's assume then that a process ID is transmitted via the ‘parent’ field and there is also a metadata entry in the field that serves as higherLevelIdentifier. We decided that the ‘parent’ field should be checked first. But what happens if the ID cannot be found?

The process will not be created. The value from the higherLevelIdentifier will only be used if no ‘parent’ field is given.
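The resulting lookup order could be sketched in Python like this. All names (`resolve_parent`, `ParentNotFound`, the two lookup callables) are hypothetical. Treating a failed catalog-based lookup as an error is an assumption based on the "must already exist" wording of the requirements, not something confirmed in this answer.

```python
class ParentNotFound(Exception):
    """Raised when a requested parent process cannot be found (hypothetical)."""


def resolve_parent(message, higher_level_identifier, find_by_reference, find_by_record_identifier):
    """Return the parent process, or None if no parent is requested at all.

    find_by_reference and find_by_record_identifier are caller-supplied
    callables that return a process or None; they stand in for Kitodo lookups.
    """
    if "parent" in message:
        # An explicit 'parent' field always wins; there is no fallback.
        parent = find_by_reference(message["parent"])
        if parent is None:
            raise ParentNotFound(f"parent not found: {message['parent']}")
        return parent
    if higher_level_identifier is not None:
        # Only consulted when no 'parent' field was given.
        parent = find_by_record_identifier(higher_level_identifier)
        if parent is None:
            raise ParentNotFound(f"no process with record identifier {higher_level_identifier}")
        return parent
    return None
```

When `ParentNotFound` is raised, the caller would skip process creation and report the error on the results topic.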

aetherfaerber commented 10 months ago

Let's assume then that a process ID is transmitted via the ‘parent’ field and there is also a metadata entry in the field that serves as higherLevelIdentifier. We decided that the ‘parent’ field should be checked first. But what happens if the ID cannot be found?

The process will not be created. The value from the higherLevelIdentifier will only be used if no ‘parent’ field is given.

I'm okay with this behaviour. I just couldn't really make out the reasoning behind it and wanted to make sure I understood and summarized correctly. Thank you for confirming this assumption.