kitodo / kitodo-production

Kitodo.Production is a workflow management tool for mass digitization and is part of the Kitodo Digital Library Suite.
http://www.kitodo.org/software/kitodoproduction/
GNU General Public License v3.0

Taskmanager: use persistence list of tasks #4136

Open henning-gerhardt opened 3 years ago

henning-gerhardt commented 3 years ago

The task manager implementations in 2.x and 3.x use non-persistent lists to hold and access the list of tasks. If any kind of error occurs and the application server (Tomcat, ...) must be restarted, all open tasks in this list are lost. Restoring these tasks is difficult and time-consuming, and not all tasks can be restored (e.g. a task that creates hundreds of newspaper-like processes).

matthias-ronge commented 3 years ago

For the latter, see also #3677

matthias-ronge commented 3 years ago

Persisting the task list is not straightforward, since the tasks are running Java threads.

Basically, there are two ways of approaching the requirement. The threads are already implemented so that they can be stopped and continued. For this purpose, their respective tasks are internally divided into the smallest sensible units of work, and the status is kept in object instance fields. If a thread is interrupted, it terminates after the current step has been completed. When it is restarted, a new Thread object is created and the object instance fields are copied to the new object.
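A minimal sketch of the stop-and-continue pattern described above. The class name `ResumableTaskSketch` and its fields are hypothetical illustrations, not the actual Kitodo task classes; the sketch only assumes that work is split into units and that the interrupt flag is checked between units.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical resumable task; not Kitodo's actual implementation. */
class ResumableTaskSketch implements Runnable {
    // Progress lives in instance fields so it outlives the thread that ran it.
    private final List<String> workUnits;
    private final List<String> done = new ArrayList<>();
    private int nextUnit;

    ResumableTaskSketch(List<String> workUnits) {
        this.workUnits = new ArrayList<>(workUnits);
    }

    /** Copy constructor: a restart copies the fields into a new object. */
    ResumableTaskSketch(ResumableTaskSketch stopped) {
        this.workUnits = stopped.workUnits;
        this.done.addAll(stopped.done);
        this.nextUnit = stopped.nextUnit;
    }

    @Override
    public void run() {
        while (nextUnit < workUnits.size()) {
            // One smallest sensible unit of work; it is never cut in half.
            done.add(workUnits.get(nextUnit));
            nextUnit++;
            // The interrupt flag is only honoured between two units.
            if (Thread.currentThread().isInterrupted()) {
                return;
            }
        }
    }

    boolean isFinished() {
        return nextUnit >= workUnits.size();
    }

    List<String> getDone() {
        return new ArrayList<>(done);
    }
}
```

To continue a stopped task, the caller would create `new Thread(new ResumableTaskSketch(stopped))` and start it; the copy picks up at `nextUnit`.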

We could change the task manager so that it interrupt()s every thread immediately after it starts. The thread then performs only one smallest step and exits. When all running threads have stopped, the task manager saves the status of all threads in a file. Then it lets each thread run one smallest step again. Of course, this assumes that the stop and start function works for every type of task, and the object instance fields must be serializable, which means, for example, no database objects, only their IDs.
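What "serializable, only IDs" could look like is sketched below with standard Java serialization. `TaskStateSketch`, `TaskListStoreSketch` and all field names are assumptions made for illustration; the real task classes hold different state.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical snapshot of one stopped task: plain values and IDs only. */
class TaskStateSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    final String taskType;
    final int processId; // database ID instead of the Hibernate entity itself
    final int nextUnit;  // index of the next smallest step to run

    TaskStateSketch(String taskType, int processId, int nextUnit) {
        this.taskType = taskType;
        this.processId = processId;
        this.nextUnit = nextUnit;
    }
}

/** Hypothetical helper that saves and reloads the list of task states. */
class TaskListStoreSketch {
    static void save(List<TaskStateSketch> states, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new ArrayList<>(states));
        }
    }

    @SuppressWarnings("unchecked")
    static List<TaskStateSketch> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (List<TaskStateSketch>) in.readObject();
        }
    }
}
```

After a restart, the task manager would have to look up each `processId` in the database again and rebuild a task object from the stored values. Note that a plain write like this can itself be cut off halfway if the server is killed.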

However, this approach would have many disadvantages. If Tomcat is killed, incomplete states can still remain in the system. I'm not sure how safe Hibernate is against saving inconsistent states if it is used as-is. (For example, if the objects have already been created for an n:n relation, but their links are not yet in the join table.) Or a directory for a process has been created along with some of its subdirectories, but not all, and no metadata file yet. In the worst case, a metadata file was half written and the byte stream was interrupted in the middle; maybe even a multi-byte UTF-8 character was only partly written.

That would also add a lot of overhead and slow down the tasks running at the same time, since each task takes only one minimal step and then waits until the slowest task has completed its minimal step. Persisting takes a little time on every round, too.

What you want is transaction processing. This must be designed to be failsafe in every step. That means a Hibernate change has to use the transaction API (which is available!). A file must be written to a temporary file and then renamed atomically to the target file. A control file must first be created for each step. When a process is restarted after an interruption, each control file must be checked to see whether its step was only partially processed; incomplete results must be deleted (regardless of how far the processing got) and the step started over. Only when everything for the step is complete can the control file be deleted again. This is all possible, but it is very cumbersome.
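The file-related part of this could be sketched as follows, using only `java.nio.file`. The class `FailsafeStepSketch`, the `.tmp` and `.control` suffixes and the one-file-per-step layout are assumptions for illustration. The sketch does not cover the Hibernate side or forcing data to disk.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical failsafe step: control file plus write-then-rename. */
class FailsafeStepSketch {

    /** Writes to a temporary sibling file, then renames it onto the target. */
    static void writeAtomically(Path target, String content) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        // Default options create the file or truncate a leftover one.
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        // Throws AtomicMoveNotSupportedException if the file system cannot
        // rename atomically. Replacing an existing target is platform-specific.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Runs one step that produces a single file, guarded by a control file. */
    static void runStep(Path target, String content) throws IOException {
        Path control = target.resolveSibling(target.getFileName() + ".control");
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        if (Files.exists(control)) {
            // An earlier run was interrupted: delete whatever it left behind,
            // regardless of how far it got, and start the step over.
            Files.deleteIfExists(target);
            Files.deleteIfExists(tmp);
        } else {
            Files.createFile(control);
        }
        writeAtomically(target, content);
        // Only when the step is complete is the control file removed.
        Files.delete(control);
    }
}
```

A caller would invoke `FailsafeStepSketch.runStep(path, content)` both on the first attempt and after a restart; a control file that is still present marks the step as unfinished.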

You should also note that not only the task manager has running threads, but also the workflow engine, for example when it is waiting for a shell script to finish. (Depending on the implementation of the shell script, this can take hours or even days.)

henning-gerhardt commented 3 years ago

Thank you, Matthias. Your suggestion is one possible solution, and I think there are many more. I do not want to implement this from scratch; I would rather use established tools / libraries / frameworks that have already done the groundwork and fixed a lot of possible bugs.

matthias-ronge commented 3 years ago

I would suggest closing this ticket as invalid. What you are proposing is not possible in the current system. The alternative of re-implementing it is already covered by issue #4137. Do you agree with that?

henning-gerhardt commented 3 years ago

No, as this is a standalone issue in my opinion and should be considered only as one aspect of #4137.