slub / ocrd_manager

frontend for ocrd_controller and adapter towards ocrd_kitodo
MIT License

allow re-running #26

Closed (bertsky closed 2 years ago)

bertsky commented 2 years ago

Fixes https://github.com/markusweigelt/kitodo_production_ocrd/issues/17

Note that due to https://github.com/OCR-D/core/issues/825 this still does not work completely, but there's nothing we can do about that on our side.

bertsky commented 2 years ago

Also, note that this does not cope with re-entering the same job again (which would also make it re-use the Controller-side REMOTE_DIR), only the same process. (We currently have no entry point for rerunning jobs from Kitodo.)

markusweigelt commented 2 years ago

@bertsky Should we create a make target to delete the files in the WORKDIR of the Manager or the Controller? Or should we add a parameter that forces deletion of the WORKDIRs before processing?

bertsky commented 2 years ago

> Also, note that this does not cope with re-entering the same job again (which would also make it re-use the Controller-side REMOTE_DIR), only the same process.

But thinking about it, it may at least help to name REMOTE_DIR without the variable $PID, only with $PROCESS_ID and $TASK_ID, which are constant and therefore should allow for rerunning the job (regardless of how this might be triggered). What do you think?
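To illustrate the idea, here is a minimal sketch in Python (not the Manager's actual script code) of a directory name built only from identifiers that stay constant across reruns. The function name `remote_dir_name`, the `job` prefix, the `_` separator and the character check are all assumptions made for this example.

```python
import re


def remote_dir_name(process_id: str, task_id: str, prefix: str = "job") -> str:
    """Return a directory name that depends only on constant identifiers.

    Hypothetical helper: the prefix and the "_" separator are illustrative
    choices, not the naming scheme used by the Manager.
    """
    parts = [prefix, str(process_id), str(task_id)]
    for part in parts:
        # Allow only letters, digits and hyphens, so that a part can never
        # contain a path separator or ".." and escape the parent directory.
        if not re.fullmatch(r"[A-Za-z0-9-]+", part):
            raise ValueError(f"unsafe path component: {part!r}")
    return "_".join(parts)


if __name__ == "__main__":
    # The same process and task always map to the same directory,
    # no matter which operating-system process computes the name.
    print(remote_dir_name("123", "7"))
```

Because nothing in the name changes between invocations, a rerun of the same job would land in the same REMOTE_DIR, while a different task of the same process would get its own directory.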

bertsky commented 2 years ago

> Should we create a make target to delete the files in the WORKDIR of the Manager or the Controller? Or should we add a parameter that forces deletion of the WORKDIRs before processing?

No, I wouldn't do that via makefile. There's a to-do in the comments to cron-schedule the removal on the Controller, which should suffice. And for the Manager, the same mechanism that removes finished process data should also be responsible for the ocr-d/ side, so that's outside of the Manager's scope.
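A scheduled cleanup of that kind could look roughly like the following Python sketch, which a cron job might call. The function name `remove_stale_dirs`, the age threshold and the use of the directory modification time are assumptions for illustration; the to-do mentioned above does not specify any of them.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_dirs(base: Path, max_age_days: float,
                      now: Optional[float] = None) -> List[Path]:
    """Delete direct subdirectories of `base` older than `max_age_days`.

    Hypothetical helper. Age is judged by the directory's own mtime, which
    only changes when entries are created or removed directly inside it.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for entry in sorted(base.iterdir()):
        # Skip plain files and symlinks; only real directories are removed.
        if entry.is_symlink() or not entry.is_dir():
            continue
        if entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)
            removed.append(entry)
    return removed
```

Keeping the removal in a separate scheduled step means the processing scripts themselves never delete data, which matches the position that cleanup is outside the Manager's scope.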

markusweigelt commented 2 years ago

> Also, note that this does not cope with re-entering the same job again (which would also make it re-use the Controller-side REMOTE_DIR), only the same process.
>
> But thinking about it, it may at least help to name REMOTE_DIR without the variable $PID, only with $PROCESS_ID and $TASK_ID, which are constant and therefore should allow for rerunning the job (regardless of how this might be triggered). What do you think?

I don't think $TASK_ID is a suitable value to put in the scope of the Controller, because it is specific to Kitodo.Production, and the Controller should be independent of the application that triggers the OCR process. I think it should be the $PROCESS_ID with an application prefix, e.g. "Production" or "Presentation", handed over by the Manager, or something else that distinguishes between the scripts, e.g. for_production, for_presentation ...

bertsky commented 2 years ago

> I don't think $TASK_ID is a suitable value to put in the scope of the Controller, because it is specific to Kitodo.Production, and the Controller should be independent of the application that triggers the OCR process. I think it should be the $PROCESS_ID with an application prefix, e.g. "Production" or "Presentation", handed over by the Manager, or something else that distinguishes between the scripts, e.g. for_production, for_presentation ...

It's not in the "scope" of the Controller, though. It's the Manager's choice. The Controller just gets a path name (ideally not conflicting with anything else). And I think that having TASK_ID in there is actually correct: Suppose you have a workflow with two places for OCR(-D): once for the page layout and text on the images, then some more steps in Production including export, and then again for document layout on the presentation METS. The second time must not clash with the first time, i.e. it should have different directories on the Controller.

markusweigelt commented 2 years ago

OK, that is a good point. I have to think about this a bit more, because tasks can be deleted etc., but for the moment it sounds like the best way.