slub / ocrd_manager

frontend for ocrd_controller and adapter towards ocrd_kitodo
MIT License

default workflow(s): utilise --lang and --script info #67

Open bertsky opened 11 months ago

bertsky commented 11 months ago

For example, with ocrd-tesserocr-recognize we could do something like:

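# Match language and script names case-insensitively (bash-specific option)
shopt -s nocasematch
# Fall back to English when no language matches
MODEL=eng
case "$LANGUAGE" in
  de|deu|ger) MODEL=deu
    case "$SCRIPT" in
      Fraktur) MODEL=frak2021;;
      # ...
    esac;;
  fr|fre|fra) MODEL=fra;;
  hsb) MODEL=hsb
    case "$SCRIPT" in
      Fraktur) MODEL=hsbfraktur;;
      # ...
    esac;;
  # ...
esac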
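
As a rough sketch of how this mapping could be packaged for reuse, it might be wrapped in a function that prints the model name. The function name `select_model` is hypothetical, only the branches shown above are included, and lowercasing with `tr` stands in for `shopt -s nocasematch` so that the sketch also runs under plain `sh`:

```shell
# Hypothetical helper: print a Tesseract model name for a language and script.
# The body runs in a subshell, so its variables do not leak into the caller.
select_model() (
  # Lowercase both inputs so that matching is case-insensitive
  language=$(printf '%s' "${1-}" | tr '[:upper:]' '[:lower:]')
  script=$(printf '%s' "${2-}" | tr '[:upper:]' '[:lower:]')
  model=eng  # fallback when no language matches
  case "$language" in
    de|deu|ger) model=deu
      case "$script" in
        fraktur) model=frak2021;;
      esac;;
    fr|fre|fra) model=fra;;
    hsb) model=hsb
      case "$script" in
        fraktur) model=hsbfraktur;;
      esac;;
  esac
  printf '%s\n' "$model"
)
```

The result could then be passed on to the processor, e.g. `-P model "$(select_model "$LANGUAGE" "$SCRIPT")"`.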

The question is: do we apply this only when no --workflow is supplied, or should we assume that any workflow file may itself contain placeholders (e.g. $TESSMODEL) that we must replace on the fly?
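
To make the second option concrete, here is a minimal sketch of on-the-fly replacement. It assumes the workflow is a plain text file and that only the literal placeholder `$TESSMODEL` (or `${TESSMODEL}`) needs expanding. The function name `render_workflow` is hypothetical, and the sketch further assumes that model names contain no `|`, `&` or backslash, since those would be special to `sed`:

```shell
# Hypothetical helper: write a workflow file to stdout with the
# $TESSMODEL placeholder replaced by the given model name.
#   $1: path of the workflow template
#   $2: model name to insert
render_workflow() {
  # [$] matches a literal dollar sign; the braced form is handled first.
  # All other variables in the file are left untouched.
  sed -e 's|[$]{TESSMODEL}|'"$2"'|g' \
      -e 's|[$]TESSMODEL|'"$2"'|g' "$1"
}
```

Replacing only the named placeholder, rather than running the whole file through the shell, keeps any other `$` expressions in the workflow intact.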

bertsky commented 7 months ago

At least for ocrd-tesserocr-recognize, we can also parameterise the model dynamically via XPath queries, using the xpath_model parameter, e.g.:

{
"contains('de,deu,ger',@language) and starts-with(@script,'Latf')": "frak2021",
"contains('fr,fre,fra',@language)": "fra",
"@language='hsb' and starts-with(@script,'Latf')": "hsbfraktur",
"@language='hsb'": "hsblatin",
"": "eng"
}

And as a workaround for the missing MODS inheritance, we could simply write a dummy processor that fills the respective PAGE attributes from MODS...