mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0
732 stars 130 forks source link

use multiple language in ocr #230

Closed josef821 closed 3 years ago

josef821 commented 3 years ago

hi. i have two model for two language : English French Both languages are used in my pictures and i want to ocr them. Part in English and part in French how can i use them same time ? i run this command but is show error : kraken -i w2.png image.txt binarize segment ocr -m fr.mlmodel -m en_best.mlmodel note : i trained this two model myself. ERROR : Error: Invalid value for "-m" / "--model": Mappings must be in format script:model

example image : https://www.researchgate.net/profile/Debra_Titone/publication/269876498/figure/tbl1/AS:601775467925516@1520485860767/Sample-Experimental-Sentences-in-English-and-French.png

mittagessen commented 3 years ago

The problem is that you need a segmenter that separates the languages first to then run the recognizer with different models on them. Your example is a bit difficult as the typefaces are exactly the same and even if not you've got two languages in the same script which is probably easier dealt with by a combined model for English and French.

The general syntax for multi-model recognition is:

-m $script:$model, e.g. -m french:model_1 -m english:model_2

but as mentioned this requires a segmenter that produces these identifiers (or an XML input file which has the appropriate identifiers for each line).

All Best, Ben

josef821 commented 3 years ago

thanks for quick reply, in this example yes, but in something like this : djgfd in some line two language will combine or in some pages like this : largepreview arabic and english together. can you explain more or send a example XML file for every line ( however it's not working for combined two language text in one line ) ? how can i do this. thank you

mittagessen commented 3 years ago

So this is going to be a bit long. Originally, the multi-model facility was designed for the old segmenter which had a model that was able to distinguish between different scripts inside a line (Arabic, Latin, Syriac, ...) and would allow you to run different models on each of those runs. Unfortunately, that broke with the new pytorch backend whose CTC implementation doesn't permit the segmentation into runs (the activations are different).

There is a new segmenter that uses baselines and is fully trainable. This one is able to be trained to output different types for lines based on Alto/PageXML training data. The recognizer also changed to accomodate the new baseline format and the recognition models trained on the old bounding boxes and the new baselines are not interchangeable. An explanation of the baseline format is here [0] (a bit specific on Arabic but the gist should be clear). A template for the XML format is in docs/pagexml.xml including the types. The format is the same for training the segmenter and feeding data into the recognizer directly side-stepping the segmenter. The new segmenter isn't extensively documented yet but the command ketos segtrain works more or less the same as ketos train, just taking XML files instead of the line strips. Likewise you can train recognition models directly with ketos train now, just add the -f xml option to switch from the old path format.

I hope that helps.

[0] https://arxiv.org/pdf/1907.04041

josef821 commented 3 years ago

can you share example command for this description? In practice, I still can not figure out exactly what to do? can i do this train with kraken to recognize language of line? please share example code or command. thank you.

josef821 commented 3 years ago

can i train two language in one model file? or combine two model file? for example i create grand-truth images and use multiple language. some english line (ltr) and some arabic line (rtl)?